
conformance_test.go exits with code 0 even when a non-last test suite has failures #2265



Closed
bingbing8 opened this issue Apr 27, 2022 · 32 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bingbing8

bingbing8 commented Apr 27, 2022

/kind bug


What steps did you take and what happened:
./conformance_test.go

What did you expect to happen:
Normally 2 test suites run (junit.e2e_suite.1.xml is the last one). If the first suite has failures but the second suite (junit.e2e_suite.1.xml) passes, the exit code of conformance_test.go is 0. I expect the script to exit with an error whenever any of the test suites has failures.

Anything else you would like to add:
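For illustration, a minimal sketch (not the project's actual code; the artifact location and the flat <testsuite> junit schema are assumptions) of the aggregation behavior expected here: scan every junit report from the run and exit non-zero if any suite recorded failures or errors.

    package main

    import (
    	"encoding/xml"
    	"fmt"
    	"os"
    	"path/filepath"
    )

    // suite captures the junit attributes relevant to pass/fail aggregation.
    type suite struct {
    	Name     string `xml:"name,attr"`
    	Failures int    `xml:"failures,attr"`
    	Errors   int    `xml:"errors,attr"`
    }

    func main() {
    	// Assumed artifact layout; real runs may place reports elsewhere.
    	reports, err := filepath.Glob("_artifacts/junit.e2e_suite.*.xml")
    	if err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}
    	exitCode := 0
    	for _, path := range reports {
    		data, err := os.ReadFile(path)
    		if err != nil {
    			fmt.Fprintln(os.Stderr, err)
    			exitCode = 1
    			continue
    		}
    		var s suite
    		if err := xml.Unmarshal(data, &s); err != nil {
    			fmt.Fprintln(os.Stderr, err)
    			exitCode = 1
    			continue
    		}
    		if s.Failures > 0 || s.Errors > 0 {
    			fmt.Printf("%s: %d failures, %d errors\n", path, s.Failures, s.Errors)
    			exitCode = 1 // any failing suite fails the whole run
    		}
    	}
    	os.Exit(exitCode)
    }

Wired into CI, any failing suite would then flip the job red regardless of which suite ran last.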

Environment:

  • cluster-api-provider-azure version: 1.2
  • Kubernetes version: (use kubectl version): 1.23
  • OS (e.g. from /etc/os-release): ws2022
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 27, 2022
@bingbing8 bingbing8 changed the title conformance_test.go exit code to 0 when the last test suite passes → conformance_test.go exits with code 0 even when a non-last test suite has failures Apr 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@jackfrancis
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 24, 2022
@CecileRobertMichon
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 28, 2022
@jackfrancis
Contributor

/assign

@jackfrancis jackfrancis added this to the v1.7 milestone Nov 28, 2022
@jackfrancis
Contributor

@bingbing8 I'd like to ensure we address this. In your original report you mention 2 test suites; do you mean the one from upstream Kubernetes and the one from capz?

@CecileRobertMichon
Contributor

@jackfrancis (when you're back in the new year): here is an example of this in practice: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2941/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1603522577655926784

The GitHub status reports success, even though the conformance suite had 1 failed spec.

[screenshot: green GitHub status on the PR despite the failed conformance spec]

@jackfrancis
Contributor

After an initial investigation, it appears that cluster-api itself wraps kubetest inside a container execution lifecycle. And the above test output suggests that the underlying container, though its runtime reported 1 error, did not return a non-zero exit code.

Specifically, I'm not seeing evidence that this conditional path was reached:

https://github.com/kubernetes-sigs/cluster-api/blob/v1.3.2/test/infrastructure/container/docker.go#L500-L502

If that had been reached, then the capz test runner would get this back:

https://github.com/kubernetes-sigs/cluster-api/blob/v1.3.2/test/framework/kubetest/run.go#L194-L196

Which would trigger this:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/v1.7.1/test/e2e/conformance_test.go#L238

One thing that is interesting is that cluster-api wraps the error response from the RunContainer invocation in its kubetest package with this message: "Unable to run conformance tests". Does that suggest that whoever implemented this understood that running a kubetest E2E flow in this way would always return zero after execution, regardless of the error response from the actual kubetest runtime? Are we meant to be introspecting for terminal kubetest execution state in some other way?
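For reference, the propagation chain those three links describe, reduced to a runnable toy (the function names are stand-ins, not the actual cluster-api identifiers; the error strings match the ones quoted later in this thread):

    package main

    import (
    	"fmt"

    	"github.com/pkg/errors"
    )

    // runContainer stands in for cluster-api's docker.go behavior: a non-zero
    // container exit code is surfaced as an error rather than swallowed.
    func runContainer(exitCode int) error {
    	if exitCode != 0 {
    		return errors.Errorf("error container run failed with exit code %d", exitCode)
    	}
    	return nil
    }

    // runKubetest stands in for the kubetest wrapper, which adds its own
    // context on the way up.
    func runKubetest(exitCode int) error {
    	if err := runContainer(exitCode); err != nil {
    		return errors.Wrap(err, "Unable to run conformance tests")
    	}
    	return nil
    }

    func main() {
    	// A failing container produces the exact wrapped message the capz
    	// conformance spec would then fail on.
    	fmt.Println(runKubetest(1))
    }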

cc @sbueringer @fabriziopandini in case you have any context here

@sbueringer
Member

sbueringer commented Feb 2, 2023

We have the same issue here: kubernetes-sigs/cluster-api#5123 (comment)

Essentially we never look into the junit report generated by the conformance tests. So we also don't fail our e2e test if there were errors in the conformance tests.

I documented how to easily reproduce it. I think essentially we just need someone to pick up the work.

@jackfrancis
Contributor

@sbueringer that sounds like a lower-level explanation that would produce these outcomes.

Let's keep this CAPZ issue open and direct our practical energies toward fixing kubernetes-sigs/cluster-api#5123

@jackfrancis
Contributor

So I ran a follow-up local test, and got an error during the conformance test:

    [FAILED] error while waiting for pod svcaccounts-2356/oidc-discovery-validator to be Succeeded or Failed: pod "oidc-discovery-validator" failed with status: {Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2023-02-02 20:27:57 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2023-02-02 20:28:14 +0000 UTC Reason:PodFailed Message:} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2023-02-02 20:28:14 +0000 UTC Reason:PodFailed Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2023-02-02 20:27:57 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.1.0.5 PodIP:192.168.12.82 PodIPs:[{IP:192.168.12.82}] StartTime:2023-02-02 20:27:57 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:oidc-discovery-validator State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2023-02-02 20:28:01 +0000 UTC,FinishedAt:2023-02-02 20:28:12 +0000 UTC,ContainerID:containerd://ffadf26019d4692c5095d9f6399d9676437db1c29622d144365b902d91173bd9,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:registry.k8s.io/e2e-test-images/agnhost:2.43 ImageID:registry.k8s.io/e2e-test-images/agnhost@sha256:16bbf38c463a4223d8cfe4da12bc61010b082a79b4bb003e2d3ba3ece5dd5f9e ContainerID:containerd://ffadf26019d4692c5095d9f6399d9676437db1c29622d144365b902d91173bd9 Started:0xc003584c1d}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
    In [It] at: test/e2e/auth/service_accounts.go:637 @ 02/02/23 20:28:48.077

That was eventually bubbled up as follows:

  Summarizing 1 Failure:
    [FAIL] [sig-auth] ServiceAccounts [It] ServiceAccountIssuerDiscovery should support OIDC discovery of service account issuer [Conformance]
    test/e2e/auth/service_accounts.go:637
  
  Ran 338 of 7084 Specs in 2168.955 seconds
  FAIL! -- 337 Passed | 1 Failed | 0 Pending | 6746 Skipped

And then:

  [FAILED] in [It] - /Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:238 @ 02/02/23 13:02:03.43
• [FAILED] [4064.682 seconds]
Conformance Tests [It] conformance-tests
/Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:98

  [FAILED] Unexpected error:
      <*errors.withStack | 0xc0009e64c8>: {
          error: <*errors.withMessage | 0xc001908840>{
              cause: <*errors.errorString | 0xc000680710>{
                  s: "error container run failed with exit code 1",
              },
              msg: "Unable to run conformance tests",
          },
          stack: [0x5f43e59, 0x61f2927, 0x44e34fb, 0x44f79f8, 0x4077281],
      }
      Unable to run conformance tests: error container run failed with exit code 1
  occurred
  In [It] at: /Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:238 @ 02/02/23 13:02:03.43

  Full Stack Trace
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.glob..func3.2()
    	/Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:238 +0x18fa

My CLI session confirmed:

    $ echo $?
    2

Also worth noting that my comment above about "the kubetest container will always return zero" is definitely not correct:

Unable to run conformance tests: error container run failed with exit code 1

It's arguable that "Unable to run conformance tests" is not the correct wrapping text, but regardless, the underlying error it wraps ("error container run failed with exit code 1") suggests that capi's kubetest abstractions are honoring the underlying kubetest runtime exit code.

So... I'm kind of confused.

@CecileRobertMichon
Contributor

Did you check whether the failures show up in the junit reports? Ultimately that's what gets bubbled up to Testgrid as the "green" pass

@jackfrancis
Contributor

junit.zip

cc @CecileRobertMichon

@CecileRobertMichon
Contributor

So it looks like one has SuiteSucceeded true while the other has false? Seems like that may be the issue

@jackfrancis
Contributor

In which case, would this be a Testgrid bug? (Or maybe we need to always publish only one junit file, or maybe we're misnaming the junit file?)

@sbueringer
Member

sbueringer commented Feb 3, 2023

Did you check whether the failures show up in the junit reports? Ultimately that's what gets bubbled up to Testgrid as the "green" pass

I think this is not correct. While Testgrid shows the junit results, the junit results don't control whether the test itself succeeds or not. That depends on what our e2e test reports back via its exit code. This is then stored in finished.json and shown e.g. by Spyglass (the "per-job" UI)
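For context, a sketch of that verdict (illustrative only; the fields follow prow's well-known finished.json format, but this is not prow source code):

    package main

    import (
    	"encoding/json"
    	"fmt"
    )

    // finished models the fields prow records in finished.json. The verdict
    // is derived from the e2e process exit code; junit files do not feed
    // into it.
    type finished struct {
    	Timestamp int64  `json:"timestamp"`
    	Passed    bool   `json:"passed"`
    	Result    string `json:"result"` // "SUCCESS" or "FAILURE"
    }

    func main() {
    	exitCode := 0 // what the test process returned, regardless of junit content
    	result := "FAILURE"
    	if exitCode == 0 {
    		result = "SUCCESS"
    	}
    	out, _ := json.Marshal(finished{Timestamp: 1675400000, Passed: exitCode == 0, Result: result})
    	fmt.Println(string(out)) // the "green" that Spyglass and the job status show
    }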

@sbueringer
Member

I wonder if you hit something else in #2265 (comment).

This job reports something else: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2941/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1603522577655926784

You see the error logs from the conformance test and then our test returns PASS. And that happens, as I described above, because our test framework does not read the junit reports to decide whether a test failed.

@jackfrancis
Contributor

@sbueringer so you're suggesting that the exit code result is < 100% reliable in terms of terminal outcome, and we need to fall back to introspecting junit whenever we get an exit 0 because the actual non-successful terminal outcome may only exist in junit output?

@sbueringer
Member

I didn't get a non-zero exit code when I reproduced it as I described it on the other issue.

I don't know / remember whether we even get the exit code from the conformance tests

@jackfrancis
Contributor

This suggests that we (capi e2e) are getting the exit code from the conformance test.

@sbueringer
Member

But that's not what we see in prow, right?

@jackfrancis
Contributor

In @CecileRobertMichon's linked repro prow did get a zero exit code:

    + EXIT_VALUE=0

@sbueringer This is why I wonder if your comment suggests the edge case we're after is actually:

  1. conformance test run (kubetest runtime) returned exit code 0
  2. a test failure was observed during the run and recorded in junit output

Because we're only looking at the exit code from the kubetest runtime, we report success.
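A sketch of the guard that edge case implies (hypothetical helpers, reusing the flat <testsuite> parsing from the sketch in the issue body; this is not existing capz code):

    package e2e

    import (
    	"encoding/xml"
    	"os"
    )

    // junitHasFailures reports whether a junit file records failed or
    // errored specs (flat <testsuite> schema assumed, as in the earlier
    // sketch).
    func junitHasFailures(path string) (bool, error) {
    	data, err := os.ReadFile(path)
    	if err != nil {
    		return false, err
    	}
    	var s struct {
    		Failures int `xml:"failures,attr"`
    		Errors   int `xml:"errors,attr"`
    	}
    	if err := xml.Unmarshal(data, &s); err != nil {
    		return false, err
    	}
    	return s.Failures > 0 || s.Errors > 0, nil
    }

    // conformancePassed trusts a non-zero exit code from the kubetest
    // runtime, but on exit 0 falls back to the junit report before
    // declaring success.
    func conformancePassed(exitCode int, junitPath string) (bool, error) {
    	if exitCode != 0 {
    		return false, nil // the runtime already signalled failure
    	}
    	failed, err := junitHasFailures(junitPath)
    	if err != nil {
    		return false, err // an unreadable report must not count as a pass
    	}
    	return !failed, nil
    }

Whether such a check belongs in the capz spec or in cluster-api's kubetest wrapper is the open question in kubernetes-sigs/cluster-api#5123.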

@sbueringer
Member

Yup, that's what I was trying to say.

I think in your case here the e2e test would actually fail.

@jackfrancis
Contributor

@sbueringer agreed

In which case, I think we may be misconfiguring kubetest? The configuration surface area is so vast; perhaps the errors observed in these scenarios are configured to be "swallowed" and only recorded as historical details (in the junit output). Maybe we need to look more into how we configure these tests and ensure that, for tests we require to pass, failures bubble up to the exit code signal.

@sbueringer
Member

sbueringer commented Feb 6, 2023

Might be. I'm not 100% sure we actually retrieve the exit code of ginkgo correctly today.

Up until a few weeks ago we were running the container with restart enabled, so it was restarting after the test was done.
Hm, maybe that plays into what you're seeing. Could it be that ContainerInspect is looking at the newly restarted container instead of the run that just failed?
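A sketch of that suspicion using the Docker Go SDK (illustrative, not the cluster-api code): if a restart policy is in effect, inspecting after the run can observe a fresh instance whose exit code no longer reflects the failed run.

    package main

    import (
    	"context"
    	"fmt"
    	"os"

    	"github.com/docker/docker/client"
    )

    func main() {
    	ctx := context.Background()
    	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    	if err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}
    	defer cli.Close()

    	containerID := os.Args[1] // the kubetest container to examine

    	info, err := cli.ContainerInspect(ctx, containerID)
    	if err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}

    	// With a restart policy set, the failed run may already have been
    	// replaced, so State.ExitCode belongs to the new instance, not to
    	// the run that just failed.
    	if !info.HostConfig.RestartPolicy.IsNone() || info.State.Restarting || info.State.Running {
    		fmt.Println("warning: exit code may belong to a restarted instance")
    	}
    	fmt.Printf("exit code: %d (finished at %s)\n", info.State.ExitCode, info.State.FinishedAt)
    }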

xref: fix in CAPI: kubernetes-sigs/cluster-api#7946

Are you using a version which already has the fix? (e.g. v1.2.10 or v1.3.3)

@jackfrancis
Contributor

Interesting, have you observed a repro since that change?

CAPZ update to v1.3.3 in-progress: #3136

@sbueringer
Member

I didn't observe one. But neither did I see a case where a conformance test case failed and then the job failed (it looks like the conformance tests were all green over the last few days)

@CecileRobertMichon
Contributor

@jackfrancis is this issue still active?

@jackfrancis
Contributor

@CecileRobertMichon I'm not aware of any active investigation into this.

I audited this page's history of successful results and didn't encounter any that exhibited the documented symptoms here. Perhaps we should close this issue and re-open if we observe this repro in the future?

@CecileRobertMichon
Contributor

I was able to find one where a conformance test case failed and the job actually failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1650489167919976448

So it looks like this is now fixed

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closing this issue.

In response to this:

I was able to find one where a conformance test case failed and the job actually failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1650489167919976448

So it looks like this is now fixed

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
