
conformance_test.go exits with code 0 even when a non-last test suite has failures #2265



Closed
bingbing8 opened this issue Apr 27, 2022 · 32 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bingbing8

bingbing8 commented Apr 27, 2022

/kind bug


What steps did you take and what happened:
./conformance_test.go

What did you expect to happen:
Normally 2 test suites run (junit.e2e_suite.1.xml is the last one). If the first suite has failures but the second suite (junit.e2e_suite.1.xml) passes, the exit code of conformance_test.go is 0. I expect the script to exit with an error whenever any of the test suites has failures.

Anything else you would like to add:
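For illustration, a minimal sketch (not the project's actual code; the artifact location and the flat <testsuite> junit schema are assumptions) of the aggregation behavior expected here: scan every junit report from the run and exit non-zero if any suite recorded failures or errors.

    package main

    import (
    	"encoding/xml"
    	"fmt"
    	"os"
    	"path/filepath"
    )

    // suite captures the junit attributes relevant to pass/fail aggregation.
    type suite struct {
    	Name     string `xml:"name,attr"`
    	Failures int    `xml:"failures,attr"`
    	Errors   int    `xml:"errors,attr"`
    }

    func main() {
    	// Assumed artifact layout; real runs may place reports elsewhere.
    	reports, err := filepath.Glob("_artifacts/junit.e2e_suite.*.xml")
    	if err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}
    	exitCode := 0
    	for _, path := range reports {
    		data, err := os.ReadFile(path)
    		if err != nil {
    			fmt.Fprintln(os.Stderr, err)
    			exitCode = 1
    			continue
    		}
    		var s suite
    		if err := xml.Unmarshal(data, &s); err != nil {
    			fmt.Fprintln(os.Stderr, err)
    			exitCode = 1
    			continue
    		}
    		if s.Failures > 0 || s.Errors > 0 {
    			fmt.Printf("%s: %d failures, %d errors\n", path, s.Failures, s.Errors)
    			exitCode = 1 // any failing suite fails the whole run
    		}
    	}
    	os.Exit(exitCode)
    }

Wired into CI, any failing suite would then flip the job red regardless of which suite ran last.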

Environment:

  • cluster-api-provider-azure version: 1.2
  • Kubernetes version: (use kubectl version): 1.23
  • OS (e.g. from /etc/os-release): ws2022
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 27, 2022
@bingbing8 bingbing8 changed the title conformance_test.go exit code to 0 when the last test suite passes → conformance_test.go exits with code 0 even when a non-last test suite has failures Apr 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@jackfrancis
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 24, 2022
@CecileRobertMichon
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 28, 2022
@jackfrancis
Contributor

/assign

@jackfrancis jackfrancis added this to the v1.7 milestone Nov 28, 2022
@jackfrancis
Contributor

@bingbing8 I'd like to ensure we address this. In your original report you mention 2 test suites; do you mean the one from upstream Kubernetes and the one from capz?

@CecileRobertMichon
Contributor

@jackfrancis (when you're back in the new year): here is an example of this in practice: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2941/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1603522577655926784

The GitHub status reports success, even though the conformance suite had 1 failed spec.

[screenshot: green GitHub status on the PR despite the failed conformance spec]

@jackfrancis
Contributor

After an initial investigation, it appears that cluster-api itself wraps kubetest inside a container execution lifecycle. And the above test output suggests that the underlying container, though its runtime reported 1 error, did not return a non-zero exit code.

Specifically, I'm not seeing evidence that this conditional path was reached:

https://github.com/kubernetes-sigs/cluster-api/blob/v1.3.2/test/infrastructure/container/docker.go#L500-L502

If that had been reached, then the capz test runner would get this back:

https://github.com/kubernetes-sigs/cluster-api/blob/v1.3.2/test/framework/kubetest/run.go#L194-L196

Which would trigger this:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/v1.7.1/test/e2e/conformance_test.go#L238

One thing that is interesting is that cluster-api wraps the error response from the RunContainer invocation in its kubetest package with this message: "Unable to run conformance tests". Does that suggest that whoever implemented this understood that running a kubetest E2E flow in this way would always return zero after execution, regardless of the error response from the actual kubetest runtime? Are we meant to be introspecting for terminal kubetest execution state in some other way?
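For reference, the propagation chain those three links describe, reduced to a runnable toy (the function names are stand-ins, not the actual cluster-api identifiers; the error strings match the ones quoted later in this thread):

    package main

    import (
    	"fmt"

    	"github.com/pkg/errors"
    )

    // runContainer stands in for cluster-api's docker.go behavior: a non-zero
    // container exit code is surfaced as an error rather than swallowed.
    func runContainer(exitCode int) error {
    	if exitCode != 0 {
    		return errors.Errorf("error container run failed with exit code %d", exitCode)
    	}
    	return nil
    }

    // runKubetest stands in for the kubetest wrapper, which adds its own
    // context on the way up.
    func runKubetest(exitCode int) error {
    	if err := runContainer(exitCode); err != nil {
    		return errors.Wrap(err, "Unable to run conformance tests")
    	}
    	return nil
    }

    func main() {
    	// A failing container produces the exact wrapped message the capz
    	// conformance spec would then fail on.
    	fmt.Println(runKubetest(1))
    }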

cc @sbueringer @fabriziopandini in case you have any context here

@sbueringer
Member

sbueringer commented Feb 2, 2023

We have the same issue here: kubernetes-sigs/cluster-api#5123 (comment)

Essentially we never look into the junit report generated by the conformance tests. So we also don't fail our e2e test if there were errors in the conformance tests.

I documented how to easily reproduce it. I think essentially we just need someone to pick up the work.

@jackfrancis
Contributor

@sbueringer that sounds like a lower-level explanation that would produce these outcomes.

Let's keep this CAPZ issue open and direct our practical energies toward fixing kubernetes-sigs/cluster-api#5123

@jackfrancis
Contributor

So I ran a follow-up local test, and got an error during the conformance test:

    [FAILED] error while waiting for pod svcaccounts-2356/oidc-discovery-validator to be Succeeded or Failed: pod "oidc-discovery-validator" failed with status: {Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2023-02-02 20:27:57 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2023-02-02 20:28:14 +0000 UTC Reason:PodFailed Message:} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2023-02-02 20:28:14 +0000 UTC Reason:PodFailed Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2023-02-02 20:27:57 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.1.0.5 PodIP:192.168.12.82 PodIPs:[{IP:192.168.12.82}] StartTime:2023-02-02 20:27:57 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:oidc-discovery-validator State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2023-02-02 20:28:01 +0000 UTC,FinishedAt:2023-02-02 20:28:12 +0000 UTC,ContainerID:containerd://ffadf26019d4692c5095d9f6399d9676437db1c29622d144365b902d91173bd9,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:registry.k8s.io/e2e-test-images/agnhost:2.43 ImageID:registry.k8s.io/e2e-test-images/agnhost@sha256:16bbf38c463a4223d8cfe4da12bc61010b082a79b4bb003e2d3ba3ece5dd5f9e ContainerID:containerd://ffadf26019d4692c5095d9f6399d9676437db1c29622d144365b902d91173bd9 Started:0xc003584c1d}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
    In [It] at: test/e2e/auth/service_accounts.go:637 @ 02/02/23 20:28:48.077

That was eventually bubbled up as follows:

  Summarizing 1 Failure:
    [FAIL] [sig-auth] ServiceAccounts [It] ServiceAccountIssuerDiscovery should support OIDC discovery of service account issuer [Conformance]
    test/e2e/auth/service_accounts.go:637
  
  Ran 338 of 7084 Specs in 2168.955 seconds
  FAIL! -- 337 Passed | 1 Failed | 0 Pending | 6746 Skipped

And then:

  [FAILED] in [It] - /Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:238 @ 02/02/23 13:02:03.43
• [FAILED] [4064.682 seconds]
Conformance Tests [It] conformance-tests
/Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:98

  [FAILED] Unexpected error:
      <*errors.withStack | 0xc0009e64c8>: {
          error: <*errors.withMessage | 0xc001908840>{
              cause: <*errors.errorString | 0xc000680710>{
                  s: "error container run failed with exit code 1",
              },
              msg: "Unable to run conformance tests",
          },
          stack: [0x5f43e59, 0x61f2927, 0x44e34fb, 0x44f79f8, 0x4077281],
      }
      Unable to run conformance tests: error container run failed with exit code 1
  occurred
  In [It] at: /Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:238 @ 02/02/23 13:02:03.43

  Full Stack Trace
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.glob..func3.2()
    	/Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:238 +0x18fa

My CLI session confirmed:

    $ echo $?
    2

Also worth noting that my comment above about "the kubetest container will always return zero" is definitely not correct:

Unable to run conformance tests: error container run failed with exit code 1

It's arguable that "Unable to run conformance tests" is not the correct wrapping text, but regardless, the underlying error it wraps ("error container run failed with exit code 1") suggests that capi's kubetest abstractions are honoring the underlying kubetest runtime exit code.

So... I'm kind of confused.

@CecileRobertMichon
Contributor

Did you check whether the failures show up in the junit reports? Ultimately that's what gets bubbled up to Testgrid as the "green" pass

@jackfrancis
Contributor

junit.zip

cc @CecileRobertMichon

@CecileRobertMichon
Contributor

So it looks like one has SuiteSucceeded true while the other has false? Seems like that may be the issue

@jackfrancis
Contributor

In which case, would this be a Testgrid bug? (Or maybe we need to always publish only one junit file, or maybe we're misnaming the junit file?)

@sbueringer
Member

sbueringer commented Feb 3, 2023

Did you check whether the failures show up in the junit reports? Ultimately that's what gets bubbled up to Testgrid as the "green" pass

I think this is not correct. While Testgrid shows the junit results, the junit results don't control whether the test itself succeeds or not. That depends on what our e2e test reports back via its exit code. This is then stored in finished.json and shown e.g. by Spyglass (the "per-job" UI)
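For context, a sketch of that verdict (illustrative only; the fields follow prow's well-known finished.json format, but this is not prow source code):

    package main

    import (
    	"encoding/json"
    	"fmt"
    )

    // finished models the fields prow records in finished.json. The verdict
    // is derived from the e2e process exit code; junit files do not feed
    // into it.
    type finished struct {
    	Timestamp int64  `json:"timestamp"`
    	Passed    bool   `json:"passed"`
    	Result    string `json:"result"` // "SUCCESS" or "FAILURE"
    }

    func main() {
    	exitCode := 0 // what the test process returned, regardless of junit content
    	result := "FAILURE"
    	if exitCode == 0 {
    		result = "SUCCESS"
    	}
    	out, _ := json.Marshal(finished{Timestamp: 1675400000, Passed: exitCode == 0, Result: result})
    	fmt.Println(string(out)) // the "green" that Spyglass and the job status show
    }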

@sbueringer
Member

I wonder if you hit something else in #2265 (comment).

This job reports something else: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2941/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1603522577655926784

You see the error logs from the conformance test and then our test returns PASS. And that happens, as I described above, because our test framework does not read the junit reports to decide whether a test failed.

@jackfrancis
Contributor

@sbueringer so you're suggesting that the exit code result is < 100% reliable in terms of terminal outcome, and we need to fall back to introspecting junit whenever we get an exit 0 because the actual non-successful terminal outcome may only exist in junit output?

@sbueringer
Member

I didn't get a non-zero exit code when I reproduced it as I described it on the other issue.

I don't know / remember whether we even get the exit code from the conformance tests

@jackfrancis
Contributor

This suggests that we (capi e2e) are getting the exit code from the conformance test.

@sbueringer
Member

But that's not what we see in prow, right?

@jackfrancis
Contributor

In @CecileRobertMichon's linked repro prow did get a zero exit code:

    + EXIT_VALUE=0

@sbueringer This is why I wonder if your comment suggests the edge case we're after is actually:

  1. conformance test run (kubetest runtime) returned exit code 0
  2. a test failure was observed during the run and recorded in junit output

Because we're only looking at the exit code from the kubetest runtime, we report success.
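A sketch of the guard that edge case implies (hypothetical helpers, reusing the flat <testsuite> parsing from the sketch in the issue body; this is not existing capz code):

    package e2e

    import (
    	"encoding/xml"
    	"os"
    )

    // junitHasFailures reports whether a junit file records failed or
    // errored specs (flat <testsuite> schema assumed, as in the earlier
    // sketch).
    func junitHasFailures(path string) (bool, error) {
    	data, err := os.ReadFile(path)
    	if err != nil {
    		return false, err
    	}
    	var s struct {
    		Failures int `xml:"failures,attr"`
    		Errors   int `xml:"errors,attr"`
    	}
    	if err := xml.Unmarshal(data, &s); err != nil {
    		return false, err
    	}
    	return s.Failures > 0 || s.Errors > 0, nil
    }

    // conformancePassed trusts a non-zero exit code from the kubetest
    // runtime, but on exit 0 falls back to the junit report before
    // declaring success.
    func conformancePassed(exitCode int, junitPath string) (bool, error) {
    	if exitCode != 0 {
    		return false, nil // the runtime already signalled failure
    	}
    	failed, err := junitHasFailures(junitPath)
    	if err != nil {
    		return false, err // an unreadable report must not count as a pass
    	}
    	return !failed, nil
    }

Whether such a check belongs in the capz spec or in cluster-api's kubetest wrapper is the open question in kubernetes-sigs/cluster-api#5123.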

@sbueringer
Member

Yup, that's what I was trying to say.

I think in your case here the e2e test would actually fail.

@jackfrancis
Contributor

@sbueringer agreed

In which case, I think we may be misconfiguring kubetest? The configuration surface area is so vast; perhaps the errors observed in these scenarios are configured to be "swallowed" and only recorded as historical details (in the junit output). Maybe we need to look more into how we configure these tests and ensure that, for tests we require to pass, failures bubble up to the exit code signal.

@sbueringer
Member

sbueringer commented Feb 6, 2023

Might be. I'm not 100% sure we actually retrieve the exit code of ginkgo correctly today.

Up until a few weeks ago we were running the container with restart enabled, so it was restarting after the test was done.
Hm, maybe that plays into what you're seeing. Could it be that ContainerInspect is looking at the newly restarted container instead of the run that just failed?
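A sketch of that suspicion using the Docker Go SDK (illustrative, not the cluster-api code): if a restart policy is in effect, inspecting after the run can observe a fresh instance whose exit code no longer reflects the failed run.

    package main

    import (
    	"context"
    	"fmt"
    	"os"

    	"github.com/docker/docker/client"
    )

    func main() {
    	ctx := context.Background()
    	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    	if err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}
    	defer cli.Close()

    	containerID := os.Args[1] // the kubetest container to examine

    	info, err := cli.ContainerInspect(ctx, containerID)
    	if err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}

    	// With a restart policy set, the failed run may already have been
    	// replaced, so State.ExitCode belongs to the new instance, not to
    	// the run that just failed.
    	if !info.HostConfig.RestartPolicy.IsNone() || info.State.Restarting || info.State.Running {
    		fmt.Println("warning: exit code may belong to a restarted instance")
    	}
    	fmt.Printf("exit code: %d (finished at %s)\n", info.State.ExitCode, info.State.FinishedAt)
    }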

xref: fix in CAPI: kubernetes-sigs/cluster-api#7946

Are you using a version which already has the fix? (e.g. v1.2.10 or v1.3.3)

@jackfrancis
Contributor

Interesting, have you observed a repro since that change?

CAPZ update to v1.3.3 in-progress: #3136

@sbueringer
Member

I didn't observe one. But neither did I see a case where a conformance test case failed and then the job failed (it looks like the conformance tests were all green over the last few days)

@CecileRobertMichon
Contributor

@jackfrancis is this issue still active?

@jackfrancis
Contributor

@CecileRobertMichon I'm not aware of any active investigation into this.

I audited this page's history of successful results and didn't encounter any that exhibited the documented symptoms here. Perhaps we should close this issue and re-open if we observe this repro in the future?

@CecileRobertMichon
Contributor

I was able to find one where a conformance test case failed and the job actually failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1650489167919976448

So it looks like this is now fixed

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closing this issue.

In response to this:

I was able to find one where a conformance test case failed and the job actually failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1650489167919976448

So it looks like this is now fixed

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
