conformance_test.go exit code is 0 even when a non-last test suite has failures #2265
@bingbing8 I'd like to ensure we address this. In your original report you mention 2 test suites; do you mean the one from upstream kubernetes and the one from capz? |
@jackfrancis (when you're back in the new year): here is an example of this in practice: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2941/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1603522577655926784 The GitHub status reports success, even though the conformance suite had 1 failed spec. |
After an initial investigation, it appears that cluster-api itself wraps kubetest inside a container execution lifecycle. The above test output suggests that the underlying container, though its runtime reported 1 error, did not return a non-zero exit code. Specifically, I'm not seeing evidence that this conditional path was reached: If it had been reached, then the capz test runner would get this back: Which would trigger this: One thing that is interesting is that cluster-api is wrapping the error response back from the cc @sbueringer @fabriziopandini in case you have any context here |
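For illustration, here is a minimal sketch of the propagation path being discussed, assuming a plain docker CLI wrapper; runConformanceContainer, the image reference, and the arguments are hypothetical stand-ins and not the actual CAPI kubetest code. If the branch that converts the runtime's exit code into an error is never reached, the caller sees a nil error and the failure disappears.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runConformanceContainer is a hypothetical helper, not a real CAPI function.
// It runs a conformance image via the docker CLI and turns a non-zero
// container exit code into a Go error so the caller cannot miss it.
func runConformanceContainer(image string, args ...string) error {
	cmd := exec.Command("docker", append([]string{"run", "--rm", image}, args...)...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			// The runtime reported a failure; surface it instead of swallowing it.
			return fmt.Errorf("error container run failed with exit code %d", exitErr.ExitCode())
		}
		return fmt.Errorf("unable to run conformance tests: %w", err)
	}
	return nil
}

func main() {
	// "example.com/conformance:v1.x" is a placeholder image reference.
	if err := runConformanceContainer("example.com/conformance:v1.x"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```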
We have the same issue here: kubernetes-sigs/cluster-api#5123 (comment) Essentially we never look into the junit report generated by the conformance tests. So we also don't fail our e2e test if there were errors in the conformance tests. I documented how to easily reproduce it. I think essentially we just need someone to pick up the work. |
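A minimal sketch of the missing junit check described above, assuming a report with a standard testsuite root element; checkJunitReport and the struct are illustrative, not the actual CAPI e2e framework code.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// junitSuite models only the standard junit attributes; the exact root
// element written by ginkgo can differ between versions, so treat this as a
// sketch rather than a drop-in parser.
type junitSuite struct {
	XMLName  xml.Name `xml:"testsuite"`
	Tests    int      `xml:"tests,attr"`
	Failures int      `xml:"failures,attr"`
	Errors   int      `xml:"errors,attr"`
}

// checkJunitReport is a hypothetical helper that fails when the report
// recorded any failing or erroring specs.
func checkJunitReport(path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("reading junit report %q: %w", path, err)
	}
	var suite junitSuite
	if err := xml.Unmarshal(data, &suite); err != nil {
		return fmt.Errorf("parsing junit report %q: %w", path, err)
	}
	if suite.Failures > 0 || suite.Errors > 0 {
		return fmt.Errorf("%q reports %d failures and %d errors", path, suite.Failures, suite.Errors)
	}
	return nil
}

func main() {
	if err := checkJunitReport("junit.e2e_suite.1.xml"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```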
@sbueringer that sounds like a lower-level explanation that would produce these outcomes. Let's keep this CAPZ issue open and move our practical energies towards fixing things to kubernetes-sigs/cluster-api#5123 |
So I ran a follow-up local test, and got an error during the conformance test:
That was eventually bubbled up thusly:
And then:
My CLI session confirmed:
Also worth noting that my comment above about "the kubetest container will always return zero" is definitely not correct:
It's arguable that "Unable to run conformance tests" is not the correct wrapping text, but regardless the underlying error that it wraps ("error container run failed with exit code 1") suggests that capi's kubetest abstractions are honoring the underlying kubetest runtime exit code. So... I'm kind of confused. |
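As a small stdlib-only illustration of why the wrapping text is secondary (errContainerRun is a stand-in for the wrapped error, not the real CAPI error value): as long as the helper returns a non-nil error, wrapping with %w keeps the underlying cause reachable and the caller can still fail on it.

```go
package main

import (
	"errors"
	"fmt"
)

// errContainerRun stands in for the error returned by the kubetest wrapper.
var errContainerRun = errors.New("error container run failed with exit code 1")

func main() {
	// The wrapping text is cosmetic; what matters is that a non-nil error is
	// returned at all, and %w keeps the original cause reachable.
	wrapped := fmt.Errorf("Unable to run conformance tests: %w", errContainerRun)

	fmt.Println(errors.Is(wrapped, errContainerRun)) // true
	fmt.Println(wrapped)
}
```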
Did you check if the failures show in the junit reports? Eventually that's what gets bubbled up to Testgrid as the "green" pass |
So it looks like one has SuiteSucceeded |
In which case this would be a testgrid bug? (Or maybe we need to explicitly always only publish one junit file, or maybe we're misnaming the junit file?) |
I think this is not correct. While testgrid shows the junit results, the junit results don't control whether the test itself succeeds or not. That depends on what our e2e test reports back via its exit code. This is then stored in |
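A sketch of that distinction, using placeholder functions rather than the real framework calls: the junit file is written for reporting either way, but only the returned error feeds the process exit code, and the exit code is what ends up recorded as the job result.

```go
package main

import (
	"fmt"
	"os"
)

// runSuite and writeJunit are placeholders, not the real e2e framework calls.
func runSuite() error { return fmt.Errorf("1 spec failed") }

func writeJunit(path string) error {
	return os.WriteFile(path, []byte(`<testsuite tests="1" failures="1"/>`), 0o644)
}

func main() {
	err := runSuite()

	// The junit report is written either way; it is only reporting.
	_ = writeJunit("junit.e2e_suite.1.xml")

	// Only the process exit code decides whether the job is recorded as green.
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```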
I wonder if you hit something else in #2265 (comment). This job reports something else: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2941/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1603522577655926784 You see the error logs from the conformance test and then our test returns PASS. And that happens, as I described above, because our test framework does not read the junit reports to decide whether a test failed or not. |
@sbueringer so you're suggesting that the exit code result is < 100% reliable in terms of terminal outcome, and we need to fall back to introspecting junit whenever we get an exit 0 because the actual non-successful terminal outcome may only exist in junit output? |
I didn't get a non-zero exit code when I reproduced it as I described on the other issue. I don't know / remember whether we are even getting the exit code from the conformance tests |
This suggests that we (capi e2e) are getting the exit code from the conformance test. |
But that's not what we see in prow, right? |
In @CecileRobertMichon's linked repro prow did get a zero exit code:
@sbueringer This is why I wonder if your comment suggests the edge case we're after is actually:
Because we're only looking at the exit code from the kubetest runtime, we report success. |
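That edge case suggests a fallback along these lines; conformancePassed is a hypothetical helper (check would be something like the checkJunitReport sketch above): trust a non-zero exit code, but when the runtime exits 0, inspect the junit reports before declaring success.

```go
package e2e

import "fmt"

// conformancePassed is a hypothetical fallback, not actual CAPI/CAPZ code.
func conformancePassed(exitCode int, junitPaths []string, check func(string) error) error {
	if exitCode != 0 {
		return fmt.Errorf("conformance run exited with code %d", exitCode)
	}
	for _, path := range junitPaths {
		if err := check(path); err != nil {
			// Exit code was 0 but the report still recorded failures: fail anyway.
			return fmt.Errorf("exit code was 0 but %s reports failures: %w", path, err)
		}
	}
	return nil
}
```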
Yup, that's what I was trying to say. I think in your case here the e2e test would actually fail. |
@sbueringer agreed. In which case, I think we may be misconfiguring kubetest? The configuration surface area is so vast that perhaps the errors observed in these scenarios are configured to be "swallowed" and only returned as historical details (in the junit output). Maybe we need to look more into how we configure these tests and ensure that, for tests we require to pass, failures bubble up to the exit code signal. |
Might be. I'm not 100% sure we actually retrieve the exit code of ginkgo correctly today. Up until a few weeks ago we were running the container with restart enabled, so it was restarting after the test was done. xref: fix in CAPI: kubernetes-sigs/cluster-api#7946 Are you using a version which already has the fix? (e.g. v1.2.10 or v1.3.3) |
Interesting, have you observed a repro since that change? CAPZ update to v1.3.3 in-progress: #3136 |
I didn't observe one. But I also haven't seen a case where a conformance test case failed and then the job failed (it looks like the conformance tests have been green for the last few days) |
@jackfrancis is this issue still active? |
@CecileRobertMichon I'm not aware of any active investigation into this. I audited this page's history of successful results and didn't encounter any that exhibited the documented symptoms here. Perhaps we should close this issue and re-open if we observe this repro in the future? |
I was able to find one where a conformance test case failed and the job actually failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1650489167919976448 So looks like this is now fixed /close |
@CecileRobertMichon: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/kind bug
What steps did you take and what happened:
./conformance_test.go
What did you expect to happen:
Normally 2 test suites are run (junit.e2e_suite.1.xml is the last one). If the first suite has failures but the second suite (junit.e2e_suite.1.xml) passes, the exit code of conformance_test.go is 0. I expect the script to exit with an error to indicate that at least one of the test suites had failures.
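A minimal sketch of the expected behaviour, assuming the junit.e2e_suite.*.xml naming used above; anySuiteFailed and check are illustrative helpers, not the actual conformance_test.go code. The point is to consider every report, not just the last one.

```go
package e2e

import (
	"fmt"
	"path/filepath"
)

// anySuiteFailed walks every junit.e2e_suite.*.xml in the artifacts directory
// and fails if any of them recorded failures. check would be something like
// the junit-parsing helper sketched earlier in the thread.
func anySuiteFailed(artifactsDir string, check func(string) error) error {
	reports, err := filepath.Glob(filepath.Join(artifactsDir, "junit.e2e_suite.*.xml"))
	if err != nil {
		return err
	}
	for _, report := range reports {
		if err := check(report); err != nil {
			return fmt.Errorf("suite %s failed: %w", report, err)
		}
	}
	return nil
}
```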
Anything else you would like to add:
Environment:
- Kubernetes version (use kubectl version): 1.23
- OS (e.g. from /etc/os-release): ws2022