[CI] Builds failing due to Gradle test executor crash #52610
Pinging @elastic/es-core-infra (:Core/Infra/Build)
This is still happening intermittently. The two cases today so far are:
Another one recently:
Another one on 7.x intake https://gradle-enterprise.elastic.co/s/xaekqb5ejkpmc |
Another (7.x)
Another (master)
Another occurrence on 7.x: https://gradle-enterprise.elastic.co/s/sjj3dzchr5nau
@mark-vieira Any suggestion on how we can track down these failures? They appear to still be happening. Perhaps @breskeby could take a look?
Do we have any way to reproduce one of these cases? I can have a closer look to figure out the root cause, but it's hard with no way of reproducing, and they seem to occur quite rarely (but often enough to be worrying). If you see one of these, please add a Gradle Enterprise link to the corresponding build.
@breskeby I believe this only happens in CI. I have never seen it locally. Here is one of those failures from today:
My current guess is that we have internal cluster tests spinning nodes up on ports conflicting with either the daemon or test executor. I tweaked the way we capture the build logs for the GCP upload to include the full JUnit XML reports (which include stdout). The intention was to see if I can find the daemon/worker port anywhere in those logs. I haven't had a chance to actually go back and investigate after adding the logging, though.
I doubt this issue is related to port conflicts. My best guess at the moment is that something in the tests crashes the test executor JVM (e.g. calling System.exit(1) explicitly). Another potential issue might be tests conflicting, or using a custom SecurityManager. Looking at some test fixtures I see explicit System.exit(1) calls in there. I'll dig a bit deeper in this direction.
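For context, a minimal, generic sketch of the technique being hinted at (illustrative only, not the actual Elasticsearch build or test-framework code): installing a SecurityManager whose checkExit throws, so a stray System.exit() in a test fails loudly with a stack trace instead of silently killing the test-executor JVM.

```java
import java.security.Permission;

/**
 * Illustrative sketch: traps System.exit() calls so a rogue test fails with a
 * stack trace instead of silently killing the test-executor JVM.
 */
public final class ExitTrappingSecurityManager extends SecurityManager {

    @Override
    public void checkExit(int status) {
        // Abort the exit and surface who called it.
        throw new SecurityException("System.exit(" + status + ") called from a test");
    }

    @Override
    public void checkPermission(Permission perm) {
        // Permit everything else so normal test behaviour is unchanged.
    }

    public static void install() {
        System.setSecurityManager(new ExitTrappingSecurityManager());
    }
}
```

(SecurityManager has since been deprecated in newer JDKs, but at the time of this thread it was the standard way to intercept unexpected exits.)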
When I initially looked at the daemon logs there were messages about failing to communicate with the test worker. We had an identical issue with a regression in a recent Gradle release regarding local interface binding.
We might be seeing similar failures with different root causes, then. With the latest build Ryan mentioned, the daemon logs look fine.
Another case in https://gradle-enterprise.elastic.co/s/koolcoru744pi
Another case in https://gradle-enterprise.elastic.co/s/uo3zx5kuf6yak
Another case in https://gradle-enterprise.elastic.co/s/ez5axqidlgvxk
Yet another case in https://gradle-enterprise.elastic.co/s/cayul3sa6ufow
Another one: https://gradle-enterprise.elastic.co/s/2uspiahitbuwm
Another one: https://gradle-enterprise.elastic.co/s/kcx3euig3ntfc
This is definitely one of the worst non-test-related build issues we have. Still at a loss for ideas as to what is causing this or how best to mitigate it.
@breskeby this is starting to be really problematic. We have at least a few of these every day. Any thoughts?
No idea tbh what's causing this. I can invest some time next week to look into this. It's hard to diagnose as it happens so irregularly.
Or some other mitigation? It would be great if we could retry in this scenario, but I think the issue is that the test worker is dying.
I'm wondering if we could take some inspiration from the test retry plugin to re-execute when we encounter a connect exception? https://github.com/gradle/test-retry-gradle-plugin
Related to #52610, this PR introduces a rerun of all tests for a test task if the test JVM has crashed because of a system exit. We furthermore log the tests that were active at the time of the system exit as potential causes. We also modified the build scan logic to track unexpected test JVM exits with the tag `unexpected-test-jvm-exit`.
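To illustrate the "which tests were active" part, here is a rough sketch built only on Gradle's public TestListener API; it is an assumption about the approach, not the PR's actual implementation:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.gradle.api.tasks.testing.TestDescriptor;
import org.gradle.api.tasks.testing.TestListener;
import org.gradle.api.tasks.testing.TestResult;

/**
 * Sketch only (not the PR's code): remembers which tests are in flight so
 * that anything still "active" when the run ends abnormally can be logged
 * as a likely source of the unexpected JVM exit.
 */
public class ActiveTestTrackingListener implements TestListener {

    private final Set<String> activeTests = ConcurrentHashMap.newKeySet();

    @Override
    public void beforeSuite(TestDescriptor suite) {
    }

    @Override
    public void afterSuite(TestDescriptor suite, TestResult result) {
        // The root suite has no parent; anything left in the set never reported a result.
        if (suite.getParent() == null && activeTests.isEmpty() == false) {
            System.err.println("Tests still running when the test JVM exited: " + activeTests);
        }
    }

    @Override
    public void beforeTest(TestDescriptor test) {
        activeTests.add(test.getClassName() + "." + test.getName());
    }

    @Override
    public void afterTest(TestDescriptor test, TestResult result) {
        activeTests.remove(test.getClassName() + "." + test.getName());
    }
}
```

Registered on a Test task via addTestListener(...), any entry left in the set when the root suite finishes points at tests that started but never reported a result.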
This should be addressed by #71881.
@mark-vieira found this issue in test triage today by googling the exception line, so maybe you want to take another look even though this is closed? Build: https://gradle-enterprise.elastic.co/s/3zxtbqrj2zivg
Yeah, we had a fix but it caused other issues so we reverted. Given that, I think it makes sense to reopen this for visibility.
Happened again: https://gradle-enterprise.elastic.co/s/oerbc2dcndaf6
We saw a different but similar failure today here: https://gradle-enterprise.elastic.co/s/eicalr3x7xio4/console-log?task=:x-pack:plugin:spatial:test
And another one on 7.x: https://gradle-enterprise.elastic.co/s/hwjqosndrc5ms/console-log?task=:x-pack:plugin:data-streams:test
Another one: https://gradle-enterprise.elastic.co/s/f3iqid2d5utco
These failures today on 7.15 and 7.16 are of the same nature:
Failed on master intake https://gradle-enterprise.elastic.co/s/fgvnxyqde5day/console-log#L5001
Failed for 7.16 arm64 https://gradle-enterprise.elastic.co/s/k7lmeapnpv7ya/console-log#L2682
Pinging @elastic/es-delivery (Team:Delivery)
Here is another failure of this nature, this time with a SEGV in the JVM:
Where did you find that error output? I couldn't find anything relevant in the GCP upload.
How did I not see that? 🤦 We can start by making sure we capture
The last batch of these were caused by an interaction between Lucene and a JVM bug. That's been resolved, so I'm going to close this for now. We can open a new issue if this starts happening again for some other cause.
We are seeing occasional instances of errors that look like this:
Looking at the console and daemon logs we see these exceptions:
The strange thing about these messages is that none of the ports mentioned here belong to either the daemon server or the Gradle client. So who's talking to whom here, and where are they getting these ports from? My other thought was that maybe one of these is an ES test cluster node, but none of the testclusters are using these ports. Perhaps internal test clusters?
I'm going to reach out to the folks at Gradle to try and track this down.