Multiple Test Failures from Blocked accept0 Syscalls (Debian CI Runs only) #43387
Pinging @elastic/es-distributed
@henningandersen it seems the linked issue only affects … That said, I'm wondering if our …
One interesting find regarding the …
(note that this is OSX though and KQueue might show different behavior here than EPoll). I'll have to look into this some more, though a short pause causing this isn't so great either since we can always have those on CI ... still, this might not be related.
Yeah, I am also in doubt, but find multiple overlapping symptoms: …
But clearly there are many differences (network implementation, blocking vs non-blocking and more). I am not saying it is related, but wanted to bring this up in case it brings about new ideas of where the problem lies.
This is by far the weirdest and most suspicious part. Those threads each come with their own selector, and I don't ever see those transport threads sharing any monitor either ... but I also don't know of any system/kernel/... issue that would cause this behavior.
#43424 should help track this down quickly, I hope :)
The two new and the original failure are all on Debian 8, just like the similar failures in #24457.
@atorok @mark-vieira do you know if anything changed about our Debian 8 environment lately that would explain this issue now coming up? (change in kernel version or generally the environment coming back). EDIT: I guess this is also a function of the CPUs used ... probably impossible to track down in a Cloud env ... sorry for the noise
Debian 8 had a new 3.16.0-9 kernel on 2019-06-17. I think this matches when we first saw this? |
@henningandersen yeah, at least possibly; the first observed case was on the 19th (yesterday) as far as I can see.
Maybe we should take Debian 8 out of rotation? |
#24457 has failures from June 14th, so not too conclusive on that part. |
Still, the fact that it's Debian 8 only makes me not want to dig into the codebase too much here. With the rate of failures we're seeing, that is hardly a coincidence IMO.
Not that I'm aware of. Perhaps on the infra side, but certainly not in the Elasticsearch-specific CI images.
* Relates elastic#43387 which appears to run into blocking accept calls
@henningandersen the stuck threads are …
We can remove the … Our CI workers did pick up the kernel upgrade: …
@atorok it seems we have this issue pre and post that kernel upgrade, unfortunately. The production code shouldn't be affected since we run a different networking implementation in tests (prod uses Netty, tests use our own NIO implementation). This is assuming there isn't an issue here that in principle breaks … I'll try to see if failing REST tests on Debian 8 reveal anything (let's see if we may have a lot of timeouts on network messages there). Whether or not to take it out of intake builds I don't know. It fails like twice a day, and the issue should be easily traceable to this one? Not sure how bad the bit of noise is.
* Assert ServerSocketChannel is not Blocking * Relates #43387 which appears to run into blocking accept calls
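For context, a rough sketch of what such an assertion around accept could look like. This is only an illustration with assumed names (NonBlockingAcceptGuard, acceptNonBlocking), not the actual change from #43424:

```java
import java.io.IOException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

final class NonBlockingAcceptGuard {

    // Hypothetical helper: fail fast if the channel has somehow been flipped to blocking mode
    // before we ever reach the accept0 syscall.
    static SocketChannel acceptNonBlocking(ServerSocketChannel serverChannel) throws IOException {
        assert serverChannel.isBlocking() == false : "server channel unexpectedly in blocking mode";
        // On a non-blocking channel this returns immediately, null if no connection is pending.
        return serverChannel.accept();
    }
}
```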
Since this is currently not actionable (we have no recent failures), we decided to close this. To be reopened if this reoccurs once #47118 has ensured we are back to the same frequency of Debian runs. Note that a new Debian 8 kernel is out; it is highly uncertain, though, whether it fixes this issue...
It's back: https://gradle-enterprise.elastic.co/s/xx2un5eyjltnc :(
A quick recap: we are fairly certain this only happens with our test networking infrastructure; we have not seen evidence that this could be a problem in production. We have seen similar problems, including on Ubuntu, that were related to the kernel audit framework and auditbeat. More specifically, it didn't seem that auditbeat was at fault, just any audit client calling into the kernel. We solved this at the time by reducing the allowed rate limit, and this has worked on Ubuntu.
Then we need a way to ignore tests on certain operating system variants. Is this something that exists today (I assume we have Windows-only or Linux-only tests)?
We do have a way; it's clumsy, but we can comment out the OS from the matrix test.
From the testing so far: none of the failures encountered seem to relate to this issue.
I was thinking more of some kind of annotation you'd place on an individual unit test.
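For illustration, a minimal sketch of such a per-test skip, assuming JUnit 4's Assume API; the /etc/os-release check is a hypothetical way to detect the distribution, not an existing Elasticsearch facility:

```java
import static org.junit.Assume.assumeFalse;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.junit.Test;

public class DebianAwareTests {

    // Hypothetical OS check: read /etc/os-release and look for Debian 8.
    private static boolean runningOnDebian8() {
        Path osRelease = Paths.get("/etc/os-release");
        try {
            String content = new String(Files.readAllBytes(osRelease));
            return content.contains("ID=debian") && content.contains("VERSION_ID=\"8\"");
        } catch (IOException e) {
            return false; // not Linux, or no os-release file: don't skip
        }
    }

    @Test
    public void testThatHangsOnDebian8() {
        // Skip instead of failing on the affected OS variant.
        assumeFalse("blocked accept0 syscalls on Debian 8 CI workers", runningOnDebian8());
        // ... actual test body ...
    }
}
```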
@atorok that would confirm my suspicion that this is a GCP worker/box-specific problem in some form. I had a few cases where I could repeatedly reproduce it on one worker but failed to reproduce it over hours on another. Thinking about this some more, I wonder if this may just be a GCP problem with their Debian kernels. Could we maybe build Debian images some other way (using different kernels?) to test Debian? What it boils down to in the end is still that a non-blocking (according to docs and whatnot :)) sys-call is blocking in one specific environment. We verified that it actually is blocking by adding logging that ensures we're not just calling the …
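To illustrate the kind of verification described (a guess at the shape, not the actual logging that was added), one could time each accept() call on the non-blocking channel and flag any single call that takes long enough to indicate a genuinely blocked syscall rather than a tight loop of fast, non-blocking calls:

```java
import java.io.IOException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.TimeUnit;

final class AcceptTimingLogger {

    // Hypothetical sketch: measure how long a single accept() takes on a channel
    // that is configured as non-blocking.
    static SocketChannel timedAccept(ServerSocketChannel serverChannel) throws IOException {
        long startNanos = System.nanoTime();
        SocketChannel accepted = serverChannel.accept(); // should return immediately when non-blocking
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        if (elapsedMillis > 100) {
            // One slow call means we blocked inside the syscall itself,
            // not that we merely looped over many fast accept() calls.
            System.err.println("accept() took " + elapsedMillis + "ms on a non-blocking channel");
        }
        return accepted;
    }
}
```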
I agree; I also observed the same behavior. It might have some timing aspect to it as well: I noticed that if it failed on one VM it was more likely to fail on others too, but not on all. I noticed this when testing against multiple VMs at the same time.
Since we already removed Debian 8 and 9 from the general pool, I'm just going to close this.
Here's the GCP issue: https://issuetracker.google.com/issues/144375215
This continues to affect internal cluster test suites on Debian 8. Since that OS continues to be part of our support matrix, we need to either fix the underlying issue (seems unlikely) or ignore these tests.
Failed today on 7.x and Debian.
I've raised this with Infra now. Maybe we can fix this by simply moving the Debian tests to different testing infrastructure.
I just want to note that when I looked into this yesterday and ran a bunch of searches using the Gradle Enterprise stuff, it looked to me like Debian 8 and Debian 9 had the characteristic consistent failures due to things hanging (MockNioTransport obviously logged it; Netty and NIO transport instances just mysteriously hung). I did not see these types of failures for Debian 10. I'm not sure if there is something obviously different in how the infrastructure is set up there.
Can you elaborate on what you mean by "different testing infrastructure"?
@mark-vieira Sorry, totally missed your comment: see the linked infra issue on the "different testing infrastructure" point :)
There's nothing we can do on the ES side here, which is why I'm closing this. We have raised this with the infra team. In case anyone is hitting this issue in test triage (Debian CI workers only), feel free to just ignore it.
The following build: https://scans.gradle.com/s/4prwec7zf6pba/ failed in a very strange way.
Seemingly all nodes of the internal test cluster keep getting stuck in accept calls on non-blocking server sockets.
The build log is full of failed connections and the following stuck thread reporting:
It is not immediately clear to me how we could get into these calls blocking. It doesn't seem to be a deadlock on some selector lock, since no thread leaks are reported on the failing tests (though it could be that our interrupting the node's thread pools clears up all the stuck sys calls). So far this seems to be a one-time thing as far as I can see.
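For reference, a minimal sketch of how a non-blocking accept is expected to behave: after configureBlocking(false), accept() should return immediately (null when no connection is pending) rather than ever blocking inside the accept0 syscall, which is where the hung threads in the report sit. Names here are illustrative only:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingAcceptDemo {
    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.configureBlocking(false);
            server.bind(new InetSocketAddress(0)); // bind to an ephemeral port

            // On a healthy system this call returns at once; with nobody connecting
            // it yields null instead of blocking in the kernel.
            SocketChannel accepted = server.accept();
            System.out.println("pending connection: " + (accepted != null));
        }
    }
}
```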