
Multiple Test Failures from Blocked accept0 Syscalls (Debian CI Runs Only) #43387


Closed
original-brownbear opened this issue Jun 19, 2019 · 66 comments
Assignees
Labels
:Distributed Coordination/Network Http and internode communication implementations >test-failure Triaged test failures from CI

Comments

@original-brownbear
Member

The following build: https://scans.gradle.com/s/4prwec7zf6pba/ failed in a very strange way.
Seemingly all nodes of the internal test cluster keep getting stuck in accept calls on non-blocking server sockets.
The build log is full of failed connections and the following stuck thread reporting:

1> [2019-06-19T16:22:22,985][WARN ][o.e.t.n.MockNioTransport ] [node_sm1] Potentially blocked execution on network thread [elasticsearch[node_sm1][transport_worker][T#2]] [2401 milliseconds]:
--
1> sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
1> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
1> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory$$Lambda$1624/1917852998.run(Unknown Source)
1> java.security.AccessController.doPrivileged(Native Method)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory.accept(ChannelFactory.java:223)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:180)
1> org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55)
1> org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47)
1> org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45)
1> org.elasticsearch.transport.nio.TestEventHandler.acceptChannel(TestEventHandler.java:51)
1> org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227)
1> org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
1> org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
1> org.elasticsearch.nio.NioSelectorGroup$$Lambda$1545/393896253.run(Unknown Source)
1> java.lang.Thread.run(Thread.java:748)

It is not immediately clear to me how these calls could end up blocking. It doesn't seem to be a deadlock on some selector lock, since no thread leaks are reported on the failing tests (though it could be that interrupting the node's thread pools clears up all the stuck syscalls). So far this seems to be a one-time thing as far as I can see.
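
For context, here is a minimal plain-JDK sketch (not Elasticsearch code) of the behavior we rely on: accept() on a ServerSocketChannel in non-blocking mode is documented to return immediately, returning null when no connection is pending, so the accept0 frame in the warning above should never stall a selector thread for seconds.

import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingAcceptDemo {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);                      // non-blocking mode
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        // With no pending connection, a non-blocking accept() returns null
        // immediately instead of blocking inside the accept0 native call.
        SocketChannel channel = server.accept();
        System.out.println("accept() returned: " + channel);  // expected: null
        server.close();
    }
}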

@original-brownbear original-brownbear added :Distributed Coordination/Network Http and internode communication implementations >test-failure Triaged test failures from CI labels Jun 19, 2019
@original-brownbear original-brownbear self-assigned this Jun 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@henningandersen
Contributor

I am not sure, but this might be related to #37543, where we also had threads hanging inside accept for no apparent reason. #39031 tried to improve this, but there were still issues after that.

@original-brownbear
Member Author

@henningandersen it seems the linked issue only affects 6.x though, right? This happened on 7.x, where we have a different network implementation based on non-blocking IO.

That said, I'm wondering if our close logic, which potentially closes channels outside the transport thread, could be at fault here again ... it seems it is possible for us to close the selector on a non-transport thread, which may lock when simultaneously accepting a connection. What is very weird about this, though, is that it happens across multiple threads :D If this were just one thread that might be plausible ... but all transport threads locking up simultaneously is hard to explain that way.

@original-brownbear
Member Author

original-brownbear commented Jun 19, 2019

One interesting find regarding the close behavior I came across just now: if I block for about 10s in org.elasticsearch.common.concurrent.CompletableContext#addListener (inside the listener), then my selector loop goes into a state where, it seems, it can't even be interrupted:

Jun 19, 2019 9:29:22 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
WARNING: Will linger awaiting termination of 1 leaked thread(s).
Jun 19, 2019 9:29:27 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
SEVERE: 1 thread leaked from SUITE scope at org.elasticsearch.action.admin.HotThreadsIT: 
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
        at java.base@12/sun.nio.ch.KQueue.poll(Native Method)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:122)
        at java.base@12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base@12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:162)
        at app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
        at app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1602/0x00000008016cd840.run(Unknown Source)
        at java.base@12/java.lang.Thread.run(Thread.java:835)
Jun 19, 2019 9:29:27 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
INFO: Starting to interrupt leaked threads:
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
Jun 19, 2019 9:29:30 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
SEVERE: There are still zombie threads that couldn't be terminated:
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
        at java.base@12/sun.nio.ch.IOUtil.drain(Native Method)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.clearInterrupt(KQueueSelectorImpl.java:298)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.processEvents(KQueueSelectorImpl.java:231)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:139)
        at java.base@12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base@12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:162)
        at app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
        at app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1602/0x00000008016cd840.run(Unknown Source)
        at java.base@12/java.lang.Thread.run(Thread.java:835)

com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.elasticsearch.action.admin.HotThreadsIT: 
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
        at java.base@12/sun.nio.ch.KQueue.poll(Native Method)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:122)
        at java.base@12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base@12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:162)
        at app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
        at app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1602/0x00000008016cd840.run(Unknown Source)
        at java.base@12/java.lang.Thread.run(Thread.java:835)

	at __randomizedtesting.SeedInfo.seed([A17BF834B35181F4]:0)

REPRODUCE WITH: ./gradlew null --tests "org.elasticsearch.action.admin.HotThreadsIT" -Dtests.seed=A17BF834B35181F4 -Dtests.locale=en-DE -Dtests.timezone=Europe/Berlin

com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated:
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
        at java.base@12/sun.nio.ch.IOUtil.drain(Native Method)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.clearInterrupt(KQueueSelectorImpl.java:298)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.processEvents(KQueueSelectorImpl.java:231)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:139)
        at java.base@12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base@12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:162)
        at app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
        at app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1602/0x00000008016cd840.run(Unknown Source)
        at java.base@12/java.lang.Thread.run(Thread.java:835)

	at __randomizedtesting.SeedInfo.seed([A17BF834B35181F4]:0)

(Note that this is OSX, though, and KQueue might show different behavior here than epoll.) I'll have to look into this some more. A short pause causing this isn't great either, since we can always have such pauses on CI ... still, this might not be related.
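
To illustrate the general hazard with a standalone sketch (plain JDK code written under my own assumptions, not our NioSelector): anything that blocks inside the event-handling path runs on the selector thread itself, so while a listener sleeps, that selector processes no further accepts or wakeups. The uninterruptible IOUtil.drain state above looks like an additional, KQueue-specific wrinkle on top of that.

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class BlockingOnSelectorThreadDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        server.register(selector, SelectionKey.OP_ACCEPT);
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
        long t0 = System.nanoTime();

        Thread loop = new Thread(() -> {
            try {
                while (true) {
                    selector.select();
                    for (SelectionKey key : selector.selectedKeys()) {
                        if (key.isAcceptable()) {
                            SocketChannel accepted = server.accept();
                            System.out.printf("accepted %s after %d ms%n",
                                accepted, (System.nanoTime() - t0) / 1_000_000);
                            // Simulate a close listener blocking on the selector thread:
                            // while we sleep here this selector handles no further events,
                            // so any other pending connection just sits in the backlog.
                            Thread.sleep(10_000);
                            if (accepted != null) {
                                accepted.close();
                            }
                        }
                    }
                    selector.selectedKeys().clear();
                }
            } catch (Exception e) {
                // demo only: fall out of the loop
            }
        }, "demo-selector-loop");
        loop.setDaemon(true);
        loop.start();

        // Two clients connect almost at once; the second accept only happens
        // after the 10 second block on the selector thread has elapsed.
        SocketChannel c1 = SocketChannel.open(new InetSocketAddress("127.0.0.1", port));
        SocketChannel c2 = SocketChannel.open(new InetSocketAddress("127.0.0.1", port));
        Thread.sleep(15_000);
        c1.close();
        c2.close();
    }
}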

@henningandersen
Contributor

Yeah, I am also in doubt, but I see multiple overlapping symptoms:

  • Happened to multiple threads in the JVM.
  • Accept is involved.
  • Close is maybe involved here too.

But clearly there are many differences (network implementation, blocking vs. non-blocking I/O, and more). I am not saying it is related, but I wanted to bring it up in case it sparks new ideas about where the problem lies.

@original-brownbear
Member Author

Happened to multiple threads in the JVM.

This is by far the weirdest and most suspicious part. Each of those threads comes with its own selector, and I don't see those transport threads ever sharing any monitor ... but I also don't know of any system/kernel/... issue that would cause this behavior.

@DaveCTurner
Contributor

@original-brownbear
Member Author

#43424 should help track this down quickly I hope :)

@henningandersen
Contributor

The two new failures and the original one are all on Debian 8, just like the similar failures in #24457.

@original-brownbear
Member Author

original-brownbear commented Jun 20, 2019

@atorok @mark-vieira do you know if anything changed about our Debian 8 environment lately that would explain this issue now coming up (a change in kernel version, or the environment generally coming back)?

EDIT: I guess this is also a function of the CPUs used ... probably impossible to track down in a cloud environment ... sorry for the noise.

@henningandersen
Contributor

Debian 8 had a new 3.16.0-9 kernel on 2019-06-17. I think this matches when we first saw this?

@original-brownbear
Member Author

@henningandersen yeah, at least possibly; the first observed case was on the 19th (yesterday) as far as I can see.

@original-brownbear
Member Author

Maybe we should take Debian 8 out of rotation?

@henningandersen
Contributor

#24457 has failures from June 14th, so not too conclusive on that part.

@original-brownbear
Member Author

Still, the fact that it's Debian 8 only makes me not want to dig into the codebase too much here. With the rate of failures we're seeing that is hardly a coincidence IMO.

@mark-vieira
Contributor

@atorok @mark-vieira do you know if anything changed about our Debian 8 environment lately that would explain this issue now coming up?

Not that I'm aware of. Perhaps on the infra side but certainly not in the Elasticsearch-specific CI images.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 21, 2019
* Relates elastic#43387 which appears to run into blocking accept calls
@original-brownbear
Member Author

@henningandersen the stuck threads are RUNNABLE (just in case you haven't seen that one yet :)):

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.2+multijob-unix-compatibility/os=debian-8/67/consoleText

@alpar-t
Contributor

alpar-t commented Jun 21, 2019

We can remove the stable label from Debian so we at least don't use it for PR and intake builds, to reduce the noise.
If our tests are a reflection of what we support, does that mean we should amend the support matrix too? Removing it from rotation temporarily is fine, but we should have a better understanding of if/how this would affect production use cases; if there's some kernel version we can't work with, we should document that and make sure CI has one that works.

Our CI workers did pick up the kernel upgrade:

3.16.0-9-amd64 #1 SMP Debian 3.16.68-2 (2019-06-17) x86_64 GNU/Linux

@original-brownbear
Member Author

@atorok it seems we have this issue pre and post that kernel upgrade, unfortunately.

The production code shouldn't be affected, since we run a different networking implementation in tests (production uses Netty, the tests use our own NIO implementation). This assumes there isn't an issue here that breaks accept0 in principle. So far I have literally not found a single report of accept0 hanging in Netty or Java NIO that looks like the problem we're running into, so I'm a little less worried that there is a production issue here.

I'll try to see if failing REST tests on Debian 8 reveal anything (let's see whether we have a lot of timeouts on network messages there).

Whether or not to take it out of intake builds I don't know. It fails maybe twice a day, and the failures should be easily traceable to this issue. I'm not sure how bad that bit of noise is.

original-brownbear added a commit that referenced this issue Jun 21, 2019
* Assert ServerSocketChannel is not Blocking

* Relates #43387 which appears to run into blocking accept calls
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 21, 2019
* Assert ServerSocketChannel is not Blocking

* Relates elastic#43387 which appears to run into blocking accept calls
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 21, 2019
* Assert ServerSocketChannel is not Blocking

* Relates elastic#43387 which appears to run into blocking accept calls
@henningandersen
Contributor

Since this is currently not actionable (we have no recent failures), we decided to close this. It can be reopened if this reoccurs once #47118 has ensured we are back to the same frequency of Debian runs.

Note that a new Debian 8 kernel is out. It is highly uncertain, though, whether it fixes this issue...

@original-brownbear
Member Author

@alpar-t
Contributor

alpar-t commented Nov 11, 2019

A quick recap: we are fairly certain this only happens with our test networking infrastructure; we have not seen evidence that this could be a problem in production.
Even if some evidence did turn up, it would point to something wrong with the OS or low-level Java networking, and I'm not sure there would be much we could do about it.
Since this seems to be a test-only issue, we feel it's not worth an in-depth OS-level investigation; we'll just remove Debian 8 and 9 from the generic CI pool.
We will run specific Unix compatibility builds on these OSes, so we might still see the problem there.

We have seen similar problems, including on Ubuntu, that were related to the kernel audit framework and auditbeat. More specifically, auditbeat itself didn't seem to be at fault; any audit client calling into the kernel triggered it. We solved this at the time by reducing the allowed rate limit, which worked on Ubuntu.
I will run some tests to see if the audit settings have any effect.

@mark-vieira
Contributor

Since this seems to be a test-only issue, we feel it's not worth an in-depth OS-level investigation; we'll just remove Debian 8 and 9 from the generic CI pool.
We will run specific Unix compatibility builds on these OSes, so we might still see the problem there.

Then we need a way to ignore tests on certain operating system variants. Is this something that exists today? (I assume we have Windows-only or Linux-only tests.)

@alpar-t
Contributor

alpar-t commented Nov 12, 2019

We do have a way; it's clumsy, but we can comment the OS out of the test matrix.
It's not a good solution here, however, because we don't have a fix in sight, nor are we thinking of investing in one at this time. Completely stopping testing on the platform would be analogous to removing it from the support matrix, which we don't want to do.
The most granular we can get is to mute all internal cluster tests on a specific OS.

@alpar-t
Contributor

alpar-t commented Nov 12, 2019

From the testing so far, none of the failures encountered seem to relate to this issue.

@mark-vieira
Contributor

We do have a way; it's clumsy, but we can comment the OS out of the test matrix.

I was thinking more of some kind of annotation you'd place on an individual unit test.
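
Something along those lines could look like this (a rough sketch using plain JUnit's Assume; the tests.ci.os property is hypothetical, just to illustrate keying the skip off a value the CI workers would set, since os.name alone only reports "Linux"):

import static org.junit.Assume.assumeFalse;

import org.junit.Before;
import org.junit.Test;

public class OsConditionalTestSketch {

    // Hypothetical helper: skip on the distributions where the accept0 hang shows up.
    private static boolean onAffectedDebian() {
        // In CI we would more likely key this off a property the workers set,
        // e.g. -Dtests.ci.os=debian-8; os.name alone only tells us "Linux".
        String ciOs = System.getProperty("tests.ci.os", "");
        return ciOs.equals("debian-8") || ciOs.equals("debian-9");
    }

    @Before
    public void skipOnAffectedOs() {
        assumeFalse("muted on Debian 8/9, see #43387", onAffectedDebian());
    }

    @Test
    public void someInternalClusterTest() {
        // test body runs everywhere except on the excluded workers
    }
}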

@alpar-t
Contributor

alpar-t commented Nov 13, 2019

After more than 24 hours of runs, the problem did not reproduce even once.
I'm giving it one last try on new workers, all with the default audit settings:

master:

7.x:

@original-brownbear
Member Author

original-brownbear commented Nov 13, 2019

@atorok that would confirm my suspicion that this is a GCP worker/box specific problem in some form.

I had a few cases where I could repeatedly reproduce it on one worker but failed to reproduce it over hours on another. Thinking about this some more, I wonder if this may just be a GCP problem with their Debian kernels. Could we maybe build Debian images some other way (using different kernels?) to test Debian?

What it boils down to in the end is still this: a syscall that is non-blocking (according to the docs and whatnot :)) blocks in one specific environment. We verified that it actually is blocking by adding logging that ensures we're not just calling accept0 in a tight loop (sketched below). We are also not doing anything non-standard that might cause a deadlock, as far as I can see (and a lot of time was spent trying to find issues here ...). Kernel logs didn't show anything in the failed runs I was able to reproduce, and neither did netstat -s. Maybe we should give up here and report this to GCP with what limited intel we have on the matter?
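
For reference, the kind of timing check I mean (a rough sketch, not the exact logging we added): time the individual accept() call itself, which distinguishes "one call blocked for seconds inside accept0" from "we called accept over and over in a loop".

import java.io.IOException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

final class TimedAccept {

    private static final long WARN_NANOS = 200_000_000L; // 200ms, arbitrary threshold

    /**
     * Accept on a non-blocking server channel and warn if the single call
     * itself took unexpectedly long. On a healthy system this call should
     * return (usually with null) in microseconds.
     */
    static SocketChannel acceptTimed(ServerSocketChannel server) throws IOException {
        assert !server.isBlocking() : "server channel unexpectedly in blocking mode";
        long start = System.nanoTime();
        SocketChannel accepted = server.accept();
        long elapsed = System.nanoTime() - start;
        if (elapsed > WARN_NANOS) {
            System.err.printf("single accept() call took %d ms%n", elapsed / 1_000_000);
        }
        return accepted;
    }
}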

@alpar-t
Contributor

alpar-t commented Nov 13, 2019

I agree. I also observed the same behavior. It might have a timing aspect to it as well: I noticed that if it failed on one VM it was more likely to fail on others too, but not on all of them. I noticed this when testing against multiple VMs at the same time.

@alpar-t
Contributor

alpar-t commented Nov 13, 2019

Since we already removed Debian 8 and 9 from the general pool, I'm just going to close this.

@alpar-t alpar-t closed this as completed Nov 13, 2019
@alpar-t
Contributor

alpar-t commented Nov 13, 2019

Here's the GCP issue: https://issuetracker.google.com/issues/144375215

@mark-vieira
Contributor

This continues to affect internal cluster test suites on Debian 8. Since that OS continues to be part of our support matrix, we need to either fix the underlying issue (which seems unlikely) or ignore these tests.

@pgomulka
Contributor

pgomulka commented Jan 7, 2020

Failed today on 7.x and Debian:
https://gradle-enterprise.elastic.co/s/mu6k6btb2dxna

@original-brownbear
Member Author

I've raised this with Infra now. Maybe we can fix this by simply moving the Debian tests to different testing infrastructure.

@Tim-Brooks
Contributor

I just want to note that when I looked into this yesterday and ran a bunch of searches in Gradle Enterprise, it looked to me like Debian 8 and Debian 9 had the characteristic consistent failures due to things hanging (MockNioTransport logs it explicitly; Netty and NIO transport instances just mysteriously hung). I did not see these types of failures for Debian 10. I'm not sure if there is something obviously different in how the infrastructure is set up there.

@mark-vieira
Contributor

Maybe we can fix this by simply moving the Debian tests to different testing infrastructure.

Can you elaborate on what you mean by "different testing infrastructure"?

@original-brownbear
Member Author

@mark-vieira Sorry, I totally missed your comment: see the linked infra issue on the "different testing infrastructure" point :)
Sadly, I don't have a clear definition of how different it has to be. Testing on a different platform/cloud provider seems the easiest solution, though, as I suggest in that issue.

@ywelsch
Contributor

ywelsch commented Feb 18, 2020

There's nothing we can do on the ES side here, which is why I'm closing this. We have raised it with the infra team. In case anyone hits this issue during test triage (Debian CI workers only), feel free to just ignore it.
