
Multiple Test Failures from Blocked accept0 Syscalls (Debian CI Runs Only) #43387


Closed
original-brownbear opened this issue Jun 19, 2019 · 66 comments
Assignees
Labels
:Distributed Coordination/Network Http and internode communication implementations >test-failure Triaged test failures from CI

Comments

@original-brownbear
Member

The following build: https://scans.gradle.com/s/4prwec7zf6pba/ failed in a very strange way.
Seemingly all nodes of the internal test cluster keep getting stuck in accept calls on non-blocking server sockets.
The build log is full of failed connections and the following stuck thread reporting:

1> [2019-06-19T16:22:22,985][WARN ][o.e.t.n.MockNioTransport ] [node_sm1] Potentially blocked execution on network thread [elasticsearch[node_sm1][transport_worker][T#2]] [2401 milliseconds]:
--
1> sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
1> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
1> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory$$Lambda$1624/1917852998.run(Unknown Source)
1> java.security.AccessController.doPrivileged(Native Method)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory.accept(ChannelFactory.java:223)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:180)
1> org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55)
1> org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47)
1> org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45)
1> org.elasticsearch.transport.nio.TestEventHandler.acceptChannel(TestEventHandler.java:51)
1> org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227)
1> org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
1> org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
1> org.elasticsearch.nio.NioSelectorGroup$$Lambda$1545/393896253.run(Unknown Source)
1> java.lang.Thread.run(Thread.java:748)

It is not immediately clear to me how these calls could end up blocking. It doesn't seem to be a deadlock on some selector lock, since no thread leaks are reported on the failing tests (though it could be that interrupting the node's thread pools clears up all the stuck syscalls). So far this seems to be a one-time thing as far as I can see.
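
For context, here is a minimal plain-JDK sketch (not Elasticsearch code) of the behavior we rely on: accept() on a ServerSocketChannel in non-blocking mode is documented to return immediately, returning null when no connection is pending, so the accept0 frame in the warning above should never stall a selector thread for seconds.

import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingAcceptDemo {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);                      // non-blocking mode
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        // With no pending connection, a non-blocking accept() returns null
        // immediately instead of blocking inside the accept0 native call.
        SocketChannel channel = server.accept();
        System.out.println("accept() returned: " + channel);  // expected: null
        server.close();
    }
}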

@original-brownbear original-brownbear added :Distributed Coordination/Network Http and internode communication implementations >test-failure Triaged test failures from CI labels Jun 19, 2019
@original-brownbear original-brownbear self-assigned this Jun 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@henningandersen
Contributor

I am not sure, but this might be related to #37543, where we also had threads hanging inside accept for no apparent reason. #39031 tried to improve this, but there were still issues after that.

@original-brownbear
Member Author

@henningandersen it seems the linked issue only affects 6.x though, right? This happened on 7.x, where we have a different network implementation based on non-blocking IO.

That said, I'm wondering if our close logic, which potentially closes channels outside the transport thread, could be at fault here again ... it seems it is possible for us to close the selector on a non-transport thread, which may lock when simultaneously accepting a connection. What is very weird about this, though, is that it happens across multiple threads :D If this were just one thread that might be plausible ... but all transport threads locking up simultaneously is hard to explain that way.

@original-brownbear
Member Author

original-brownbear commented Jun 19, 2019

One interesting find regarding the close behavior I came across just now: if I block for about 10s in org.elasticsearch.common.concurrent.CompletableContext#addListener (inside the listener), then my selector loop goes into a state where, it seems, it can't even be interrupted:

Jun 19, 2019 9:29:22 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
WARNING: Will linger awaiting termination of 1 leaked thread(s).
Jun 19, 2019 9:29:27 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
SEVERE: 1 thread leaked from SUITE scope at org.elasticsearch.action.admin.HotThreadsIT: 
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
        at java.base@12/sun.nio.ch.KQueue.poll(Native Method)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:122)
        at java.base@12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base@12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:162)
        at app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
        at app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1602/0x00000008016cd840.run(Unknown Source)
        at java.base@12/java.lang.Thread.run(Thread.java:835)
Jun 19, 2019 9:29:27 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
INFO: Starting to interrupt leaked threads:
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
Jun 19, 2019 9:29:30 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
SEVERE: There are still zombie threads that couldn't be terminated:
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
        at java.base@12/sun.nio.ch.IOUtil.drain(Native Method)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.clearInterrupt(KQueueSelectorImpl.java:298)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.processEvents(KQueueSelectorImpl.java:231)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:139)
        at java.base@12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base@12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:162)
        at app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
        at app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1602/0x00000008016cd840.run(Unknown Source)
        at java.base@12/java.lang.Thread.run(Thread.java:835)

com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.elasticsearch.action.admin.HotThreadsIT: 
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
        at java.base@12/sun.nio.ch.KQueue.poll(Native Method)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:122)
        at java.base@12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base@12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:162)
        at app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
        at app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1602/0x00000008016cd840.run(Unknown Source)
        at java.base@12/java.lang.Thread.run(Thread.java:835)

	at __randomizedtesting.SeedInfo.seed([A17BF834B35181F4]:0)

REPRODUCE WITH: ./gradlew null --tests "org.elasticsearch.action.admin.HotThreadsIT" -Dtests.seed=A17BF834B35181F4 -Dtests.locale=en-DE -Dtests.timezone=Europe/Berlin

com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated:
   1) Thread[id=36, name=elasticsearch[node_sc3][transport_worker][T#2], state=RUNNABLE, group=TGRP-HotThreadsIT]
        at java.base@12/sun.nio.ch.IOUtil.drain(Native Method)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.clearInterrupt(KQueueSelectorImpl.java:298)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.processEvents(KQueueSelectorImpl.java:231)
        at java.base@12/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:139)
        at java.base@12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base@12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:162)
        at app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
        at app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1602/0x00000008016cd840.run(Unknown Source)
        at java.base@12/java.lang.Thread.run(Thread.java:835)

	at __randomizedtesting.SeedInfo.seed([A17BF834B35181F4]:0)

(Note that this is OSX, though, and KQueue might show different behavior here than epoll.) I'll have to look into this some more. A short pause causing this isn't great either, since we can always have such pauses on CI ... still, this might not be related.
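
To illustrate the general hazard with a standalone sketch (plain JDK code written under my own assumptions, not our NioSelector): anything that blocks inside the event-handling path runs on the selector thread itself, so while a listener sleeps, that selector processes no further accepts or wakeups. The uninterruptible IOUtil.drain state above looks like an additional, KQueue-specific wrinkle on top of that.

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class BlockingOnSelectorThreadDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        server.register(selector, SelectionKey.OP_ACCEPT);
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
        long t0 = System.nanoTime();

        Thread loop = new Thread(() -> {
            try {
                while (true) {
                    selector.select();
                    for (SelectionKey key : selector.selectedKeys()) {
                        if (key.isAcceptable()) {
                            SocketChannel accepted = server.accept();
                            System.out.printf("accepted %s after %d ms%n",
                                accepted, (System.nanoTime() - t0) / 1_000_000);
                            // Simulate a close listener blocking on the selector thread:
                            // while we sleep here this selector handles no further events,
                            // so any other pending connection just sits in the backlog.
                            Thread.sleep(10_000);
                            if (accepted != null) {
                                accepted.close();
                            }
                        }
                    }
                    selector.selectedKeys().clear();
                }
            } catch (Exception e) {
                // demo only: fall out of the loop
            }
        }, "demo-selector-loop");
        loop.setDaemon(true);
        loop.start();

        // Two clients connect almost at once; the second accept only happens
        // after the 10 second block on the selector thread has elapsed.
        SocketChannel c1 = SocketChannel.open(new InetSocketAddress("127.0.0.1", port));
        SocketChannel c2 = SocketChannel.open(new InetSocketAddress("127.0.0.1", port));
        Thread.sleep(15_000);
        c1.close();
        c2.close();
    }
}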

@henningandersen
Contributor

Yeah, I am also in doubt, but I see multiple overlapping symptoms:

  • Happened to multiple threads in the JVM.
  • Accept is involved.
  • Close is maybe involved here too.

But clearly there are many differences (network implementation, blocking vs. non-blocking I/O, and more). I am not saying it is related, but I wanted to bring it up in case it sparks new ideas about where the problem lies.

@original-brownbear
Member Author

Happened to multiple threads in the JVM.

This is by far the weirdest and most suspicious part. Each of those threads comes with its own selector, and I don't see those transport threads ever sharing any monitor ... but I also don't know of any system/kernel/... issue that would cause this behavior.

@DaveCTurner
Contributor

@original-brownbear
Member Author

#43424 should help track this down quickly I hope :)

@henningandersen
Contributor

The two new failures and the original one are all on Debian 8, just like the similar failures in #24457.

@original-brownbear
Member Author

original-brownbear commented Jun 20, 2019

@atorok @mark-vieira do you know if anything changed about our Debian 8 environment lately that would explain this issue now coming up (a change in kernel version, or the environment generally coming back)?

EDIT: I guess this is also a function of the CPUs used ... probably impossible to track down in a cloud environment ... sorry for the noise.

@henningandersen
Contributor

Debian 8 had a new 3.16.0-9 kernel on 2019-06-17. I think this matches when we first saw this?

@original-brownbear
Member Author

@henningandersen yeah, at least possibly; the first observed case was on the 19th (yesterday) as far as I can see.

@original-brownbear
Member Author

Maybe we should take Debian 8 out of rotation?

@henningandersen
Contributor

#24457 has failures from June 14th, so not too conclusive on that part.

@original-brownbear
Member Author

Still, the fact that it's Debian 8 only makes me not want to dig into the codebase too much here. With the rate of failures we're seeing that is hardly a coincidence IMO.

@mark-vieira
Contributor

@atorok @mark-vieira do you know if anything changed about our Debian 8 environment lately that would explain this issue now coming up?

Not that I'm aware of. Perhaps on the infra side but certainly not in the Elasticsearch-specific CI images.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 21, 2019
* Relates elastic#43387 which appears to run into blocking accept calls
@original-brownbear
Member Author

@henningandersen the stuck threads are RUNNABLE (just in case you haven't seen that one yet :)):

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.2+multijob-unix-compatibility/os=debian-8/67/consoleText

@alpar-t
Contributor

alpar-t commented Jun 21, 2019

We can remove the stable label from Debian so we at least don't use it for PR and intake builds, to reduce the noise.
If our tests are a reflection of what we support, does that mean we should amend the support matrix too? Removing it from rotation temporarily is fine, but we should have a better understanding of if/how this would affect production use cases; if there's some kernel version we can't work with, we should document that and make sure CI has one that works.

Our CI workers did pick up the kernel upgrade:

3.16.0-9-amd64 #1 SMP Debian 3.16.68-2 (2019-06-17) x86_64 GNU/Linux

@original-brownbear
Member Author

@atorok it seems we have this issue pre and post that kernel upgrade, unfortunately.

The production code shouldn't be affected, since we run a different networking implementation in tests (production uses Netty, the tests use our own NIO implementation). This assumes there isn't an issue here that breaks accept0 in principle. So far I have literally not found a single report of accept0 hanging in Netty or Java NIO that looks like the problem we're running into, so I'm a little less worried that there is a production issue here.

I'll try to see if failing REST tests on Debian 8 reveal anything (let's see whether we have a lot of timeouts on network messages there).

Whether or not to take it out of intake builds I don't know. It fails maybe twice a day, and the failures should be easily traceable to this issue. I'm not sure how bad that bit of noise is.

original-brownbear added a commit that referenced this issue Jun 21, 2019
* Assert ServerSocketChannel is not Blocking

* Relates #43387 which appears to run into blocking accept calls
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 21, 2019
* Assert ServerSocketChannel is not Blocking

* Relates elastic#43387 which appears to run into blocking accept calls
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 21, 2019
* Assert ServerSocketChannel is not Blocking

* Relates elastic#43387 which appears to run into blocking accept calls
@henningandersen
Contributor

Since this is currently not actionable (we have no recent failures), we decided to close this. It can be reopened if this reoccurs once #47118 has ensured we are back to the same frequency of Debian runs.

Note that a new Debian 8 kernel is out. It is highly uncertain, though, whether it fixes this issue...

@original-brownbear
Member Author

@alpar-t
Contributor

alpar-t commented Nov 11, 2019

A quick recap: we are fairly certain this only happens with our test networking infrastructure; we have not seen evidence that this could be a problem in production.
Even if some evidence did turn up, it would point to something wrong with the OS or low-level Java networking, and I'm not sure there would be much we could do about it.
Since this seems to be a test-only issue, we feel it's not worth an in-depth OS-level investigation; we'll just remove Debian 8 and 9 from the generic CI pool.
We will run specific Unix compatibility builds on these OSes, so we might still see the problem there.

We have seen similar problems, including on Ubuntu, that were related to the kernel audit framework and auditbeat. More specifically, auditbeat itself didn't seem to be at fault; any audit client calling into the kernel triggered it. We solved this at the time by reducing the allowed rate limit, which worked on Ubuntu.
I will run some tests to see if the audit settings have any effect.

@mark-vieira
Contributor

Since this seems to be a test-only issue, we feel it's not worth an in-depth OS-level investigation; we'll just remove Debian 8 and 9 from the generic CI pool.
We will run specific Unix compatibility builds on these OSes, so we might still see the problem there.

Then we need a way to ignore tests on certain operating system variants. Is this something that exists today? (I assume we have Windows-only or Linux-only tests.)

@alpar-t
Contributor

alpar-t commented Nov 12, 2019

We do have a way; it's clumsy, but we can comment the OS out of the test matrix.
It's not a good solution here, however, because we don't have a fix in sight, nor are we thinking of investing in one at this time. Completely stopping testing on the platform would be analogous to removing it from the support matrix, which we don't want to do.
The most granular we can get is to mute all internal cluster tests on a specific OS.

@alpar-t
Contributor

alpar-t commented Nov 12, 2019

From the testing so far, none of the failures encountered seem to relate to this issue.

@mark-vieira
Contributor

We do have a way; it's clumsy, but we can comment the OS out of the test matrix.

I was thinking more of some kind of annotation you'd place on an individual unit test.
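
Something along those lines could look like this (a rough sketch using plain JUnit's Assume; the tests.ci.os property is hypothetical, just to illustrate keying the skip off a value the CI workers would set, since os.name alone only reports "Linux"):

import static org.junit.Assume.assumeFalse;

import org.junit.Before;
import org.junit.Test;

public class OsConditionalTestSketch {

    // Hypothetical helper: skip on the distributions where the accept0 hang shows up.
    private static boolean onAffectedDebian() {
        // In CI we would more likely key this off a property the workers set,
        // e.g. -Dtests.ci.os=debian-8; os.name alone only tells us "Linux".
        String ciOs = System.getProperty("tests.ci.os", "");
        return ciOs.equals("debian-8") || ciOs.equals("debian-9");
    }

    @Before
    public void skipOnAffectedOs() {
        assumeFalse("muted on Debian 8/9, see #43387", onAffectedDebian());
    }

    @Test
    public void someInternalClusterTest() {
        // test body runs everywhere except on the excluded workers
    }
}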

@alpar-t
Contributor

alpar-t commented Nov 13, 2019

After more than 24 hours of runs, the problem did not reproduce even once.
I'm giving it one last try on new workers, all with the default audit settings:

master:

7.x:

@original-brownbear
Member Author

original-brownbear commented Nov 13, 2019

@atorok that would confirm my suspicion that this is a GCP worker/box specific problem in some form.

I had a few cases where I could repeatedly reproduce it on one worker but failed to reproduce it over hours on another. Thinking about this some more, I wonder if this may just be a GCP problem with their Debian kernels. Could we maybe build Debian images some other way (using different kernels?) to test Debian?

What it boils down to in the end is still this: a syscall that is non-blocking (according to the docs and whatnot :)) blocks in one specific environment. We verified that it actually is blocking by adding logging that ensures we're not just calling accept0 in a tight loop (sketched below). We are also not doing anything non-standard that might cause a deadlock, as far as I can see (and a lot of time was spent trying to find issues here ...). Kernel logs didn't show anything in the failed runs I was able to reproduce, and neither did netstat -s. Maybe we should give up here and report this to GCP with what limited intel we have on the matter?
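
For reference, the kind of timing check I mean (a rough sketch, not the exact logging we added): time the individual accept() call itself, which distinguishes "one call blocked for seconds inside accept0" from "we called accept over and over in a loop".

import java.io.IOException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

final class TimedAccept {

    private static final long WARN_NANOS = 200_000_000L; // 200ms, arbitrary threshold

    /**
     * Accept on a non-blocking server channel and warn if the single call
     * itself took unexpectedly long. On a healthy system this call should
     * return (usually with null) in microseconds.
     */
    static SocketChannel acceptTimed(ServerSocketChannel server) throws IOException {
        assert !server.isBlocking() : "server channel unexpectedly in blocking mode";
        long start = System.nanoTime();
        SocketChannel accepted = server.accept();
        long elapsed = System.nanoTime() - start;
        if (elapsed > WARN_NANOS) {
            System.err.printf("single accept() call took %d ms%n", elapsed / 1_000_000);
        }
        return accepted;
    }
}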

@alpar-t
Contributor

alpar-t commented Nov 13, 2019

I agree. I also observed the same behavior. It might have a timing aspect to it as well: I noticed that if it failed on one VM it was more likely to fail on others too, but not on all of them. I noticed this when testing against multiple VMs at the same time.

@alpar-t
Contributor

alpar-t commented Nov 13, 2019

Since we already removed Debian 8 and 9 from the general pool, I'm just going to close this.

@alpar-t alpar-t closed this as completed Nov 13, 2019
@alpar-t
Contributor

alpar-t commented Nov 13, 2019

Here's the GCP issue: https://issuetracker.google.com/issues/144375215

@mark-vieira
Contributor

This continues to affect internal cluster test suites on Debian 8. Since that OS continues to be part of our support matrix, we need to either fix the underlying issue (which seems unlikely) or ignore these tests.

@pgomulka
Contributor

pgomulka commented Jan 7, 2020

Failed today on 7.x and Debian:
https://gradle-enterprise.elastic.co/s/mu6k6btb2dxna

@original-brownbear
Member Author

I've raised this with Infra now. Maybe we can fix this by simply moving the Debian tests to different testing infrastructure.

@Tim-Brooks
Contributor

I just want to note that when I looked into this yesterday and ran a bunch of searches in Gradle Enterprise, it looked to me like Debian 8 and Debian 9 had the characteristic consistent failures due to things hanging (MockNioTransport logs it explicitly; Netty and NIO transport instances just mysteriously hung). I did not see these types of failures for Debian 10. I'm not sure if there is something obviously different in how the infrastructure is set up there.

@mark-vieira
Contributor

Maybe we can fix this by simply moving the Debian tests to different testing infrastructure.

Can you elaborate on what you mean by "different testing infrastructure"?

@original-brownbear
Member Author

@mark-vieira Sorry, I totally missed your comment: see the linked infra issue on the "different testing infrastructure" point :)
Sadly, I don't have a clear definition of how different it has to be. Testing on a different platform/cloud provider seems the easiest solution, though, as I suggest in that issue.

@ywelsch
Contributor

ywelsch commented Feb 18, 2020

There's nothing we can do on the ES side here, which is why I'm closing this. We have raised it with the infra team. In case anyone hits this issue during test triage (Debian CI workers only), feel free to just ignore it.
