[CI] RemoteClusterClientTests#testConnectAndExecuteRequest fails #41745

Closed
cbuescher opened this issue May 2, 2019 · 17 comments · Fixed by #44124 or #44939
Labels
:Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI

Comments

@cbuescher
Member

On master: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-9/369/console

Couldn't reproduce with:

./gradlew :server:test --tests "org.elasticsearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest" \
  -Dtests.seed=5BFCD458F15F6DA4 \
  -Dtests.security.manager=true \
  -Dtests.locale=tr \
  -Dtests.timezone=America/Atka \
  -Dcompiler.java=12 \
  -Druntime.java=11
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([5BFCD458F15F6DA4:248A7562ED2FB133]:0)
	at org.junit.Assert.fail(Assert.java:86)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.junit.Assert.assertTrue(Assert.java:52)
	at org.elasticsearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest(RemoteClusterClientTests.java:60)
@cbuescher cbuescher added :Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI labels May 2, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@cbuescher
Member Author

Also seems to be an occasional issue on 7.x
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/1248/console

@cbuescher cbuescher self-assigned this May 6, 2019
@cbuescher
Member Author

The assertion that's failing checks that a remote node is connected through the remote cluster service (which in this case comes from a mock, since this is an ESTestCase). The log shows test timeouts:

 1> [2019-05-02T02:40:58,730][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterClientTests] fetching nodes from external cluster [test] failed
  1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
  1> 	at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
  1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
  1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:687) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:834) [?:?]
  1> 	Suppressed: java.lang.IllegalStateException: Future got interrupted
  1> 		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
  1> 		at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
  1> 		at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]
  1> 		at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:80) ~[main/:?]
  1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:452) ~[main/:?]
  1> 		at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:8.0.0-SNAPSHOT]
  1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
  1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
  1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
  1> 		at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
  1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:687) [main/:?]
  1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1> 		at java.lang.Thread.run(Thread.java:834) [?:?]
  1> 	Caused by: java.lang.InterruptedException
  1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1040) ~[?:?]
  1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345) ~[?:?]

I can trigger this kind of error by adding a wait of more than 30s in RemoteClusterConnection#collectRemoteNodes() in the IOInterruptible function passed there, but I wonder why such a long wait could happen in practice, especially since this isn't an integration test.
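
To make that reproduction idea concrete, here is a minimal, self-contained JDK sketch of the failure shape (the class name and timings are illustrative, not the actual test or Elasticsearch code): a worker thread blocks on a connection future, the connect attempt is cancelled after its deadline, and the interrupt surfaces as the "Future got interrupted" IllegalStateException seen in the logs.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: a connect worker blocked on a transport future is interrupted
// when the connect handler gives up (30s in the real test, shortened here to 2s).
public class SlowConnectSketch {
    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> connectionOpened = new CompletableFuture<>();

        Thread connectWorker = new Thread(() -> {
            try {
                // Stand-in for blocking on the connection future inside the connect step.
                connectionOpened.get(60, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                // This is the state the CI logs show: interrupted while still waiting.
                throw new IllegalStateException("Future got interrupted", e);
            } catch (ExecutionException | TimeoutException e) {
                throw new RuntimeException(e);
            }
        }, "connect-worker");

        connectWorker.start();
        TimeUnit.SECONDS.sleep(2);   // the real connect timeout is 30s; shortened for the sketch
        connectWorker.interrupt();   // simulates the connect handler being closed/cancelled
        connectWorker.join();
    }
}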

There has been another instance of this on 5.4 (https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-9/372/console), which also shows the same kind of timeout.

@cbuescher
Member Author

This part of the stack trace shows that the ConnectionManager hangs somewhere while opening the internal connection. The future that gets interrupted is a listener passed to Transport#openConnection(), which in this case is probably the MockNioTransport.

Suppressed: java.lang.IllegalStateException: Future got interrupted
  1> 		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
  1> 		at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
  1> 		at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]
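
For context, the pattern that excerpt points at can be sketched in plain Java roughly as follows (a hedged illustration, not the Elasticsearch source; all names here are made up): the connection manager hands the transport an asynchronous callback, bridges it into a future, and blocks the calling thread on that future, so an interrupt while the transport never answers surfaces exactly as "Future got interrupted".

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.Consumer;

// Hedged sketch of the listener-to-future bridge described above; all names are illustrative.
class OpenConnectionSketch {

    // Stand-in for an asynchronous openConnection(listener) on the transport: the callback is
    // invoked from a network thread, possibly never, if channel setup stalls.
    static void openConnectionAsync(Consumer<String> onConnected) {
        // ... asynchronous channel setup would happen here ...
    }

    // Stand-in for the blocking caller: wrap the callback in a future and block the calling
    // thread until the transport answers (or the thread is interrupted).
    static String openConnectionBlocking() {
        CompletableFuture<String> connected = new CompletableFuture<>();
        openConnectionAsync(connected::complete);
        try {
            return connected.get(); // blocks; an interrupt here becomes "Future got interrupted"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Future got interrupted", e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }
}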

@droberts195
Contributor

The same problem as #41745 (comment) occurred in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/1360/, although the failing test was RemoteClusterConnectionTests.testNodeDisconnected rather than RemoteClusterClientTests.testConnectAndExecuteRequest.

14:29:21   1> [2019-05-08T08:29:21,244][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterConnectionTests] fetching nodes from external cluster [test-cluster] failed
14:29:21   1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
14:29:21   1> 	at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1> 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
14:29:21   1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
14:29:21   1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
14:29:21   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
14:29:21   1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
14:29:21   1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
14:29:21   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
14:29:21   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
14:29:21   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
14:29:21   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
14:29:21   1> 	Suppressed: java.lang.IllegalStateException: Future got interrupted
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:80) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:452) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
14:29:21   1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
14:29:21   1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
14:29:21   1> 		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
14:29:21   1> 		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
14:29:21   1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
14:29:21   1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
14:29:21   1> 		at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
14:29:21   1> 	Caused by: java.lang.InterruptedException
14:29:21   1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_212]
14:29:21   1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_212]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:57) ~[main/:?]
14:29:21   1> 		... 15 more

@cbuescher
Member Author

I talked to @original-brownbear, who has been investigating similar-looking problems elsewhere lately; as a next step he suggested using e.g. YourKit to identify potential locks and determine how much time threads spend waiting on them.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue May 9, 2019
* Follow up to elastic#39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates elastic#41745
@original-brownbear
Member

original-brownbear commented May 9, 2019

I opened #42000 to get us more useful logging for this situation.

original-brownbear added a commit that referenced this issue May 22, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates #41745
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue May 27, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to elastic#39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates elastic#41745
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to elastic#39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates elastic#41745
original-brownbear added a commit that referenced this issue May 27, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates #41745
@cbuescher
Member Author

cbuescher commented Jun 12, 2019

This appears to have failed again on master on May 4th (https://scans.gradle.com/s/ftmdnpa26cdty#tests) and on the 7.2 branch on May 26th (https://scans.gradle.com/s/vcnmxcdmlqsdu#tests). At least the latter might have the improved logging from #42572, although I'm not entirely sure that has been backported to the 7.2 branch.

@cbuescher
Member Author

Looks like the logging from #42572 wasn't backported to 7.2, unfortunately. I'll take a look to see if the logs reveal anything interesting anyway.

@cbuescher
Member Author

The test logs from the latest failure on 7.2:

1> [2019-05-25T21:12:38,915][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] before test
--
1> [2019-05-25T21:12:39,114][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:10304}, bound_addresses {[::1]:10303}, {127.0.0.1:10304}
1> [2019-05-25T21:12:39,145][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:10307}, bound_addresses {[::1]:10307}, {127.0.0.1:10307}
1> [2019-05-25T21:12:39,698][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] after test
1> [2019-05-25T21:12:39,711][INFO ][o.e.t.RemoteClusterClientTests] [testConnectAndExecuteRequest] before test
1> [2019-05-25T21:12:39,734][INFO ][o.e.t.TransportService   ] [testConnectAndExecuteRequest] publish_address {127.0.0.1:10304}, bound_addresses {[::1]:10303}, {127.0.0.1:10304}
1> [2019-05-25T21:12:39,749][INFO ][o.e.t.TransportService   ] [testConnectAndExecuteRequest] publish_address {127.0.0.1:10303}, bound_addresses {[::1]:10307}, {127.0.0.1:10303}
1> [2019-05-25T21:13:09,752][WARN ][o.e.t.RemoteClusterService] [testConnectAndExecuteRequest] failed to connect to remote clusters within 30s
1> [2019-05-25T21:13:09,755][WARN ][o.e.t.n.TestEventHandler ] [[elasticsearch[transport_worker][T#1]]] Slow execution on network thread [29984 milliseconds]
1> java.lang.RuntimeException: Slow exception on network thread
1> 	at org.elasticsearch.transport.nio.TestEventHandler.maybeLogElapsedTime(TestEventHandler.java:212) ~[framework-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:123) ~[framework-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:397) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1> [2019-05-25T21:13:09,851][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterClientTests] fetching nodes from external cluster [test] failed
1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
1> 	at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1> 	Suppressed: org.elasticsearch.transport.ConnectTransportException: [remote_node][127.0.0.1:10304] general node connection failure
1> 		at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:127) ~[main/:?]
1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:467) ~[main/:?]
1> 		at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:7.2.0-SNAPSHOT]
1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> 		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
1> 		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
1> 		at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1> 	Caused by: java.lang.IllegalStateException: handshake failed with {remote_node}{nt4h9qp_QbysVWlbcfs1Kg}{jeF8L4uhSvaOyCiOz8IwOw}{127.0.0.1}{127.0.0.1:10304}
1> 		at org.elasticsearch.transport.TransportService.handshake(TransportService.java:421) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1> 		at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1> 		... 12 more
1> 	Caused by: java.lang.IllegalStateException: Future got interrupted
1> 		at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:47) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.handshake(TransportService.java:419) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1> 		at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1> 		... 12 more
1> 	Caused by: java.lang.InterruptedException
1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_212]
1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_212]
1> 		at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252) ~[main/:?]
1> 		at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87) ~[main/:?]
1> 		at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:44) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.handshake(TransportService.java:419) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1> 		at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1> 		... 12 more
1> [2019-05-25T21:13:09,864][INFO ][o.e.t.RemoteClusterClientTests] [testConnectAndExecuteRequest] after test


@original-brownbear
Member

@cbuescher I'm on PTO starting effectively right now, so I might not get to this, but feel free to backport this to 7.2 to get a better shot at catching the deadlock here. If you backport the logging improvements, though, please make sure to backport exactly #42572 and not the underlying commit to master, since the backport contains a bug fix to this logging that is required to make it properly functional.

cbuescher pushed a commit that referenced this issue Jun 13, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates #41745
@cbuescher
Member Author

Backported #42572 to the 7.2 branch with 44f2db8 to get better logging there as well.

@cbuescher
Member Author

cbuescher commented Jul 8, 2019

Just checked again: we got another one, with potentially better logging, on master on June 24, 2019:
https://scans.gradle.com/s/b4qsaeqsirvbw/tests/nfxodyych4uas-6q6452dcdcklg

@original-brownbear
Member

@cbuescher this looks like a missed call to TransportService#acceptIncomingRequests somewhere; we're blocked on the latch that only gets unblocked by a call to that method.
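
For readers unfamiliar with that code path, the gating pattern being described works roughly like the sketch below (a hedged illustration, not the actual TransportService implementation; the class name and the 30s timeout are assumptions):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hedged sketch of the start-up gate: request threads park on a latch that only
// acceptIncomingRequests() opens, so a missed call leaves them blocked until interrupted.
class StartupGateSketch {
    private final CountDownLatch incomingRequestsGate = new CountDownLatch(1);

    // Called once node start-up is complete; releases any threads waiting on the gate.
    void acceptIncomingRequests() {
        incomingRequestsGate.countDown();
    }

    // Request path: waits for the gate before dispatching to the handler.
    void handleIncomingRequest(Runnable handler) throws InterruptedException {
        if (incomingRequestsGate.await(30, TimeUnit.SECONDS) == false) {
            throw new IllegalStateException("transport is not accepting requests yet");
        }
        handler.run();
    }
}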

@cbuescher
Member Author

I didn't mean to close this; I only added more logging. I will revisit this in a while to see whether we get more of these failures.

@cbuescher
Member Author

@original-brownbear do you think #44622 fixed this or should we keep it open for investigation for a bit longer?

@original-brownbear
Member

Let's wait for #44939, which should definitely fix this (or at least have it fail very differently :D).

ywelsch added a commit that referenced this issue Jul 29, 2019
We currently block the transport thread on startup, which has caused test failures. I think this is some kind of deadlock situation. I don't think we should even block a transport thread, and there's also no need to do so. We can just reject requests as long as we're not fully set up. Note that the HTTP layer is only started much later (after we've completed full start up of the transport layer), so that one should be completely unaffected by this.

Closes #41745
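
The fix described in that commit message replaces blocking with fast rejection. A hedged sketch of that shape (illustrative only, not the actual change):

import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of the non-blocking alternative: keep an "accepting" flag and reject
// requests while start-up is incomplete, so no transport worker is ever parked on a latch.
class RejectWhileStartingSketch {
    private final AtomicBoolean acceptingRequests = new AtomicBoolean(false);

    void acceptIncomingRequests() {
        acceptingRequests.set(true);
    }

    void handleIncomingRequest(Runnable handler) {
        if (acceptingRequests.get() == false) {
            // Fail fast instead of blocking; the sender sees an immediate error and can retry.
            throw new IllegalStateException("transport not fully started, rejecting request");
        }
        handler.run();
    }
}
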
ywelsch added a commit that referenced this issue Jul 29, 2019
We currently block the transport thread on startup, which has caused test failures. I think this is some kind of deadlock situation. I don't think we should even block a transport thread, and there's also no need to do so. We can just reject requests as long as we're not fully set up. Note that the HTTP layer is only started much later (after we've completed full start up of the transport layer), so that one should be completely unaffected by this.

Closes #41745
jkakavas pushed a commit that referenced this issue Jul 31, 2019
We currently block the transport thread on startup, which has caused test failures. I think this is some kind of deadlock situation. I don't think we should even block a transport thread, and there's also no need to do so. We can just reject requests as long as we're not fully set up. Note that the HTTP layer is only started much later (after we've completed full start up of the transport layer), so that one should be completely unaffected by this.

Closes #41745