[CI] RemoteClusterClientTests#testConnectAndExecuteRequest fails #41745

Closed
cbuescher opened this issue May 2, 2019 · 17 comments · Fixed by #44124 or #44939
Labels
:Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI

Comments

@cbuescher
Member

On master: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-9/369/console

Couldn't reproduce with:

./gradlew :server:test --tests "org.elasticsearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest" \
  -Dtests.seed=5BFCD458F15F6DA4 \
  -Dtests.security.manager=true \
  -Dtests.locale=tr \
  -Dtests.timezone=America/Atka \
  -Dcompiler.java=12 \
  -Druntime.java=11
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([5BFCD458F15F6DA4:248A7562ED2FB133]:0)
	at org.junit.Assert.fail(Assert.java:86)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.junit.Assert.assertTrue(Assert.java:52)
	at org.elasticsearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest(RemoteClusterClientTests.java:60)
@cbuescher cbuescher added :Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI labels May 2, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@cbuescher
Member Author

Also seems to be an occasional issue on 7.x
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/1248/console

@cbuescher cbuescher self-assigned this May 6, 2019
@cbuescher
Member Author

The assertion that's failing checks that a remote node is connected through the remote cluster service (which in this case comes from a mock, since this is an ESTestCase). The log shows test timeouts:

 1> [2019-05-02T02:40:58,730][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterClientTests] fetching nodes from external cluster [test] failed
  1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
  1> 	at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
  1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
  1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:687) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:834) [?:?]
  1> 	Suppressed: java.lang.IllegalStateException: Future got interrupted
  1> 		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
  1> 		at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
  1> 		at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]
  1> 		at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:80) ~[main/:?]
  1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:452) ~[main/:?]
  1> 		at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:8.0.0-SNAPSHOT]
  1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
  1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
  1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
  1> 		at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
  1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:687) [main/:?]
  1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1> 		at java.lang.Thread.run(Thread.java:834) [?:?]
  1> 	Caused by: java.lang.InterruptedException
  1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1040) ~[?:?]
  1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345) ~[?:?]

I can trigger this kind of error by adding a wait of more than 30s in RemoteClusterConnection#collectRemoteNodes() in the IOInterruptible function passed there, but I wonder why such a long wait could happen in practice, especially since this isn't an integration test.
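
To make that reproduction idea concrete, here is a minimal, self-contained JDK sketch of the failure shape (the class name and timings are illustrative, not the actual test or Elasticsearch code): a worker thread blocks on a connection future, the connect attempt is cancelled after its deadline, and the interrupt surfaces as the "Future got interrupted" IllegalStateException seen in the logs.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: a connect worker blocked on a transport future is interrupted
// when the connect handler gives up (30s in the real test, shortened here to 2s).
public class SlowConnectSketch {
    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> connectionOpened = new CompletableFuture<>();

        Thread connectWorker = new Thread(() -> {
            try {
                // Stand-in for blocking on the connection future inside the connect step.
                connectionOpened.get(60, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                // This is the state the CI logs show: interrupted while still waiting.
                throw new IllegalStateException("Future got interrupted", e);
            } catch (ExecutionException | TimeoutException e) {
                throw new RuntimeException(e);
            }
        }, "connect-worker");

        connectWorker.start();
        TimeUnit.SECONDS.sleep(2);   // the real connect timeout is 30s; shortened for the sketch
        connectWorker.interrupt();   // simulates the connect handler being closed/cancelled
        connectWorker.join();
    }
}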

There has been another instance of this on 5.4 (https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-9/372/console), which also shows the same kind of timeout.

@cbuescher
Member Author

This part of the stack trace shows that the ConnectionManager hangs somewhere while opening the internal connection. The future that gets interrupted is a listener passed to Transport#openConnection(), which in this case is probably the MockNioTransport.

Suppressed: java.lang.IllegalStateException: Future got interrupted
  1> 		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
  1> 		at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
  1> 		at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]
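
For context, the pattern that excerpt points at can be sketched in plain Java roughly as follows (a hedged illustration, not the Elasticsearch source; all names here are made up): the connection manager hands the transport an asynchronous callback, bridges it into a future, and blocks the calling thread on that future, so an interrupt while the transport never answers surfaces exactly as "Future got interrupted".

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.Consumer;

// Hedged sketch of the listener-to-future bridge described above; all names are illustrative.
class OpenConnectionSketch {

    // Stand-in for an asynchronous openConnection(listener) on the transport: the callback is
    // invoked from a network thread, possibly never, if channel setup stalls.
    static void openConnectionAsync(Consumer<String> onConnected) {
        // ... asynchronous channel setup would happen here ...
    }

    // Stand-in for the blocking caller: wrap the callback in a future and block the calling
    // thread until the transport answers (or the thread is interrupted).
    static String openConnectionBlocking() {
        CompletableFuture<String> connected = new CompletableFuture<>();
        openConnectionAsync(connected::complete);
        try {
            return connected.get(); // blocks; an interrupt here becomes "Future got interrupted"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Future got interrupted", e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }
}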

@droberts195
Contributor

The same problem as #41745 (comment) occurred in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/1360/, although the failing test was RemoteClusterConnectionTests.testNodeDisconnected rather than RemoteClusterClientTests.testConnectAndExecuteRequest.

14:29:21   1> [2019-05-08T08:29:21,244][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterConnectionTests] fetching nodes from external cluster [test-cluster] failed
14:29:21   1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
14:29:21   1> 	at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1> 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
14:29:21   1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
14:29:21   1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
14:29:21   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
14:29:21   1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
14:29:21   1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
14:29:21   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
14:29:21   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
14:29:21   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
14:29:21   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
14:29:21   1> 	Suppressed: java.lang.IllegalStateException: Future got interrupted
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:209) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:80) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:452) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:7.2.0-SNAPSHOT]
14:29:21   1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
14:29:21   1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
14:29:21   1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
14:29:21   1> 		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
14:29:21   1> 		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
14:29:21   1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
14:29:21   1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
14:29:21   1> 		at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
14:29:21   1> 	Caused by: java.lang.InterruptedException
14:29:21   1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_212]
14:29:21   1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_212]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87) ~[main/:?]
14:29:21   1> 		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:57) ~[main/:?]
14:29:21   1> 		... 15 more

@cbuescher
Member Author

I talked to @original-brownbear, who has been investigating similar-looking problems elsewhere lately; as a next step he suggested using e.g. YourKit to identify potential locks and determine how much time threads spend waiting on them.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue May 9, 2019
* Follow up to elastic#39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates elastic#41745
@original-brownbear
Member

original-brownbear commented May 9, 2019

I opened #42000 to get us more useful logging for this situation.

original-brownbear added a commit that referenced this issue May 22, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates #41745
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue May 27, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to elastic#39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates elastic#41745
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to elastic#39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates elastic#41745
original-brownbear added a commit that referenced this issue May 27, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates #41745
@cbuescher
Member Author

cbuescher commented Jun 12, 2019

This appears to have failed again on master on May 4th (https://scans.gradle.com/s/ftmdnpa26cdty#tests) and on the 7.2 branch on May 26th (https://scans.gradle.com/s/vcnmxcdmlqsdu#tests). At least the latter might have the improved logging from #42572, although I'm not entirely sure that has been backported to the 7.2 branch.

@cbuescher
Member Author

Looks like the logging from #42572 wasn't backported to 7.2, unfortunately. I'll take a look to see if the logs reveal anything interesting anyway.

@cbuescher
Member Author

The test logs from the latest failure on 7.2:

1> [2019-05-25T21:12:38,915][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] before test
--
1> [2019-05-25T21:12:39,114][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:10304}, bound_addresses {[::1]:10303}, {127.0.0.1:10304}
1> [2019-05-25T21:12:39,145][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:10307}, bound_addresses {[::1]:10307}, {127.0.0.1:10307}
1> [2019-05-25T21:12:39,698][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] after test
1> [2019-05-25T21:12:39,711][INFO ][o.e.t.RemoteClusterClientTests] [testConnectAndExecuteRequest] before test
1> [2019-05-25T21:12:39,734][INFO ][o.e.t.TransportService   ] [testConnectAndExecuteRequest] publish_address {127.0.0.1:10304}, bound_addresses {[::1]:10303}, {127.0.0.1:10304}
1> [2019-05-25T21:12:39,749][INFO ][o.e.t.TransportService   ] [testConnectAndExecuteRequest] publish_address {127.0.0.1:10303}, bound_addresses {[::1]:10307}, {127.0.0.1:10303}
1> [2019-05-25T21:13:09,752][WARN ][o.e.t.RemoteClusterService] [testConnectAndExecuteRequest] failed to connect to remote clusters within 30s
1> [2019-05-25T21:13:09,755][WARN ][o.e.t.n.TestEventHandler ] [[elasticsearch[transport_worker][T#1]]] Slow execution on network thread [29984 milliseconds]
1> java.lang.RuntimeException: Slow exception on network thread
1> 	at org.elasticsearch.transport.nio.TestEventHandler.maybeLogElapsedTime(TestEventHandler.java:212) ~[framework-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:123) ~[framework-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:397) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129) ~[elasticsearch-nio-7.2.0-SNAPSHOT.jar:7.2.0-SNAPSHOT]
1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1> [2019-05-25T21:13:09,851][WARN ][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterClientTests] fetching nodes from external cluster [test] failed
1> org.elasticsearch.common.util.CancellableThreads$ExecutionCancelledException: operation was cancelled reason [connect handler is closed]
1> 	at org.elasticsearch.common.util.CancellableThreads.checkForCancel(CancellableThreads.java:65) ~[main/:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:130) ~[main/:7.2.0-SNAPSHOT]
1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
1> 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1> 	Suppressed: org.elasticsearch.transport.ConnectTransportException: [remote_node][127.0.0.1:10304] general node connection failure
1> 		at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:127) ~[main/:?]
1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:467) ~[main/:?]
1> 		at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[main/:7.2.0-SNAPSHOT]
1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:446) [main/:?]
1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:336) [main/:?]
1> 		at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:434) [main/:?]
1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> 		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
1> 		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [main/:?]
1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
1> 		at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1> 	Caused by: java.lang.IllegalStateException: handshake failed with {remote_node}{nt4h9qp_QbysVWlbcfs1Kg}{jeF8L4uhSvaOyCiOz8IwOw}{127.0.0.1}{127.0.0.1:10304}
1> 		at org.elasticsearch.transport.TransportService.handshake(TransportService.java:421) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1> 		at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1> 		... 12 more
1> 	Caused by: java.lang.IllegalStateException: Future got interrupted
1> 		at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:47) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.handshake(TransportService.java:419) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1> 		at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1> 		... 12 more
1> 	Caused by: java.lang.InterruptedException
1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_212]
1> 		at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_212]
1> 		at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252) ~[main/:?]
1> 		at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87) ~[main/:?]
1> 		at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:44) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.handshake(TransportService.java:419) ~[main/:?]
1> 		at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[main/:?]
1> 		at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[main/:?]
1> 		... 12 more
1> [2019-05-25T21:13:09,864][INFO ][o.e.t.RemoteClusterClientTests] [testConnectAndExecuteRequest] after test


@original-brownbear
Member

@cbuescher I'm on PTO starting effectively right now, so I might not get to this, but feel free to backport this to 7.2 to get a better shot at catching the deadlock here. If you backport the logging improvements, though, please make sure to backport exactly #42572 and not the underlying commit to master, since the backport contains a bug fix to this logging that is required to make it properly functional.

cbuescher pushed a commit that referenced this issue Jun 13, 2019
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked, not afterwards
  * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read, write, connect etc.)
* Relates #41745
@cbuescher
Member Author

Backported #42572 to the 7.2 branch with 44f2db8 to get better logging there as well.

@cbuescher
Member Author

cbuescher commented Jul 8, 2019

Just checked again: we got another one, with potentially better logging, on master on June 24, 2019:
https://scans.gradle.com/s/b4qsaeqsirvbw/tests/nfxodyych4uas-6q6452dcdcklg

@original-brownbear
Member

@cbuescher this looks like a missed call to TransportService#acceptIncomingRequests somewhere; we're blocked on the latch that only gets unblocked by a call to that method.
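
For readers unfamiliar with that code path, the gating pattern being described works roughly like the sketch below (a hedged illustration, not the actual TransportService implementation; the class name and the 30s timeout are assumptions):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hedged sketch of the start-up gate: request threads park on a latch that only
// acceptIncomingRequests() opens, so a missed call leaves them blocked until interrupted.
class StartupGateSketch {
    private final CountDownLatch incomingRequestsGate = new CountDownLatch(1);

    // Called once node start-up is complete; releases any threads waiting on the gate.
    void acceptIncomingRequests() {
        incomingRequestsGate.countDown();
    }

    // Request path: waits for the gate before dispatching to the handler.
    void handleIncomingRequest(Runnable handler) throws InterruptedException {
        if (incomingRequestsGate.await(30, TimeUnit.SECONDS) == false) {
            throw new IllegalStateException("transport is not accepting requests yet");
        }
        handler.run();
    }
}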

@cbuescher
Member Author

I didn't mean to close this; I only added more logging. I will revisit this in a while to see whether we get more of these failures.

@cbuescher
Member Author

@original-brownbear do you think #44622 fixed this or should we keep it open for investigation for a bit longer?

@original-brownbear
Member

Let's wait for #44939, which should definitely fix this (or at least have it fail very differently :D).

ywelsch added a commit that referenced this issue Jul 29, 2019
We currently block the transport thread on startup, which has caused test failures. I think this is some kind of deadlock situation. I don't think we should even block a transport thread, and there's also no need to do so. We can just reject requests as long as we're not fully set up. Note that the HTTP layer is only started much later (after we've completed full start up of the transport layer), so that one should be completely unaffected by this.

Closes #41745
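
The fix described in that commit message replaces blocking with fast rejection. A hedged sketch of that shape (illustrative only, not the actual change):

import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of the non-blocking alternative: keep an "accepting" flag and reject
// requests while start-up is incomplete, so no transport worker is ever parked on a latch.
class RejectWhileStartingSketch {
    private final AtomicBoolean acceptingRequests = new AtomicBoolean(false);

    void acceptIncomingRequests() {
        acceptingRequests.set(true);
    }

    void handleIncomingRequest(Runnable handler) {
        if (acceptingRequests.get() == false) {
            // Fail fast instead of blocking; the sender sees an immediate error and can retry.
            throw new IllegalStateException("transport not fully started, rejecting request");
        }
        handler.run();
    }
}
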
ywelsch added a commit that referenced this issue Jul 29, 2019
We currently block the transport thread on startup, which has caused test failures. I think this is some kind of deadlock situation. I don't think we should even block a transport thread, and there's also no need to do so. We can just reject requests as long as we're not fully set up. Note that the HTTP layer is only started much later (after we've completed full start up of the transport layer), so that one should be completely unaffected by this.

Closes #41745
jkakavas pushed a commit that referenced this issue Jul 31, 2019
We currently block the transport thread on startup, which has caused test failures. I think this is some kind of deadlock situation. I don't think we should even block a transport thread, and there's also no need to do so. We can just reject requests as long as we're not fully set up. Note that the HTTP layer is only started much later (after we've completed full start up of the transport layer), so that one should be completely unaffected by this.

Closes #41745