
RecoverySourceHandler#runWithGenericThreadPool caused deadlock #85839


Closed
Tracked by #77466
DaveCTurner opened this issue Apr 12, 2022 · 2 comments · Fixed by #86127
Labels
>bug
:Distributed Indexing/Recovery (anything around constructing a new shard, either from a local or a remote source)
Team:Distributed (Obsolete) (meta label for distributed team, obsolete; replaced by Distributed Indexing/Coordination)

Comments

DaveCTurner (Contributor) commented Apr 12, 2022

We saw a benchmark of a 7.17.2 cluster get stuck with all generic threads blocked in RecoverySourceHandler#runWithGenericThreadPool (see many-shards-threaddump.txt.gz):

$ python ~/src/jstack2json/jstack2json.py many-shards-threaddump.txt | jq '.[].threads[] | select(.elasticsearch.threadpool == "generic")' -cMr | wc -l
     144
$ python ~/src/jstack2json/jstack2json.py many-shards-threaddump.txt | jq '.[].threads[] | select(.elasticsearch.threadpool == "generic") | .stack' -cMr | sed -e 's/parking to wait for  <0x[0-9a-f]*>//' | sort -u | wc -l
       1
$ python ~/src/jstack2json/jstack2json.py many-shards-threaddump.txt | jq '.[].threads[] | select(.elasticsearch.threadpool == "generic") | .stack' -cMr | head -n1 | jq '.[]' -cMr
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for  <0x0000000496000040> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1047)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:243)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75)
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
at org.elasticsearch.indices.recovery.RecoverySourceHandler.runWithGenericThreadPool(RecoverySourceHandler.java:486)
at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$acquireSafeCommit$21(RecoverySourceHandler.java:476)
at org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6704/0x0000000801a21000.run(Unknown Source)
at org.elasticsearch.index.engine.Engine$IndexCommitRef.close(Engine.java:1919)
at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:74)
at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:116)
at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:66)
at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$recoverToTarget$7(RecoverySourceHandler.java:283)
at org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6706/0x0000000801a21480.accept(Unknown Source)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136)
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:113)
at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100)
at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:131)
at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:139)
at org.elasticsearch.action.StepListener.innerOnResponse(StepListener.java:52)
at org.elasticsearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:29)
at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$recoverFilesFromSourceAndSnapshot$32(RecoverySourceHandler.java:762)
at org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6735/0x0000000801890000.accept(Unknown Source)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136)
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:113)
at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100)
at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:131)
at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:139)
at org.elasticsearch.action.StepListener.innerOnResponse(StepListener.java:52)
at org.elasticsearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:29)
at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101)
at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:186)
at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101)
at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:389)
at org.elasticsearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:143)
at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:389)
at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471)
at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:340)
at org.elasticsearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:328)
at org.elasticsearch.transport.InboundHandler$$Lambda$5302/0x000000080188aa90.run(Unknown Source)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718)
at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
at java.lang.Thread.run([email protected]/Thread.java:833)

Do we really need to block in this method, or can these actions just be fire-and-forget things? I couldn't see an obvious reason for needing to wait for them to complete.
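For illustration, the failure mode here is the classic thread-pool self-deadlock: a task running on a bounded pool blocks waiting for another task submitted to the same pool, so once every worker is blocked, nothing can make progress. This is a minimal standalone Java sketch of that shape (not Elasticsearch code; the single-thread pool stands in for the saturated generic pool, and the timeout only exists so the demo terminates):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolSelfDeadlock {
    static String demo() throws Exception {
        // A bounded pool standing in for the saturated "generic" thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(1);

        // The outer task occupies the pool's only thread, then blocks waiting
        // for an inner task submitted to the same pool -- the same shape as
        // blocking in runWithGenericThreadPool from a generic-pool thread.
        Future<String> outer = pool.submit(() -> {
            Future<?> inner = pool.submit(() -> { /* never runs: no free thread */ });
            try {
                inner.get(1, TimeUnit.SECONDS); // blocks; in real code there is no timeout
                return "completed";
            } catch (TimeoutException e) {
                return "stuck"; // the real deadlock hangs here forever
            }
        });

        String result = outer.get();
        pool.shutdownNow();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "stuck"
    }
}
```

With 144 generic threads all parked in the same `BaseFuture$Sync.get`, every worker is the "outer" task above and the queued "inner" tasks can never be scheduled.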

(FWIW this was clearly caused by setting cluster.routing.allocation.node_concurrent_recoveries too high, but really we shouldn't deadlock in any configuration)

Relates #77466

@DaveCTurner DaveCTurner added >bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Apr 12, 2022
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Apr 12, 2022
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear (Member) commented:

I'll see if I can find a quick solution here; it would be nice to have this fixed for benchmarks.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 25, 2022
Remove the blocking wait when releasing safe commits or store references on the recovery source.
This seems safe enough these days with elastic#85238 not tripping
and the assertion that makes sure we never submit the close task to an already shut-down pool

closes elastic#85839
original-brownbear added a commit that referenced this issue May 17, 2022
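The direction of the fix is to stop waiting: submit the release task to the pool and return immediately, so no pool thread ever blocks on another pool task. A hedged before/after sketch (the method names `blockingRelease`/`asyncRelease` are hypothetical, not the actual code in #86127):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReleaseShapes {
    // Deadlock-prone shape: submit to the pool and block until completion,
    // possibly from a thread of that same pool.
    static void blockingRelease(ExecutorService pool, Runnable close) throws Exception {
        pool.submit(close).get(); // hangs if every pool thread is waiting here
    }

    // Fire-and-forget shape: submit the close task and return immediately;
    // no caller thread is tied up waiting for a pool slot.
    static void asyncRelease(ExecutorService pool, Runnable close) {
        pool.execute(close);
    }

    static boolean demo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch done = new CountDownLatch(1);
        asyncRelease(pool, done::countDown); // returns at once; close runs async
        boolean finished = done.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return finished;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "true"
    }
}
```

As the commit message notes, dropping the wait is only safe given the guard that close tasks are never submitted to an already shut-down pool.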