What version of gRPC-Java are you using?
1.54.1
What is your environment?
JDK 1.8, Weblogic 12
What did you expect to see?
No deadlocked threads.
What did you see instead?
One thread trying to cancel and another thread trying to create a new stream, each holding the lock the other needs. It also happens with two threads both trying to cancel.
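For illustration, here is a minimal sketch of the lock-order inversion described above. The lock names and thread bodies are hypothetical stand-ins, not the actual RetriableStream or OkHttp transport fields:

```java
// Hypothetical sketch: two locks acquired in opposite order by two threads,
// the classic shape of the deadlock seen in the attached stack traces.
public final class DeadlockSketch {
  private static final Object streamLock = new Object();     // stand-in for the RetriableStream lock
  private static final Object transportLock = new Object();  // stand-in for the transport lock

  public static void main(String[] args) {
    Thread canceller = new Thread(() -> {
      synchronized (streamLock) {        // cancel path: takes the stream lock first...
        sleepQuietly(100);
        synchronized (transportLock) {   // ...then blocks waiting for the transport lock
          // cancel the underlying transport stream
        }
      }
    });
    Thread creator = new Thread(() -> {
      synchronized (transportLock) {     // newStream path: takes the transport lock first...
        sleepQuietly(100);
        synchronized (streamLock) {      // ...then blocks waiting for the stream lock
          // start a new (hedged) attempt
        }
      }
    });
    canceller.start();
    creator.start();                     // with the right timing, both threads block forever
  }

  private static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```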
Steps to reproduce the bug
I'm working on an application handling ~10 million requests per day over an unfortunately suboptimal network, which causes quite frequent retries.
I can't reproduce the issue, nor do I know exactly what is causing it, but the application had been running with non-hedging retry for a few months without any deadlock. We recently switched to hedging retry to remedy the network delays; since then, after a few hours or days, servers start deadlocking (see the attached stack traces of both threads).
I'm not certain hedging is the cause, but one code path, io.grpc.internal.RetriableStream$1CommitTask.run(RetriableStream.java:194), is commented as being used only for hedging.
stacktraces.txt
stacktraces_2.txt
ejona86 changed the title from "Hedging retry seems to cause a deadlock in rare cases" to "Hedging retry seems to cause a deadlock in rare cases with OkHttp" on Jun 30, 2023.
This issue is limited to OkHttp; it shouldn't be possible to trigger with Netty. But it is a RetriableStream bug.
In 1.54, I think this would be limited to hedging, but in 1.55 it might also happen with normal retries because of #10007.
Generally with issues like this we want to break the nested lock in both directions (i.e., newStream and cancel). We might be able to delay newStream until we jump to the application thread for draining. Cancel we could dump onto the application thread as well. newStream is already safe if called from the scheduledExecutorService.
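As a rough illustration of "breaking the nested lock," a minimal sketch of one possible shape — the class and field names are hypothetical, not the actual RetriableStream code: commit the state change under the lock, but hand the transport-facing call to an executor so it runs after the lock is released and no thread ever holds both locks.

```java
import java.util.concurrent.Executor;

// Hypothetical sketch: state changes happen under `lock`, but the transport
// call itself is deferred to an executor (e.g. the application thread used
// for draining), so the transport lock is never taken while `lock` is held.
final class CancelWithoutNestedLock {
  private final Object lock = new Object();
  private final Executor callExecutor;   // hypothetical: where deferred work runs
  private boolean cancelled;

  CancelWithoutNestedLock(Executor callExecutor) {
    this.callExecutor = callExecutor;
  }

  void cancel(final Runnable transportCancel) {
    synchronized (lock) {
      if (cancelled) {
        return;                          // already cancelled; nothing to do
      }
      cancelled = true;                  // commit the state change under the lock
    }
    // Perform the transport-facing work outside the lock, on the executor,
    // so it can take the transport lock without nesting.
    callExecutor.execute(transportCancel);
  }
}
```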