What version of gRPC-Java are you using?
1.54.1
What is your environment?
JDK 1.8, Weblogic 12
What did you expect to see?
No deadlocked threads.
What did you see instead?
One thread trying to cancel and another thread trying to create a new stream, each holding the lock the other needs. It also happens with two threads both trying to cancel.
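For illustration, here is a minimal sketch of the lock-order inversion described above. The lock names and thread bodies are hypothetical stand-ins, not the actual RetriableStream or OkHttp transport fields:

```java
// Hypothetical sketch: two locks acquired in opposite order by two threads,
// the classic shape of the deadlock seen in the attached stack traces.
public final class DeadlockSketch {
  private static final Object streamLock = new Object();     // stand-in for the RetriableStream lock
  private static final Object transportLock = new Object();  // stand-in for the transport lock

  public static void main(String[] args) {
    Thread canceller = new Thread(() -> {
      synchronized (streamLock) {        // cancel path: takes the stream lock first...
        sleepQuietly(100);
        synchronized (transportLock) {   // ...then blocks waiting for the transport lock
          // cancel the underlying transport stream
        }
      }
    });
    Thread creator = new Thread(() -> {
      synchronized (transportLock) {     // newStream path: takes the transport lock first...
        sleepQuietly(100);
        synchronized (streamLock) {      // ...then blocks waiting for the stream lock
          // start a new (hedged) attempt
        }
      }
    });
    canceller.start();
    creator.start();                     // with the right timing, both threads block forever
  }

  private static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```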
Steps to reproduce the bug
I'm working on an application handling ~10 million requests per day over an unfortunately suboptimal network, which causes quite frequent retries.
I can't reproduce the issue, nor do I know exactly what is causing it, but the application had been running with non-hedging retry for a few months without any deadlock. We recently switched to hedging retry to remedy the network delays; since then, after a few hours or days, servers start deadlocking (see the attached stack traces of both threads).
I'm not certain hedging is the cause, but one code path, io.grpc.internal.RetriableStream$1CommitTask.run(RetriableStream.java:194), is commented as being used only for hedging.
stacktraces.txt
stacktraces_2.txt
ejona86 changed the title from "Hedging retry seems to cause a deadlock in rare cases" to "Hedging retry seems to cause a deadlock in rare cases with OkHttp" on Jun 30, 2023.
This issue is limited to OkHttp; it shouldn't be possible to trigger with Netty. But it is a RetriableStream bug.
In 1.54, I think this would be limited to hedging, but in 1.55 it might also happen with normal retries because of #10007.
Generally with issues like this we want to break the nested lock in both directions (i.e., newStream and cancel). We might be able to delay newStream until we jump to the application thread for draining. Cancel we could dump onto the application thread as well. newStream is already safe if called from the scheduledExecutorService.
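As a rough illustration of "breaking the nested lock," a minimal sketch of one possible shape — the class and field names are hypothetical, not the actual RetriableStream code: commit the state change under the lock, but hand the transport-facing call to an executor so it runs after the lock is released and no thread ever holds both locks.

```java
import java.util.concurrent.Executor;

// Hypothetical sketch: state changes happen under `lock`, but the transport
// call itself is deferred to an executor (e.g. the application thread used
// for draining), so the transport lock is never taken while `lock` is held.
final class CancelWithoutNestedLock {
  private final Object lock = new Object();
  private final Executor callExecutor;   // hypothetical: where deferred work runs
  private boolean cancelled;

  CancelWithoutNestedLock(Executor callExecutor) {
    this.callExecutor = callExecutor;
  }

  void cancel(final Runnable transportCancel) {
    synchronized (lock) {
      if (cancelled) {
        return;                          // already cancelled; nothing to do
      }
      cancelled = true;                  // commit the state change under the lock
    }
    // Perform the transport-facing work outside the lock, on the executor,
    // so it can take the transport lock without nesting.
    callExecutor.execute(transportCancel);
  }
}
```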