Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hedging retry seems to cause a deadlock in rare cases with OkHttp #10314

Closed
thomob opened this issue Jun 27, 2023 · 4 comments · Fixed by #10386
Closed

Hedging retry seems to cause a deadlock in rare cases with OkHttp #10314

thomob opened this issue Jun 27, 2023 · 4 comments · Fixed by #10386
Assignees
Labels
Milestone

Comments

@thomob
Copy link

thomob commented Jun 27, 2023

What version of gRPC-Java are you using?

1.54.1

What is your environment?

JDK 1.8, Weblogic 12

What did you expect to see?

No deadlocked threads

What did you see instead?

One thread trying to cancel and another thread trying to create a new stream, both holding the opposing locks.

Also happens on 2 threads trying to cancel.

Steps to reproduce the bug

I'm working on an application with ~10m requests per day and unfortunately a not optimal network which causes quite frequent retries.
I can't reproduce the issue nor do I know what is causing this exactly, but the application was running with the non-hedging retry for a few months now without any deadlock. Recently we switched to hedging retry, to remedy the network delays. Since then after a few hours or days we see servers start having deadlocks (see stacktraces of both threads attached).

Not exactly sure if this is caused by hedging but one code path in
io.grpc.internal.RetriableStream$1CommitTask.run(RetriableStream.java:194)
mentions that this is only used for hedging.

stacktraces.txt

stacktraces_2.txt

@ranavivek04
Copy link

are you using timeout for matching methods along with hedging policy?

@ejona86 ejona86 added this to the Next milestone Jun 30, 2023
@ejona86 ejona86 removed the okhttp label Jun 30, 2023
@ejona86 ejona86 changed the title Hedging retry seems to cause a deadlock in rare cases Hedging retry seems to cause a deadlock in rare cases with OkHttp Jun 30, 2023
@ejona86
Copy link
Member

ejona86 commented Jun 30, 2023

This issue is limited to OkHttp. This shouldn't be possible to trigger with Netty. But it is a RetriableStream bug.

In 1.54, I think this would be limited to hedging. But in 1.55 it might be able to happen with normal retries because of #10007 .

Generally with issues like this we want to break the nested lock in both directions (i.e., newStream and cancel). We might be able to delay newStream until we jump to the application thread for draining. Cancel, we could dump onto the application thread as well. The newStream is already safe if called from scheduledExecutorService.

@thomob
Copy link
Author

thomob commented Jul 11, 2023

are you using timeout for matching methods along with hedging policy?

Yes, we hedge 3-4 times with a short delay depending on the method and after a few seconds we timeout the call.

@thomob
Copy link
Author

thomob commented Jul 11, 2023

This issue is limited to OkHttp. This shouldn't be possible to trigger with Netty. But it is a RetriableStream bug.

Good hint. Thank you. We will try it with netty.

@ejona86 ejona86 modified the milestones: Next, 1.58 Jul 27, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants