New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race condition in idleTimer & ManagedChannel#enterIdle - refactor needed #8714
Comments
@ejona86 This seems a bug. Looking at the code, I don't really understand the comment It seems not true if |
Also, kinda irrelevant, but to work around this I have to temporary disable idleTimer and found this code in
So to disable the idleTimer, I need to pass a value larger than 30 days. Passing |
Why do you say that? NameResolver is initialized in the constructor.
It looks like this is a timer bug. The timer firing is racing with the channel being forced into idle (which cancels the timer) by the outside caller.
grpc-java/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java Lines 1353 to 1366 in 408ffe8
But the idle timer doesn't, nor does Rescheduler, as it just treats the SynchronizationContext as an Executor. grpc-java/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java Lines 348 to 354 in 408ffe8
We need to either make the timer cancellation non-racy, or we need to add double-check behavior in IdleModeTimer. I generally prefer making timer cancellation non-racy, but Rescheduler will probably make that more annoying. For disabling idleTimeout, the expectation is you'd pass Long.MAX_VALUE or the like. Seems we are missing language like we have on keepAliveTime. We should add something similar to idleTimeout:
|
Sorry, I mixed up NameResolver being null, and NameResolver not started. The comment in the code is correct, the NameResolver is not null. My comment should have been: The following checkState seems not true if
|
If the resolver is not started then the channel is IDLE, thus there's no point in calling enterIdle(). That's why the current semantics are enterIdle() should only be called if non-IDLE. If it is being called when already IDLE (which is a bug in the caller), there are two possible approaches: 1) stop doing that; fix the code that is calling it inappropriately or 2) change its semantics to allow calling even when IDLE. |
@ejona86 Thoughts about why Approach 1 ("stop doing that; fix the code that is calling it inappropriately") is not solving anything:
Disabling |
This change in But I'm not sure if it's the right fix or the race should be addressed somewhere inside the
don't race with each other. Because they both are run under But obviously, according to the fact that Spikhalskiy@6d260b8 fixes the issue, something is missing. We probably read a stale version of |
The channel is initially IDLE and if the idleTimer expires too soon, it will try to I think Spikhalskiy/grpc-java@6d260b8 is the right fix. @ejona86 what do you think? |
@dapengzhang0, I think checking for |
I think the cause of When
the in-use status of the grpc-java/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java Lines 2199 to 2205 in df4ac59
|
We believe this is a regression introduced by #8425, which was part of 1.41. |
Edit (2021/12/15): The following bug is fixed with a hot patch #8746. However, the logic of
idleTimer
inManagedChannel
is complex, delicate, and bug-prone. Refactoring is needed to make it more robust and easy to understand.What version of gRPC-Java are you using?
1.41.0, current master: 8382bd8
What is your environment?
Java 11/16, MacOS. Original user report is probably from a Linux server.
What did you expect to see?
We trigger
ManagedChannel#enterIdle
API periodically while also keepingManagedChannelImpl#idleTimer
with default settings. We expectManagedChannelImpl
to work properly and not to end up in panic.What did you see instead?
The channel is broken after it.
Steps to reproduce the bug
io.grpc.testing.integration.XdsTestClient
:Spikhalskiy@86a0000
io.grpc.testing.integration.XdsTestServer#main
io.grpc.testing.integration.XdsTestClient#main
with--num_channels=1000
XdsTestClient
logsRelevant report
The original user report in Temporal JavaSDK: temporalio/sdk-java#863
According to the report, the issue happened after a long period of channel inactivity (> idleTimer period).
My best guess is that our periodic
#enterIdle
shuts down thenameResolver
while still-scheduled (cancel(permanent=false)
) idleTimer sees a staleenabled=true
and performs the action. But I don't see how this can happen.The text was updated successfully, but these errors were encountered: