
Race condition in idleTimer & ManagedChannel#enterIdle - refactor needed #8714

Open
Spikhalskiy opened this issue Nov 18, 2021 · 11 comments
Spikhalskiy commented Nov 18, 2021

Edit (2021/12/15): The following bug is fixed with a hot patch #8746. However, the logic of idleTimer in ManagedChannel is complex, delicate, and bug-prone. Refactoring is needed to make it more robust and easy to understand.

What version of gRPC-Java are you using?

1.41.0, current master: 8382bd8

What is your environment?

Java 11/16, MacOS. Original user report is probably from a Linux server.

What did you expect to see?

We trigger ManagedChannel#enterIdle API periodically while also keeping ManagedChannelImpl#idleTimer with default settings. We expect ManagedChannelImpl to work properly and not to end up in panic.

What did you see instead?

Caused by: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    ...

Caused by: java.lang.IllegalStateException: nameResolver is not started
    at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
    at io.grpc.internal.ManagedChannelImpl.shutdownNameResolverAndLoadBalancer(ManagedChannelImpl.java:360)
    at io.grpc.internal.ManagedChannelImpl.enterIdleMode(ManagedChannelImpl.java:422)
    at io.grpc.internal.ManagedChannelImpl.access$900(ManagedChannelImpl.java:118)
    at io.grpc.internal.ManagedChannelImpl$IdleModeTimer.run(ManagedChannelImpl.java:352)
    at io.grpc.internal.Rescheduler$ChannelFutureRunnable.run(Rescheduler.java:103)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
    at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
    at io.grpc.internal.Rescheduler$FutureRunnable.run(Rescheduler.java:80)
    at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)

The channel is broken after this happens.

Steps to reproduce the bug

  1. Apply the change to io.grpc.testing.integration.XdsTestClient:
    Spikhalskiy@86a0000
  2. Start io.grpc.testing.integration.XdsTestServer#main
  3. Start io.grpc.testing.integration.XdsTestClient#main with --num_channels=1000
  4. Find the reported exception in XdsTestClient logs

Relevant report

The original user report in Temporal JavaSDK: temporalio/sdk-java#863
According to the report, the issue happened after a long period of channel inactivity (> idleTimer period).
My best guess is that our periodic #enterIdle shuts down the nameResolver while a still-scheduled idleTimer (cancelled with cancel(permanent=false)) sees a stale enabled=true and performs the action anyway. But I don't see how this can happen.

Spikhalskiy added a commit to Spikhalskiy/grpc-java that referenced this issue Nov 19, 2021
Spikhalskiy added a commit to Spikhalskiy/grpc-java that referenced this issue Nov 19, 2021
@Spikhalskiy Spikhalskiy changed the title Possible race condition in idleTimer & ManagedChannel#enterIdle Race condition in idleTimer & ManagedChannel#enterIdle Nov 19, 2021

dapengzhang0 commented Nov 19, 2021

@ejona86 This seems to be a bug. Looking at the code, I don't really understand the comment
https://github.com/grpc/grpc-java/blob/v1.42.1/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L418-L422

It seems not true if enterIdle() is called even before the resolver gets started.


Spikhalskiy commented Nov 19, 2021

Also, kinda irrelevant, but to work around this I had to temporarily disable idleTimer, and I found this code in ManagedChannelImplBuilder#idleTimeout:

    if (unit.toDays(value) >= IDLE_MODE_MAX_TIMEOUT_DAYS) {
      // This disables idle mode
      this.idleTimeoutMillis = ManagedChannelImpl.IDLE_TIMEOUT_MILLIS_DISABLE;

So to disable the idleTimer, I need to pass a value larger than 30 days. Passing ManagedChannelImpl.IDLE_TIMEOUT_MILLIS_DISABLE (which is -1) doesn't work and leads to an exception.
It would be great if ManagedChannelBuilder provided a better interface for this. I would personally prefer having a public constant equal to -1 and mentioning it in the ManagedChannelBuilder#idleTimeout javadoc.
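For reference, the threshold behavior quoted above can be sketched with a small self-contained model. The constants mirror the ones named in this thread, but the helper method `resolveIdleTimeoutMillis` is made up for illustration; this is not gRPC's actual code:

```java
import java.util.concurrent.TimeUnit;

// Toy model of the quoted ManagedChannelImplBuilder#idleTimeout logic.
class IdleTimeoutDemo {
    static final long IDLE_TIMEOUT_MILLIS_DISABLE = -1; // internal sentinel, not public API
    static final int IDLE_MODE_MAX_TIMEOUT_DAYS = 30;

    // Hypothetical helper mirroring the builder's branching.
    static long resolveIdleTimeoutMillis(long value, TimeUnit unit) {
        if (value <= 0) {
            // This is why passing -1 directly fails with an exception.
            throw new IllegalArgumentException("idle timeout must be positive: " + value);
        }
        if (unit.toDays(value) >= IDLE_MODE_MAX_TIMEOUT_DAYS) {
            // Anything at or above 30 days disables idle mode.
            return IDLE_TIMEOUT_MILLIS_DISABLE;
        }
        return unit.toMillis(value);
    }
}
```

Under this sketch, `idleTimeout(31, TimeUnit.DAYS)` disables idle mode, while `idleTimeout(-1, ...)` throws.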


ejona86 commented Nov 20, 2021

It seems not true if enterIdle() is called even before the resolver gets started.

Why do you say that? NameResolver is initialized in the constructor.

this.nameResolver = getNameResolver(

It looks like this is a timer bug. The timer firing is racing with the channel being forced into idle (which cancels the timer) by the outside caller.

enterIdle() has a check to avoid re-entering idle mode (lbHelper == null):

public void enterIdle() {
  final class PrepareToLoseNetworkRunnable implements Runnable {
    @Override
    public void run() {
      if (shutdown.get() || lbHelper == null) {
        return;
      }
      cancelIdleTimer(/* permanent= */ false);
      enterIdleMode();
    }
  }

  syncContext.execute(new PrepareToLoseNetworkRunnable());
}

But the idle timer doesn't, nor does Rescheduler, as it just treats the SynchronizationContext as an Executor.

private class IdleModeTimer implements Runnable {
  @Override
  public void run() {
    enterIdleMode();
  }
}

We need to either make the timer cancellation non-racy, or we need to add double-check behavior in IdleModeTimer. I generally prefer making timer cancellation non-racy, but Rescheduler will probably make that more annoying.
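Concretely, the double-check option could look something like the following self-contained sketch. The class and field names here are stand-ins for the relevant ManagedChannelImpl state, not gRPC's actual code: the timer body re-validates that the channel has not already been forced idle before acting.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the channel state involved in the race.
class IdleTimerDoubleCheckDemo {
    Object lbHelper = new Object();      // non-null while the channel is out of idle
    final AtomicBoolean shutdown = new AtomicBoolean();
    int badEnterIdleCalls = 0;           // enterIdleMode() while already idle => "Panic!"

    void enterIdleMode() {
        if (lbHelper == null) {
            badEnterIdleCalls++;         // in the real code this trips checkState(...)
            return;
        }
        lbHelper = null;                 // shut down name resolver and load balancer
    }

    // Timer body with the double-check added, mirroring the guard enterIdle() already has:
    Runnable idleModeTimer() {
        return () -> {
            if (shutdown.get() || lbHelper == null) {
                return;                  // stale timer: channel was already forced idle
            }
            enterIdleMode();
        };
    }
}
```

With the guard in place, a timer that fires after an external enterIdle() has already entered idle mode becomes a no-op instead of tripping the checkState.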


For disabling idleTimeout, the expectation is you'd pass Long.MAX_VALUE or the like. Seems we are missing language like we have on keepAliveTime. We should add something similar to idleTimeout:

Long.MAX_VALUE nano seconds or an unreasonably large value will disable keepalive


dapengzhang0 commented Nov 22, 2021

It seems not true if enterIdle() is called even before the resolver gets started.

Why do you say that? NameResolver is initialized in the constructor.

Sorry, I mixed up NameResolver being null with NameResolver not being started. The comment in the code is correct; the NameResolver is not null. My comment should have been:

The following checkState seems not true if enterIdle() is called even before the resolver gets started:

checkState(nameResolverStarted, "nameResolver is not started");

@dapengzhang0 dapengzhang0 added this to the Next milestone Nov 22, 2021
@dapengzhang0 dapengzhang0 self-assigned this Nov 22, 2021

ejona86 commented Nov 22, 2021

The following checkState seems not true if enterIdle() is called even before the resolver gets started:

If the resolver is not started then the channel is IDLE, thus there's no point in calling enterIdle(). That's why the current semantics are enterIdle() should only be called if non-IDLE. If it is being called when already IDLE (which is a bug in the caller), there are two possible approaches: 1) stop doing that; fix the code that is calling it inappropriately or 2) change its semantics to allow calling even when IDLE.


Spikhalskiy commented Nov 23, 2021

@ejona86
Approach 2 is already implemented.
Calling enterIdle is already fine if the channel is IDLE. There is a check inside enterIdle for lbHelper == null that makes enterIdle a no-op if the channel is already IDLE.

Thoughts on why Approach 1 ("stop doing that; fix the code that is calling it inappropriately") would not solve anything:

  1. Following the exception stacktrace, enterIdle is most likely called before idleTimer fires, and the race happens after enterIdle and before idleTimer. It's unclear how checking for IDLE before calling enterIdle would help here: the exception about the unexpected state comes from idleTimer, not from enterIdle.
  2. It would not help anyway, because there is already an effective check for the IDLE state inside enterIdle that runs under the SynchronizationContext (
    if (shutdown.get() || lbHelper == null) {
    ). An additional check outside can't add anything, so it's not a solution. The problem is a race, not a double call to enterIdle.
  3. The ManagedChannel implementation is supposed to be thread-safe. Approach 1's requirement to check for IDLE before calling enterIdle would require external synchronization (because idleTimer can fire in between), which makes the API effectively non-thread-safe.
  4. And what would our caller even use for the synchronization described in (3) to make Approach 1 work?

Disabling idleTimer in the provided reproduction fixes the test. So the problem is clearly not in the enterIdle contract; the issue is a race condition.
The problem is probably that there is no check inside idleTimer, and the cancellation of idleTimer inside enterIdle races with its scheduling/execution. The absence of a check during execution, plus the race between cancelling the scheduled task and the scheduled execution itself, can lead to the bug.


Spikhalskiy commented Nov 23, 2021

This change in IdleModeTimer fixes the race and preserves all the current contracts:
Spikhalskiy@6d260b8

But I'm not sure if it's the right fix, or whether the race should be addressed somewhere inside io.grpc.internal.Rescheduler or io.grpc.SynchronizationContext.
Because I would expect that

  1. the Rescheduler#cancel(permanent = false) call that sets enabled=false for idleTimer inside enterIdle, and
  2. idleTimer's check of the enabled flag inside ChannelFutureRunnable#run

don't race with each other, because they are both run under SynchronizationContext#execute. I would expect SynchronizationContext to guarantee happens-before between executions of Runnables, which it looks like it attempts to provide by using a queue (happens-before between add and poll) and CAS (happens-before between drains that may happen on different threads).

But obviously, given that Spikhalskiy@6d260b8 fixes the issue, something is missing. We probably read a stale version of enabled, and that's why the additional lbHelper == null check in idleTimer helps.
That's why I don't propose Spikhalskiy@6d260b8 as the fix: it looks like something may be broken in SynchronizationContext, and this change might treat a symptom instead of the bigger issue.
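The happens-before expectation described above can be illustrated with a minimal, self-contained model of the queue-plus-CAS drain pattern. This is a greatly simplified sketch, not the real io.grpc.SynchronizationContext (which handles re-drains more carefully):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch: tasks are serialized, and nested execute() calls are
// queued and run after the current task finishes.
class SyncContextDemo {
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicReference<Thread> drainer = new AtomicReference<>();

    void execute(Runnable task) {
        queue.add(task);  // the concurrent queue provides happens-before between add and poll
        if (drainer.compareAndSet(null, Thread.currentThread())) {
            try {
                Runnable r;
                while ((r = queue.poll()) != null) {
                    r.run();  // tasks run strictly one at a time
                }
            } finally {
                drainer.set(null);
            }
        }
        // If the CAS fails, another thread (or an outer frame on this thread)
        // is already draining and will run the task we just queued.
    }
}
```

Under this model, two Runnables submitted via execute() never run concurrently, which is why a stale read of `enabled` between them is surprising.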

@dapengzhang0

The channel is initially IDLE and if the idleTimer expires too soon, it will try to enterIdleMode() at the initial IDLE state.

I think Spikhalskiy/grpc-java@6d260b8 is the right fix.

@ejona86 what do you think?

@ejona86
Copy link
Member

ejona86 commented Dec 7, 2021

@dapengzhang0, I think checking for lbHelper == null would fix this bug. It does make some parts of the code unclear (e.g., why wouldn't enterIdleMode() just do the check?) and can hide other bugs. But it is fast and easy and seems a fair initial fix.

@dapengzhang0
Copy link
Member

But I'm not sure if it's the right fix, or whether the race should be addressed somewhere inside io.grpc.internal.Rescheduler or io.grpc.SynchronizationContext.
Because I would expect that

  1. the Rescheduler#cancel(permanent = false) call that sets enabled=false for idleTimer inside enterIdle, and
  2. idleTimer's check of the enabled flag inside ChannelFutureRunnable#run

don't race with each other, because they are both run under SynchronizationContext#execute.

I think the cause of enabled spuriously becoming true might be:

At enterIdleMode() time, ManagedChannelImpl.inUseStateAggregator was not tracking the in-use status of the InternalSubchannel, unlike exitIdleMode():

if (inUseStateAggregator.anyObjectInUse(pendingCallsInUseObject, delayedTransport)) {

So even though Rescheduler#cancel(permanent = false) sets enabled=false, when the InternalSubchannel later becomes not in-use, rescheduleIdleTimer() (namely Rescheduler.enabled=true) can be called by:

protected void handleNotInUse() {
  if (shutdown.get()) {
    return;
  }
  rescheduleIdleTimer();
}
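The sequence described above can be traced with a toy model of Rescheduler's enabled flag. This is a simplified illustration; the method names only loosely follow the real class:

```java
// Toy model: cancel(permanent=false) clears the flag but leaves the scheduled
// task in place, so a later reschedule() re-arms the very same stale timer.
class ReschedulerDemo {
    boolean enabled;

    void reschedule() {            // the rescheduleIdleTimer() path
        enabled = true;
    }

    void cancel(boolean permanent) {
        enabled = false;           // permanent=false keeps the scheduled future alive
    }

    boolean staleTimerFires() {    // what ChannelFutureRunnable#run effectively checks
        return enabled;
    }
}
```

Tracing reschedule() -> cancel(false) -> reschedule() shows the stale timer finding enabled=true when it finally runs, matching the sequence above.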


ejona86 commented Dec 14, 2021

We believe this is a regression introduced by #8425, which was part of 1.41.
