
Race condition in idleTimer & ManagedChannel#enterIdle - refactor needed #8714

Open
Spikhalskiy opened this issue Nov 18, 2021 · 11 comments
Spikhalskiy commented Nov 18, 2021

Edit (2021/12/15): The following bug is fixed with a hot patch #8746. However, the logic of idleTimer in ManagedChannel is complex, delicate, and bug-prone. Refactoring is needed to make it more robust and easy to understand.

What version of gRPC-Java are you using?

1.41.0, current master: 8382bd8

What is your environment?

Java 11/16, MacOS. Original user report is probably from a Linux server.

What did you expect to see?

We trigger ManagedChannel#enterIdle API periodically while also keeping ManagedChannelImpl#idleTimer with default settings. We expect ManagedChannelImpl to work properly and not to end up in panic.

What did you see instead?

Caused by: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    ...

Caused by: java.lang.IllegalStateException: nameResolver is not started
    at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
    at io.grpc.internal.ManagedChannelImpl.shutdownNameResolverAndLoadBalancer(ManagedChannelImpl.java:360)
    at io.grpc.internal.ManagedChannelImpl.enterIdleMode(ManagedChannelImpl.java:422)
    at io.grpc.internal.ManagedChannelImpl.access$900(ManagedChannelImpl.java:118)
    at io.grpc.internal.ManagedChannelImpl$IdleModeTimer.run(ManagedChannelImpl.java:352)
    at io.grpc.internal.Rescheduler$ChannelFutureRunnable.run(Rescheduler.java:103)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
    at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
    at io.grpc.internal.Rescheduler$FutureRunnable.run(Rescheduler.java:80)
    at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)

The channel is broken after this happens.

Steps to reproduce the bug

  1. Apply the change to io.grpc.testing.integration.XdsTestClient:
    Spikhalskiy@86a0000
  2. Start io.grpc.testing.integration.XdsTestServer#main
  3. Start io.grpc.testing.integration.XdsTestClient#main with --num_channels=1000
  4. Find the reported exception in XdsTestClient logs

Relevant report

The original user report in Temporal JavaSDK: temporalio/sdk-java#863
According to the report, the issue happened after a long period of channel inactivity (> idleTimer period).
My best guess is that our periodic #enterIdle shuts down the nameResolver while a still-scheduled idleTimer (cancelled with cancel(permanent=false)) sees a stale enabled=true and performs the action anyway. But I don't see how this can happen.

Spikhalskiy added a commit to Spikhalskiy/grpc-java that referenced this issue Nov 19, 2021
Spikhalskiy added a commit to Spikhalskiy/grpc-java that referenced this issue Nov 19, 2021
@Spikhalskiy Spikhalskiy changed the title Possible race condition in idleTimer & ManagedChannel#enterIdle Race condition in idleTimer & ManagedChannel#enterIdle Nov 19, 2021

dapengzhang0 commented Nov 19, 2021

@ejona86 This seems to be a bug. Looking at the code, I don't really understand the comment
https://github.com/grpc/grpc-java/blob/v1.42.1/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L418-L422

It seems not true if enterIdle() is called even before the resolver gets started.


Spikhalskiy commented Nov 19, 2021

Also, kinda irrelevant, but to work around this I had to temporarily disable idleTimer, and I found this code in ManagedChannelImplBuilder#idleTimeout:

    if (unit.toDays(value) >= IDLE_MODE_MAX_TIMEOUT_DAYS) {
      // This disables idle mode
      this.idleTimeoutMillis = ManagedChannelImpl.IDLE_TIMEOUT_MILLIS_DISABLE;

So to disable the idleTimer, I need to pass a value larger than 30 days. Passing ManagedChannelImpl.IDLE_TIMEOUT_MILLIS_DISABLE (which is -1) doesn't work and leads to an exception.
It would be great if ManagedChannelBuilder provided a better interface for this. I would personally prefer having a public constant equal to -1 and mentioning it in the ManagedChannelBuilder#idleTimeout javadoc.
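For reference, the threshold behavior quoted above can be sketched with a small self-contained model. The constants mirror the ones named in this thread, but the helper method `resolveIdleTimeoutMillis` is made up for illustration; this is not gRPC's actual code:

```java
import java.util.concurrent.TimeUnit;

// Toy model of the quoted ManagedChannelImplBuilder#idleTimeout logic.
class IdleTimeoutDemo {
    static final long IDLE_TIMEOUT_MILLIS_DISABLE = -1; // internal sentinel, not public API
    static final int IDLE_MODE_MAX_TIMEOUT_DAYS = 30;

    // Hypothetical helper mirroring the builder's branching.
    static long resolveIdleTimeoutMillis(long value, TimeUnit unit) {
        if (value <= 0) {
            // This is why passing -1 directly fails with an exception.
            throw new IllegalArgumentException("idle timeout must be positive: " + value);
        }
        if (unit.toDays(value) >= IDLE_MODE_MAX_TIMEOUT_DAYS) {
            // Anything at or above 30 days disables idle mode.
            return IDLE_TIMEOUT_MILLIS_DISABLE;
        }
        return unit.toMillis(value);
    }
}
```

Under this sketch, `idleTimeout(31, TimeUnit.DAYS)` disables idle mode, while `idleTimeout(-1, ...)` throws.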


ejona86 commented Nov 20, 2021

It seems not true if enterIdle() is called even before the resolver gets started.

Why do you say that? NameResolver is initialized in the constructor.

this.nameResolver = getNameResolver(

It looks like this is a timer bug. The timer firing is racing with the channel being forced into idle (which cancels the timer) by the outside caller.

enterIdle() has a check to avoid re-entering idle mode (lbHelper == null):

public void enterIdle() {
  final class PrepareToLoseNetworkRunnable implements Runnable {
    @Override
    public void run() {
      if (shutdown.get() || lbHelper == null) {
        return;
      }
      cancelIdleTimer(/* permanent= */ false);
      enterIdleMode();
    }
  }

  syncContext.execute(new PrepareToLoseNetworkRunnable());
}

But the idle timer doesn't, nor does Rescheduler, as it just treats the SynchronizationContext as an Executor.

private class IdleModeTimer implements Runnable {
  @Override
  public void run() {
    enterIdleMode();
  }
}

We need to either make the timer cancellation non-racy, or we need to add double-check behavior in IdleModeTimer. I generally prefer making timer cancellation non-racy, but Rescheduler will probably make that more annoying.
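Concretely, the double-check option could look something like the following self-contained sketch. The class and field names here are stand-ins for the relevant ManagedChannelImpl state, not gRPC's actual code: the timer body re-validates that the channel has not already been forced idle before acting.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the channel state involved in the race.
class IdleTimerDoubleCheckDemo {
    Object lbHelper = new Object();      // non-null while the channel is out of idle
    final AtomicBoolean shutdown = new AtomicBoolean();
    int badEnterIdleCalls = 0;           // enterIdleMode() while already idle => "Panic!"

    void enterIdleMode() {
        if (lbHelper == null) {
            badEnterIdleCalls++;         // in the real code this trips checkState(...)
            return;
        }
        lbHelper = null;                 // shut down name resolver and load balancer
    }

    // Timer body with the double-check added, mirroring the guard enterIdle() already has:
    Runnable idleModeTimer() {
        return () -> {
            if (shutdown.get() || lbHelper == null) {
                return;                  // stale timer: channel was already forced idle
            }
            enterIdleMode();
        };
    }
}
```

With the guard in place, a timer that fires after an external enterIdle() has already entered idle mode becomes a no-op instead of tripping the checkState.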


For disabling idleTimeout, the expectation is you'd pass Long.MAX_VALUE or the like. Seems we are missing language like we have on keepAliveTime. We should add something similar to idleTimeout:

Long.MAX_VALUE nano seconds or an unreasonably large value will disable keepalive


dapengzhang0 commented Nov 22, 2021

It seems not true if enterIdle() is called even before the resolver gets started.

Why do you say that? NameResolver is initialized in the constructor.

Sorry, I mixed up NameResolver being null with NameResolver not being started. The comment in the code is correct; the NameResolver is not null. My comment should have been:

The following checkState seems not true if enterIdle() is called even before the resolver gets started:

checkState(nameResolverStarted, "nameResolver is not started");

@dapengzhang0 dapengzhang0 added this to the Next milestone Nov 22, 2021
@dapengzhang0 dapengzhang0 self-assigned this Nov 22, 2021

ejona86 commented Nov 22, 2021

The following checkState seems not true if enterIdle() is called even before the resolver gets started:

If the resolver is not started then the channel is IDLE, thus there's no point in calling enterIdle(). That's why the current semantics are enterIdle() should only be called if non-IDLE. If it is being called when already IDLE (which is a bug in the caller), there are two possible approaches: 1) stop doing that; fix the code that is calling it inappropriately or 2) change its semantics to allow calling even when IDLE.


Spikhalskiy commented Nov 23, 2021

@ejona86
Approach 2 is already implemented.
Calling enterIdle is already fine if the channel is IDLE. There is a check inside enterIdle for lbHelper == null that makes enterIdle a no-op if the channel is already IDLE.

Thoughts on why Approach 1 ("stop doing that; fix the code that is calling it inappropriately") would not solve anything:

  1. Following the exception stacktrace, enterIdle is most likely called before idleTimer fires, and the race happens after enterIdle and before idleTimer. It's unclear how checking for IDLE before calling enterIdle would help here: the exception about the unexpected state comes from idleTimer, not from enterIdle.
  2. It would not help anyway, because there is already an effective check for the IDLE state inside enterIdle that runs under the SynchronizationContext (
    if (shutdown.get() || lbHelper == null) {
    ). An additional check outside can't add anything, so it's not a solution. The problem is a race, not a double call to enterIdle.
  3. The ManagedChannel implementation is supposed to be thread-safe. Approach 1's requirement to check for IDLE before calling enterIdle would require external synchronization (because idleTimer can fire in between), which makes the API effectively non-thread-safe.
  4. And what would our caller even use for the synchronization described in (3) to make Approach 1 work?

Disabling idleTimer in the provided reproduction fixes the test. So the problem is clearly not in the enterIdle contract; the issue is a race condition.
The problem is probably that there is no check inside idleTimer, and the cancellation of idleTimer inside enterIdle races with its scheduling/execution. The absence of a check during execution, plus the race between cancelling the scheduled task and the scheduled execution itself, can lead to the bug.


Spikhalskiy commented Nov 23, 2021

This change in IdleModeTimer fixes the race and preserves all the current contracts:
Spikhalskiy@6d260b8

But I'm not sure if it's the right fix, or whether the race should be addressed somewhere inside io.grpc.internal.Rescheduler or io.grpc.SynchronizationContext.
Because I would expect that

  1. the Rescheduler#cancel(permanent = false) call that sets enabled=false for idleTimer inside enterIdle, and
  2. idleTimer's check of the enabled flag inside ChannelFutureRunnable#run

don't race with each other, because they are both run under SynchronizationContext#execute. I would expect SynchronizationContext to guarantee happens-before between executions of Runnables, which it looks like it attempts to provide by using a queue (happens-before between add and poll) and CAS (happens-before between drains that may happen on different threads).

But obviously, given that Spikhalskiy@6d260b8 fixes the issue, something is missing. We probably read a stale version of enabled, and that's why the additional lbHelper == null check in idleTimer helps.
That's why I don't propose Spikhalskiy@6d260b8 as the fix: it looks like something may be broken in SynchronizationContext, and this change might treat a symptom instead of the bigger issue.
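The happens-before expectation described above can be illustrated with a minimal, self-contained model of the queue-plus-CAS drain pattern. This is a greatly simplified sketch, not the real io.grpc.SynchronizationContext (which handles re-drains more carefully):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch: tasks are serialized, and nested execute() calls are
// queued and run after the current task finishes.
class SyncContextDemo {
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicReference<Thread> drainer = new AtomicReference<>();

    void execute(Runnable task) {
        queue.add(task);  // the concurrent queue provides happens-before between add and poll
        if (drainer.compareAndSet(null, Thread.currentThread())) {
            try {
                Runnable r;
                while ((r = queue.poll()) != null) {
                    r.run();  // tasks run strictly one at a time
                }
            } finally {
                drainer.set(null);
            }
        }
        // If the CAS fails, another thread (or an outer frame on this thread)
        // is already draining and will run the task we just queued.
    }
}
```

Under this model, two Runnables submitted via execute() never run concurrently, which is why a stale read of `enabled` between them is surprising.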

@dapengzhang0

The channel is initially IDLE and if the idleTimer expires too soon, it will try to enterIdleMode() at the initial IDLE state.

I think Spikhalskiy/grpc-java@6d260b8 is the right fix.

@ejona86 what do you think?

@ejona86
Copy link
Member

ejona86 commented Dec 7, 2021

@dapengzhang0, I think checking for lbHelper == null would fix this bug. It does make some parts of the code unclear (e.g., why wouldn't enterIdleMode() just do the check?) and can hide other bugs. But it is fast and easy and seems a fair initial fix.

@dapengzhang0
Copy link
Member

But I'm not sure if it's the right fix, or whether the race should be addressed somewhere inside io.grpc.internal.Rescheduler or io.grpc.SynchronizationContext.
Because I would expect that

  1. the Rescheduler#cancel(permanent = false) call that sets enabled=false for idleTimer inside enterIdle, and
  2. idleTimer's check of the enabled flag inside ChannelFutureRunnable#run

don't race with each other, because they are both run under SynchronizationContext#execute.

I think the cause of enabled spuriously becoming true might be:

At enterIdleMode() time, ManagedChannelImpl.inUseStateAggregator was not tracking the in-use status of the InternalSubchannel, unlike exitIdleMode():

if (inUseStateAggregator.anyObjectInUse(pendingCallsInUseObject, delayedTransport)) {

So even though Rescheduler#cancel(permanent = false) sets enabled=false, when the InternalSubchannel later becomes not in-use, rescheduleIdleTimer() (namely Rescheduler.enabled=true) can be called by:

protected void handleNotInUse() {
  if (shutdown.get()) {
    return;
  }
  rescheduleIdleTimer();
}
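The sequence described above can be traced with a toy model of Rescheduler's enabled flag. This is a simplified illustration; the method names only loosely follow the real class:

```java
// Toy model: cancel(permanent=false) clears the flag but leaves the scheduled
// task in place, so a later reschedule() re-arms the very same stale timer.
class ReschedulerDemo {
    boolean enabled;

    void reschedule() {            // the rescheduleIdleTimer() path
        enabled = true;
    }

    void cancel(boolean permanent) {
        enabled = false;           // permanent=false keeps the scheduled future alive
    }

    boolean staleTimerFires() {    // what ChannelFutureRunnable#run effectively checks
        return enabled;
    }
}
```

Tracing reschedule() -> cancel(false) -> reschedule() shows the stale timer finding enabled=true when it finally runs, matching the sequence above.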


ejona86 commented Dec 14, 2021

We believe this is a regression introduced by #8425, which was part of 1.41.
