Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Needs investigation: channel panic upon shutdown after xds call completes #7942

Closed
YifeiZhuang opened this issue Mar 5, 2021 · 3 comments · Fixed by #7971
Closed

Needs investigation: channel panic upon shutdown after xds call completes #7942

YifeiZhuang opened this issue Mar 5, 2021 · 3 comments · Fixed by #7971
Assignees
Labels
Milestone

Comments

@YifeiZhuang
Copy link
Contributor

YifeiZhuang commented Mar 5, 2021

zivy@grpc-client-zivy:~/grpc-java/examples/example-xds$ ./build/install/example-xds/bin/xds-hello-world-client "world"     xds:///grpc.weighted.io
Mar 05, 2021 6:21:45 PM io.grpc.examples.helloworldxds.XdsHelloWorldClient greet
INFO: Will try to greet world ...
Mar 05, 2021 6:21:46 PM com.google.auth.oauth2.DefaultCredentialsProvider warnAboutProblematicCredentials
WARNING: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from
 Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/.
Mar 05, 2021 6:21:49 PM io.grpc.examples.helloworldxds.XdsHelloWorldClient greet
INFO: Greeting: Hello world, from grpc-td-mig-us-central1-zivy-wp1n
Mar 05, 2021 6:21:49 PM io.grpc.internal.ManagedChannelImpl$2 uncaughtException
SEVERE: [Channel<1>: (xds:///grpc.weighted.io)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.IllegalStateException: Channel is being terminated
        at com.google.common.base.Preconditions.checkState(Preconditions.java:511)
        at io.grpc.internal.ManagedChannelImpl$LbHelperImpl.createSubchannel(ManagedChannelImpl.java:1423)
        at io.grpc.internal.ManagedChannelImpl$LbHelperImpl.createSubchannel(ManagedChannelImpl.java:1416)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createSubchannel(ClusterImplLoadBalancer.java:222)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:45)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HelperImpl.createSubchannel(HealthCheckingLoadBalancerFactory.java:117)
        at io.grpc.util.RoundRobinLoadBalancer.handleResolvedAddresses(RoundRobinLoadBalancer.java:101)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckingLoadBalancer.handleResolvedAddresses(HealthCheckingLoadBalancerFactory.java:189)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.xds.WeightedTargetLoadBalancer.handleResolvedAddresses(WeightedTargetLoadBalancer.java:88)
        at io.grpc.xds.ClusterImplLoadBalancer.handleResolvedAddresses(ClusterImplLoadBalancer.java:135)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.xds.PriorityLoadBalancer$ChildLbState.updateResolvedAddresses(PriorityLoadBalancer.java:268)
        at io.grpc.xds.PriorityLoadBalancer.tryNextPriority(PriorityLoadBalancer.java:135)
        at io.grpc.xds.PriorityLoadBalancer.handleResolvedAddresses(PriorityLoadBalancer.java:102)
        at io.grpc.xds.ClusterResolverLoadBalancer$ClusterResolverLbState.handleEndpointResourceUpdate(ClusterResolverLoadBalancer.java:251)
        at io.grpc.xds.ClusterResolverLoadBalancer$ClusterResolverLbState.access$1700(ClusterResolverLoadBalancer.java:155)
        at io.grpc.xds.ClusterResolverLoadBalancer$ClusterResolverLbState$EdsClusterState$1EndpointsUpdated.run(ClusterResolverLoadBalancer.java:427)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.ClusterResolverLoadBalancer$ClusterResolverLbState$EdsClusterState.onChanged(ClusterResolverLoadBalancer.java:431)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.notifyWatcher(ClientXdsClient.java:1408)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.onData(ClientXdsClient.java:1367)
        at io.grpc.xds.ClientXdsClient.handleEdsResponse(ClientXdsClient.java:973)
        at io.grpc.xds.AbstractXdsClient$AbstractAdsStream.handleRpcResponse(AbstractXdsClient.java:501)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1$1.run(AbstractXdsClient.java:583)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:575)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:572)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:465)
        at io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:447)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:652)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:637)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
@YifeiZhuang YifeiZhuang self-assigned this Mar 5, 2021
@ejona86
Copy link
Member

ejona86 commented Mar 9, 2021

CC @voidzcy

@ejona86 ejona86 added this to the Next milestone Mar 9, 2021
@ejona86 ejona86 added the bug label Mar 9, 2021
@YifeiZhuang
Copy link
Contributor Author

YifeiZhuang commented Mar 12, 2021

Finer stack trace here
It looks childlb(clusterResolverLb) might need to be called shutdown as well. Otherwise it would cause xds client channels to be shutdown prematurely while EdsWatcher is still alive. Note the EdsClusterState.onChanged stack trace.
It looks the channel panic issue is disappearing after certain number of executions, with the change:
https://github.com/grpc/grpc-java/blob/master/xds/src/main/java/io/grpc/xds/CdsLoadBalancer2.java#L110

+      if (childLb != null) {
+        childLb.shutdown();
+      }

@voidzcy could you remind me of the shutdown path of clusterResolverLb?

@voidzcy
Copy link
Contributor

voidzcy commented Mar 12, 2021

Ah, you are probably right. It turns out the shutdown operation is not hooked correctly between the CDS LB policy and ClusterResolver LB policy. During Channel shutdown, it shuts down the resolver, then the top-level load balancer (AKA ClusterManagerLoadBalancer). The top-level load balancer's shutdown triggers shutdown for the entire LB tree. This shutdown process is recursive, so each LB policy should call shutdown() for its child/children LB policy(ies). What you pointed out looks exactly what is missing 👍

@ejona86 ejona86 modified the milestones: Next, 1.37 Mar 16, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants