Internal error in gRPC channel of WorkflowServiceStubs #863

Closed
smax48 opened this issue Nov 9, 2021 · 2 comments · Fixed by #889
Labels
bug Something isn't working

smax48 commented Nov 9, 2021

Expected Behavior

A single instance of WorkflowServiceStubs is created once per application and used for all communication with the Temporal frontend API.

Actual Behavior

After a long period of inactivity (~40 min), we see an internal exception when trying to start a new workflow: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug! The channel becomes unusable after that: every subsequent attempt to use it fails with the same error.

I cannot prove it, but this might be a race condition between the idleTimer in the netty channel (which has a 30 min default timeout and is not configured by the Temporal SDK) and Temporal's own mechanism for resetting gRPC connections via setGrpcReconnectFrequency() (which also calls enterIdle() on the channel). I am not sure why the existing idleTimer is not used for this purpose in the first place, but maybe there are specific considerations.

Full stack trace:

io.temporal.client.WorkflowServiceException: workflowId='XXX123', runId='', workflowType='SomeWorkflowType'}
    at io.temporal.internal.sync.WorkflowStubImpl.wrapStartException(WorkflowStubImpl.java:184)
    at io.temporal.internal.sync.WorkflowStubImpl.startWithOptions(WorkflowStubImpl.java:120)
    at io.temporal.internal.sync.WorkflowStubImpl.start(WorkflowStubImpl.java:138)
    at io.temporal.internal.sync.WorkflowInvocationHandler.startWorkflow(WorkflowInvocationHandler.java:192)
    at io.temporal.internal.sync.WorkflowInvocationHandler.access$300(WorkflowInvocationHandler.java:48)
    at io.temporal.internal.sync.WorkflowInvocationHandler$ExecuteWorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:339)
    at io.temporal.internal.sync.WorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:178)
    <app-specific call stack omitted>
    ............
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.startWorkflowExecution(WorkflowServiceGrpc.java:2615)
    at io.temporal.internal.external.GenericWorkflowClientExternalImpl.lambda$start$0(GenericWorkflowClientExternalImpl.java:88)
    at io.temporal.internal.retryer.GrpcSyncRetryer.retry(GrpcSyncRetryer.java:61)
    at io.temporal.internal.retryer.GrpcRetryer.retryWithResult(GrpcRetryer.java:51)
    at io.temporal.internal.external.GenericWorkflowClientExternalImpl.start(GenericWorkflowClientExternalImpl.java:81)
    at io.temporal.internal.client.RootWorkflowClientInvoker.start(RootWorkflowClientInvoker.java:55)
    at io.temporal.common.interceptors.WorkflowClientCallsInterceptorBase.start(WorkflowClientCallsInterceptorBase.java:35)
    at io.temporal.opentracing.internal.OpenTracingWorkflowClientCallsInterceptor.start(OpenTracingWorkflowClientCallsInterceptor.java:50)
    at io.temporal.internal.sync.WorkflowStubImpl.startWithOptions(WorkflowStubImpl.java:113)
    ... 25 common frames omitted
Caused by: java.lang.IllegalStateException: nameResolver is not started
    at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
    at io.grpc.internal.ManagedChannelImpl.shutdownNameResolverAndLoadBalancer(ManagedChannelImpl.java:360)
    at io.grpc.internal.ManagedChannelImpl.enterIdleMode(ManagedChannelImpl.java:422)
    at io.grpc.internal.ManagedChannelImpl.access$900(ManagedChannelImpl.java:118)
    at io.grpc.internal.ManagedChannelImpl$IdleModeTimer.run(ManagedChannelImpl.java:352)
    at io.grpc.internal.Rescheduler$ChannelFutureRunnable.run(Rescheduler.java:103)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
    at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
    at io.grpc.internal.Rescheduler$FutureRunnable.run(Rescheduler.java:80)
    at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    .............
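
The innermost cause in the trace above is a failed precondition during idle entry. As a rough illustration of the suspected failure mode, here is a toy model in plain Java. It is only a sketch under assumptions: ToyChannel, its nameResolverStarted flag, and IdleRaceSketch are hypothetical names invented for this example and do not reflect how grpc-java's ManagedChannelImpl is actually written. The point is just that two uncoordinated idle-entry paths can trip a strict "already started" check.

```java
/** Toy stand-in for a channel. This is NOT grpc-java's ManagedChannelImpl. */
final class ToyChannel {
    // Hypothetical flag: true while the name resolver is running (channel not idle).
    private boolean nameResolverStarted = true;

    /** Idle entry with a strict precondition: it assumes the channel is not idle yet. */
    synchronized void enterIdle() {
        if (!nameResolverStarted) {
            throw new IllegalStateException("nameResolver is not started");
        }
        nameResolverStarted = false;
    }

    /** Leaving idle restarts the (hypothetical) name resolver. */
    synchronized void exitIdle() {
        nameResolverStarted = true;
    }

    synchronized boolean isIdle() {
        return !nameResolverStarted;
    }
}

final class IdleRaceSketch {
    /** Runs two idle entries back to back; returns the failure message, or null if none. */
    static String secondEntryFailure() {
        ToyChannel channel = new ToyChannel();
        channel.enterIdle();         // first path, e.g. a periodic reconnect task
        try {
            channel.enterIdle();     // second path, e.g. an idle timer firing afterwards
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("second idle entry: " + secondEntryFailure());
    }
}
```

In this model the second entry fails only because nothing moved the channel out of idle between the two calls. Whether the real bug follows exactly this sequence is not established by the trace alone.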

Specifications

  • Version: Java SDK 1.4.0
@Spikhalskiy
Contributor

I was able to reproduce the bug in the grpc-java source code alone, without any Temporal code, and submitted grpc/grpc-java#8714.

Spikhalskiy added a commit to Spikhalskiy/java-sdk that referenced this issue Nov 19, 2021
…RPC-java

Improve connection management related documentation
Issue temporalio#863
Spikhalskiy added a commit that referenced this issue Nov 19, 2021
…RPC-java (#889)

Improve connection management related documentation
Issue #863
@Spikhalskiy
Contributor

A temporary workaround has been implemented while we wait for the related gRPC issue to be fixed.
The workaround will stay in place for a while so that users have time to upgrade to gRPC versions that include the fix.
The manual calling of enterIdle to achieve load balancing should potentially go away as part of #888.
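
For readers unfamiliar with the "manual calling of enterIdle" mentioned above, the general idea can be pictured as a periodic task that asks the channel to go idle so that the next call re-resolves and reconnects. The following is a minimal sketch in plain Java, not the SDK's implementation and not the workaround merged in #889. IdleCapableChannel and PeriodicIdleSketch are hypothetical names; the interface mirrors only the enterIdle() method name from the discussion, and the period stands in for a reconnect frequency.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical minimal view of a channel: only the one call this sketch needs. */
interface IdleCapableChannel {
    void enterIdle();
}

final class PeriodicIdleSketch {
    /**
     * Schedules channel.enterIdle() at a fixed period and returns the scheduler
     * so the caller can shut it down when the channel is closed.
     */
    static ScheduledExecutorService start(IdleCapableChannel channel, long periodMillis) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "periodic-idle");
                t.setDaemon(true); // do not keep the JVM alive for this task
                return t;
            });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                channel.enterIdle();
            } catch (RuntimeException e) {
                // Catching here keeps the schedule alive; an uncaught exception
                // would suppress all later runs of a fixed-rate task.
                System.err.println("enterIdle failed: " + e);
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch threeCalls = new CountDownLatch(3);
        ScheduledExecutorService scheduler = start(threeCalls::countDown, 20);
        boolean reached = threeCalls.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        System.out.println("reached three calls: " + reached);
    }
}
```

A task like this runs independently of any idle timer the channel itself maintains, which is why the two mechanisms need to tolerate each other.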

Spikhalskiy added this to the 1.6.0 milestone Nov 23, 2021