
java.lang.IllegalStateException · May only be called before start while using enableRetry() #8222

Closed
jchau207 opened this issue May 30, 2021 · 11 comments

What version of gRPC-Java are you using?

io.grpc:grpc-android:1.37.0

What is your environment?

Android devices. Java 8

What did you expect to see?

I expect the stream call to be retried if the call to start the stream fails.

What did you see instead?

A small percentage of our users are crashing with the following stack trace:

java.lang.IllegalStateException · May only be called before start
Preconditions.java:511 com.google.common.base.Preconditions.checkState	
DelayedStream.java:377 io.grpc.internal.DelayedStream.setCompressor	
RetriableStream.java:562 io.grpc.internal.RetriableStream$1CompressorEntry.runWith	
RetriableStream.java:287 io.grpc.internal.RetriableStream.drain	
RetriableStream.java:52 io.grpc.internal.RetriableStream.access$900	
RetriableStream.java:820 io.grpc.internal.RetriableStream$Sublistener$1.run	
ThreadPoolExecutor.java:1167 java.util.concurrent.ThreadPoolExecutor.runWorker	
ThreadPoolExecutor.java:641 java.util.concurrent.ThreadPoolExecutor$Worker.run	
Thread.java:919 java.lang.Thread.run

Steps to reproduce the bug

We implemented our channel builder with this configuration:

val config = getRetryingServiceConfig(context)
return AndroidChannelBuilder.forAddress("server url", 443)
    .context(context)
    .keepAliveWithoutCalls(true)
    .defaultServiceConfig(config)
    .enableRetry()
    .keepAliveTime(5, TimeUnit.SECONDS)
    .keepAliveTimeout(5, TimeUnit.SECONDS)
    .useTransportSecurity()
    .intercept(MetadataInterceptor())
    .build()

Our retry configuration is as follows:

{
  "methodConfig": [
    {
      "name": [
        {
          "service": "GeolocationService"
        },
        {
          "service": "ChartsService"
        }
      ],
      "retryPolicy": {
        "maxAttempts": 5,
        "initialBackoff": "0.5s",
        "maxBackoff": "30s",
        "backoffMultiplier": 2,
        "retryableStatusCodes": [
          "UNAVAILABLE"
        ]
      }
    }
  ]
}
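The issue does not show getRetryingServiceConfig(). A minimal sketch of such a helper, assuming the JSON above is stored as an Android raw resource and parsed with Gson (the resource name and the use of Gson are assumptions, not from the report):

import android.content.Context
import com.google.gson.Gson

// Hypothetical helper; the real getRetryingServiceConfig() is not shown in the issue.
// ManagedChannelBuilder.defaultServiceConfig() expects the service config as a Map<String, ?>.
fun getRetryingServiceConfig(context: Context): Map<String, Any> {
    val json = context.resources.openRawResource(R.raw.retrying_service_config) // assumed resource name
        .bufferedReader()
        .use { it.readText() }
    @Suppress("UNCHECKED_CAST")
    return Gson().fromJson(json, Map::class.java) as Map<String, Any>
}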

Our service calls are stream calls that are built like this:

private val service: EventStreamServiceStub = EventStreamServiceGrpc.newStub(channel)

val request: GetWalletLiveChartRequest = GetWalletLiveChartRequest.newBuilder().build()

clientWalletCall =
    channel.newCall(ChartsServiceGrpc.getGetWalletLiveChartMethod(), service.callOptions)
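For context on where start() fits in, a raw ClientCall built this way is then started with a listener before any message is requested or sent. A sketch continuing the snippet above (the response type and the listener bodies are assumptions, not from the report):

// Sketch of driving the raw ClientCall; the response type is assumed from the method name.
clientWalletCall.start(object : ClientCall.Listener<GetWalletLiveChartResponse>() {
    override fun onMessage(message: GetWalletLiveChartResponse) {
        // Handle one streamed chart update, then ask for the next one.
        clientWalletCall.request(1)
    }

    override fun onClose(status: Status, trailers: Metadata) {
        // UNAVAILABLE here is the status the retryPolicy above retries.
    }
}, Metadata())
clientWalletCall.request(1)
clientWalletCall.sendMessage(request)
clientWalletCall.halfClose()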
@YifeiZhuang added the bug label Jun 1, 2021
@YifeiZhuang self-assigned this Jun 1, 2021
@YifeiZhuang (Contributor) commented Jun 1, 2021

Surfaced by change #7750, since v1.36. IIRC there would be race conditions between these pending calls and start() if we allowed these methods to be called after start, since the pending pre-start calls are not synchronized.
Without looking into the retry mechanism details, @dapengzhang0, can we enforce the ordering between start() and the pre-start calls for the retriable stream?
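For readers unfamiliar with the internals: DelayedStream buffers pre-start configuration such as setCompressor() and asserts that those methods only run before start(); RetriableStream later drains the buffered entries on an executor. A simplified illustration of the two invariants this thread trips (not the actual gRPC code):

// Simplified illustration of the DelayedStream preconditions; not the real implementation.
class DelayedStreamSketch {
    @Volatile private var started = false

    fun setCompressor(compressorName: String) {
        check(!started) { "May only be called before start" } // the first crash in this thread
        // ... buffer the compressor until the real stream exists ...
    }

    fun start() {
        check(!started) { "already started" } // the variant reported later in this thread on 1.40
        started = true
        // ... create the real stream and drain the buffered calls ...
    }
}

If the executor drains the buffered entries concurrently with, or again after, start(), either precondition can fire, which matches both stack traces in this thread.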

@YifeiZhuang (Contributor)

@jchau207 how often do you see this issue?

@jchau207 (Author) commented Jun 2, 2021

@YifeiZhuang we encountered 1,262 occurrences over 260k sessions, so roughly 0.5% of sessions.

@ejona86 (Member) commented Jun 17, 2021

@YifeiZhuang, @dapengzhang0, do we still not have an idea of how this would be happening?

@dapengzhang0 (Member) commented Jun 22, 2021

@ejona86 I did not find any clue. I could not find any way for setCompressor() to be called after start().

@jchau207 What does MetadataInterceptor() do? Can you share its implementation?

@dapengzhang0 (Member) commented Jun 22, 2021

The issue might be caused by Listener.close() being called more than once, as in #7921.

@jchau207 Does the interceptor call Listener.close()?
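To illustrate the suspected failure mode from #7921: a listener that forwards onClose() twice can corrupt the retry bookkeeping. A hypothetical buggy pattern (not the reporter's code; nothing in this issue shows such code):

import io.grpc.ClientCall
import io.grpc.ForwardingClientCallListener.SimpleForwardingClientCallListener
import io.grpc.Metadata
import io.grpc.Status

// Hypothetical buggy listener; the double super.onClose() is the failure mode
// suspected here (see #7921).
private class DoubleCloseListener<RespT>(
    delegate: ClientCall.Listener<RespT>
) : SimpleForwardingClientCallListener<RespT>(delegate) {
    override fun onClose(status: Status, trailers: Metadata) {
        super.onClose(status, trailers)
        super.onClose(status, trailers) // second close can re-trigger retry-stream state transitions
    }
}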

@jchau207 (Author)

> @ejona86 I did not find any clue. I could not find any way for setCompressor() to be called after start().
>
> @jchau207 What does MetadataInterceptor() do? Can you share its implementation?

@dapengzhang0 this is our MetadataInterceptor implementation.

import io.grpc.CallOptions
import io.grpc.Channel
import io.grpc.ClientCall
import io.grpc.ClientInterceptor
import io.grpc.ForwardingClientCall.SimpleForwardingClientCall
import io.grpc.Metadata
import io.grpc.MethodDescriptor

class MetadataInterceptor : ClientInterceptor {

    override fun <ReqT, RespT> interceptCall(
        method: MethodDescriptor<ReqT, RespT>,
        callOptions: CallOptions,
        next: Channel
    ): ClientCall<ReqT, RespT> {
        return MetadataAttach(next.newCall(method, callOptions))
    }

    private inner class MetadataAttach<ReqT, RespT>(call: ClientCall<ReqT, RespT>) :
        SimpleForwardingClientCall<ReqT, RespT>(call) {
        override fun start(
            responseListener: Listener<RespT>,
            headers: Metadata
        ) {
            // Attach logging metadata to the outgoing headers before delegating to start().
            headers.merge(LoggingMetadataBuilder.getMetadata())
            super.start(responseListener, headers)
        }
    }
}
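As an aside, for a fixed set of headers the stock helper in grpc-stub does the same attach-on-start; the difference is that it captures the Metadata once, while the interceptor above re-reads LoggingMetadataBuilder.getMetadata() on every call:

import io.grpc.stub.MetadataUtils

// Equivalent for static headers, evaluated once rather than per call.
val headerInterceptor = MetadataUtils.newAttachHeadersInterceptor(LoggingMetadataBuilder.getMetadata())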

@dapengzhang0 (Member) commented Jun 24, 2021

@jchau207 Thanks for sharing the code. So there is no evidence that Listener.closed() is called more than once in this case. And even if it were, I still cannot see how setCompressor() could end up being called after start(). Hmm...

@dapengzhang0 added this to the Next milestone Jun 24, 2021
@ejona86 removed this from the Next milestone Jul 20, 2021
@jchau207 (Author) commented Sep 8, 2021

Hi @dapengzhang0, I saw there were some updates and fixes for the retry mechanism in 1.40. I gave that update a try, and we are still seeing a similar error:

java.lang.IllegalStateException: already started
        at com.google.common.base.Preconditions.checkState(Preconditions.java:511)
        at io.grpc.internal.DelayedStream.start(DelayedStream.java:224)
        at io.grpc.internal.RetriableStream$StartEntry.runWith(RetriableStream.java:346)
        at io.grpc.internal.RetriableStream.drain(RetriableStream.java:310)
        at io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
        at io.grpc.internal.RetriableStream$Sublistener$3.run(RetriableStream.java:885)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
        at java.lang.Thread.run(Thread.java:923)

I have attached a dump of the other threads; hopefully that helps track down the problem:

thread_pool_dmp.txt

@dapengzhang0 (Member)

@jchau207 Unfortunately I could not reproduce the issue using the code in the description above. I would really appreciate it if you could provide a minimal, complete, and verifiable example, such as a small project on GitHub.

@dapengzhang0 (Member)

No response with more information, so closing. If more information is provided, we can reopen.

github-actions bot locked as resolved and limited conversation to collaborators on Dec 15, 2021