gRPC Java service stops forwarding requests to handler and instead automatically cancels the request #11112

Closed
SoftMemes opened this issue Apr 18, 2024 · 6 comments

Comments

@SoftMemes

What version of gRPC-Java are you using?

1.62.2

What is your environment?

Observed on a deployment running eclipse-temurin:21 as the base image, deployed on GKE.

What did you expect to see?

I have a gRPC bidirectional streaming service that keeps the connection open for an extended period of time (10 minutes). This works as intended, but after the server has been running for a while, it enters a state where I stop receiving inbound calls.

What did you see instead?

I have an interceptor that logs all requests and I observe a call to interceptCall, but I never see the log output from my actual method handler. Instead, a while later, I observe a call to the onCancel() callback of a call listener attached to the call.

I have other gRPC services (such as a health check) that do continue to operate correctly.

Steps to reproduce the bug

Unfortunately I do not have a clean repro, but I would appreciate any suggestions as to how to troubleshoot this further. I've only seen this in production, and only after an instance has been running for an extended period of time. Once this "zombie" state is entered, the service does not recover until forcefully restarted.

@ejona86 (Member) commented Apr 19, 2024

> I have an interceptor that logs all requests and I observe a call to interceptCall, but I never see the log output from my actual method handler.

For bidi, if the interceptor sees the RPC but not the service handler, that means an interceptor is preventing it from getting to the handler. gRPC isn't really involved once the interceptors start running; there's some small amount of stub code, but it is just an adapter.

For unary and server-streaming, the service handler is delayed until it gets the single message and half close. So that can explain certain cases of an interceptor seeing something and the handler not.
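
To make the distinction concrete, here is a minimal sketch (plain grpc-java rather than grpc-kotlin; the class and logger names are illustrative) of a logging interceptor that also wraps the call listener, so you can see which lifecycle events actually arrive before the handler would run:

```java
import io.grpc.ForwardingServerCallListener.SimpleForwardingServerCallListener;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import java.util.logging.Logger;

// Illustrative logging interceptor: logs when the call reaches the interceptor
// chain and when the listener sees message/half-close/cancel events. This helps
// distinguish "the call never left the interceptor chain" from "the handler was
// waiting on a message or half-close that never arrived".
public final class CallLifecycleLoggingInterceptor implements ServerInterceptor {
  private static final Logger log = Logger.getLogger("rpc-lifecycle");

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
    String method = call.getMethodDescriptor().getFullMethodName();
    log.info("interceptCall: " + method);
    ServerCall.Listener<ReqT> delegate = next.startCall(call, headers);
    return new SimpleForwardingServerCallListener<ReqT>(delegate) {
      @Override
      public void onMessage(ReqT message) {
        log.info("onMessage: " + method);
        super.onMessage(message);
      }

      @Override
      public void onHalfClose() {
        log.info("onHalfClose: " + method);
        super.onHalfClose();
      }

      @Override
      public void onCancel() {
        log.info("onCancel: " + method);
        super.onCancel();
      }
    };
  }
}
```

Seeing which of these fire, and when, narrows down where the RPC is stalling relative to the interceptor chain and the handler.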

> I have a gRPC bidirectional streaming service that keeps the connection open for an extended period of time (10 minutes).

You probably want keepalive enabled.
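
As a reference, a minimal sketch of enabling keepalive on both sides; the host, port, and interval values below are illustrative, not recommendations:

```java
import io.grpc.BindableService;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

public final class KeepaliveSketch {
  // Client side: ping periodically so a dead connection is noticed even while
  // a long-lived bidi stream is otherwise idle.
  public static ManagedChannel newChannel(String host, int port) {
    return ManagedChannelBuilder.forAddress(host, port)
        .keepAliveTime(30, TimeUnit.SECONDS)     // send a keepalive ping after 30s of inactivity
        .keepAliveTimeout(10, TimeUnit.SECONDS)  // treat the connection as dead if no ack within 10s
        .keepAliveWithoutCalls(true)             // keep pinging even with no active RPC
        .build();
  }

  // Server side: send its own pings and permit client pings at the rate above.
  public static Server newServer(int port, BindableService service) {
    return NettyServerBuilder.forPort(port)
        .keepAliveTime(60, TimeUnit.SECONDS)
        .keepAliveTimeout(20, TimeUnit.SECONDS)
        .permitKeepAliveTime(20, TimeUnit.SECONDS)  // allow client pings as often as every 20s
        .permitKeepAliveWithoutCalls(true)
        .addService(service)
        .build();
  }
}
```

Note that the server will close connections from clients that ping more aggressively than permitKeepAliveTime allows, so the client and server settings need to be coordinated.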

@SoftMemes (Author)

> For bidi, if the interceptor sees the RPC but not the service handler, that means an interceptor is preventing it from getting to the handler. gRPC isn't really involved once the interceptors start running; there's some small amount of stub code, but it is just an adapter.

Specifically, I am using grpc-kotlin with stubs generated from my protos based on AbstractCoroutineServerImpl. This may be one for grpc-kotlin, but is it possible that some resource starvation would lead to my handler never being scheduled?

> You probably want keepalive enabled.

Keepalive is on, thank you!

@ejona86 (Member) commented Apr 23, 2024

I can't speak to grpc-kotlin. I don't know how they handle the coroutines. I'd be surprised if it was all that different from normal Java for the initial call, though.

@sergiitk added the "Waiting on reporter" label (there was a request for more information without a response, or an answer or advice has been provided) Apr 25, 2024
@ejona86 (Member) commented May 2, 2024

Seems like we've answered all that we can. It seems best to ask grpc-kotlin folks if they know anything that could be impacting you. If it turns out we might be able to help you more, comment, and the issue can be reopened.

@ejona86 closed this as completed May 2, 2024
@SoftMemes (Author)

For anyone who ends up here with a similar problem, we did hunt this down to a resource leak in the end.

Running on a 2-CPU VM, we ended up with two threads busy-looping and allocating, which also caused the GC to run constantly. With that, it appears the thread pools were starved and requests never reached the handler before the client gave up and cancelled.
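
For anyone debugging something similar: a thread dump or a profiler is the usual way to find the busy threads, but a quick in-process check is also possible with the standard ThreadMXBean. The sketch below is illustrative, with an arbitrary sampling window and threshold:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative: sample per-thread CPU time twice and report threads that burned
// most of the interval on CPU; busy-looping threads show up at the top.
public final class BusyThreadSampler {
  public static void main(String[] args) throws InterruptedException {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    long[] ids = threads.getAllThreadIds();

    long[] before = new long[ids.length];
    for (int i = 0; i < ids.length; i++) {
      before[i] = threads.getThreadCpuTime(ids[i]); // -1 if unavailable
    }

    Thread.sleep(5_000); // sampling window

    for (int i = 0; i < ids.length; i++) {
      long after = threads.getThreadCpuTime(ids[i]);
      if (before[i] < 0 || after < 0) {
        continue;
      }
      long cpuMillis = (after - before[i]) / 1_000_000;
      if (cpuMillis > 2_500) { // spent more than half the window on CPU
        ThreadInfo info = threads.getThreadInfo(ids[i]);
        String name = (info != null) ? info.getThreadName() : String.valueOf(ids[i]);
        System.out.println("busy thread: " + name + " used " + cpuMillis + "ms CPU in 5s");
      }
    }
  }
}
```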

@ejona86 (Member) commented May 3, 2024

Ah, makes sense. Yes, if the serverBuilder.executor() or channelBuilder.executor() thread pools are exhausted, RPC events would be delayed.
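
For reference, a minimal sketch of giving the server a dedicated, bounded executor (the pool size and service are illustrative). This doesn't prevent starvation if application threads hog the CPU, but it gives you a pool you can size and monitor instead of the default shared cached pool:

```java
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.ServerBuilder;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative: hand gRPC a dedicated executor so RPC callbacks run on a pool
// under application control. Blocking or CPU-heavy work should still be kept
// off these threads (or moved to its own pool).
public final class ServerWithDedicatedExecutor {
  public static Server start(int port, BindableService service) throws IOException {
    ExecutorService grpcExecutor = Executors.newFixedThreadPool(16);
    return ServerBuilder.forPort(port)
        .executor(grpcExecutor)
        .addService(service)
        .build()
        .start();
  }
}
```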

@ejona86 removed the "Waiting on reporter" label May 3, 2024