Dataproc: RejectedExecutionException: event executor terminated #5810
Comments
The RejectedExecutionException thrown during DelayedClientTransport.reprocess looked like it was caused by the executor (provided by @xCASx, sorry, I had missed/forgotten about the eventfd_write. That seems likely to be netty/netty#9362, but we've not actually seen any reports of that failure by users so I'd expect it to be very rare. I would have expected that to be gracefully recovered... but looking at the code it does seem it could hang future RPCs because there is a lack of try-finally for
Maybe it's not as rare as you think. We ran into exactly the same problem. We are using the Google pubsub library, but I guess it uses the same infrastructure underneath.
The gRPC bug was fixed just a moment ago: grpc/grpc-java#6002
Note that the gRPC fix does not fix the underlying problem. But it will allow the client to function after this happens, instead of hanging.
So here's what's happening here. When a This is a gax-java bug.
@chingor13, should this be closed as googleapis/gax-java#787 is merged?
I wonder if I'm hitting this issue or a similar one? Using GRPC-java 1.26.0, though I tried downgrading to 1.25.0 since that seems to match other dependency versions in I'm attempting to use com.google.cloud:google-cloud-texttospeech:0.117.1-beta in a GRPC server. The server has worked well so far, but now I have a server method attempting to use a GRPC client, at which point things start breaking. One interesting bit I'm noticing in contrast with the above stacktraces is the line:
Specifically, The stacktrace:
Environment details
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (Zulu 8.40.0.25-CA-linux64) (build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (Zulu 8.40.0.25-CA-linux64) (build 25.222-b10, mixed mode)
Steps to reproduce
Code example
Stack trace
External references such as API reference guides used
Any additional information below
After migrating from SDK v.0.79.0-alpha to v.0.99.0-alpha, we are observing an intermittent bug. After some time, anywhere from a few hours to a few days, the exception
io.grpc.netty.shaded.io.netty.channel.ChannelException: eventfd_write() failed: Bad file descriptor
occurs, and all subsequent requests fail with
java.util.concurrent.RejectedExecutionException: event executor terminated
making the whole application instance effectively useless. The stack traces are repetitive (probably due to internal retry logic). We also took a thread dump from one of the unhealthy instances: https://pastebin.com/ugx5GiXP
We didn't see anything similar with v.0.79.0-alpha during the roughly half a year it was in use.
We do not use any tricky configuration for client instantiation. The code snippet above is literally how we do it in our application. The pattern is actually borrowed from this javadoc.
After a talk with @ejona86, it seems the source of the issue may be a prematurely closed ScheduledExecutorService, but this requires further investigation.
P.S. A somewhat orthogonal question about best practices for using clients. According to this comment, clients are thread safe. We didn't see this reference during the initial implementation and simply stuck to the example from the javadocs. Now we spin up a new ClusterControllerClient instance on every request and dispose of it right after (using try-with-resources). Is this approach justified in any way, or is it better to share a client between requests and reuse it for as long as possible?
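To make the two lifecycle patterns and the suspected failure mode concrete, here is a minimal, JDK-only sketch. It does not use the real ClusterControllerClient: FakeClient is a hypothetical stand-in, and the assumption that closing a client shuts down its ScheduledExecutorService exists only for illustration. The JDK executor also reports a different message than Netty's "event executor terminated", so this only shows the general mechanism of work being rejected by an executor that has already been shut down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ClientLifecycleSketch {

    // Hypothetical stand-in for a generated API client; NOT the real ClusterControllerClient.
    static final class FakeClient implements AutoCloseable {
        private final ScheduledExecutorService executor;

        FakeClient(ScheduledExecutorService executor) {
            this.executor = executor;
        }

        // Pretend RPC: schedules work on the executor and blocks until it completes.
        String call(String request) {
            try {
                return executor.schedule(() -> "ok:" + request, 0, TimeUnit.MILLISECONDS).get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }

        @Override
        public void close() {
            // Assumption for this sketch only: closing the client shuts down its executor.
            executor.shutdown();
        }
    }

    // Pattern 1: a new client per request, disposed via try-with-resources.
    static String perRequest(String request) {
        try (FakeClient client = new FakeClient(Executors.newSingleThreadScheduledExecutor())) {
            return client.call(request);
        }
    }

    // Pattern 2: one client reused for several requests, closed once at the end.
    static List<String> shared(List<String> requests) {
        List<String> results = new ArrayList<>();
        try (FakeClient client = new FakeClient(Executors.newSingleThreadScheduledExecutor())) {
            for (String request : requests) {
                results.add(client.call(request));
            }
        }
        return results;
    }

    // Failure mode: work submitted after the executor was shut down is rejected.
    static boolean rejectsAfterClose() {
        FakeClient client = new FakeClient(Executors.newSingleThreadScheduledExecutor());
        client.close();
        try {
            client.call("late");
            return false;
        } catch (RejectedExecutionException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(perRequest("a"));
        System.out.println(shared(List.of("a", "b")));
        System.out.println("rejected after close: " + rejectsAfterClose());
    }
}
```

In this sketch both patterns work as long as nothing touches the executor after close(); the rejection only appears when a call races with, or follows, the shutdown. Whether that matches what happens inside the real client is exactly the part that still needs investigation.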