Graceful shutdown may hang indefinitely if node crashed #18329
cc: @raunaqmorarka
cc: @sopel39
Would it be possible to call
Is it the instance of the worker that failed with OOM? It could be that the query is still active on the coordinator. As a workaround I think it's enough to have some cooldown period between the old worker being killed and a new worker being started.
Will share the output of it if I'm able to reproduce the issue.
It is the same worker that failed with OOM that got stuck waiting for that failed query to finish. Soon after it crashed, the worker was restarted and continued running new queries. Hours later, when we tried to gracefully shut it down, we found it stuck in this state.
Saw the same issue again, now on Trino 422: one worker crashed due to OOM and later never left the SHUTTING_DOWN state.
In the HTTP request logs of this worker I can see a request being made to a task of a query from 2 days ago, which failed when the node crashed:
I only got 403 Forbidden trying to access data from /v1/task as suggested. Tried both
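For context, the kind of request attempted here would look roughly like the sketch below. The worker address and the user header are placeholders, and on clusters with authentication enabled for these endpoints the call may legitimately be rejected with 403, as reported above.

```python
import requests

WORKER = "http://worker-host:8080"  # placeholder worker address

# Ask the worker for the task info it is still holding on to; /v1/task is the
# worker's task resource. Depending on the cluster's security configuration
# this may require real credentials and can return 403 Forbidden.
resp = requests.get(
    f"{WORKER}/v1/task",
    headers={"X-Trino-User": "admin"},  # placeholder user
    timeout=30,
)
print(resp.status_code)
print(resp.text[:2000])
```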
Did you implement a cooldown period?
Sorry, I don't understand what the cooldown period would mean. Do you mean implementing a longer wait time before restarting a node that crashed? Just to clarify, the "stuck graceful shutdown" issue happened on the old worker after it was restarted from the OOM crash. There was no issue with new workers started before or after the crash.
Yes, something like waiting 1 min before restarting the worker. Is it k8s? Does the new worker get the same IP?
I see, so the wait time would help guarantee that the coordinator purges the tasks from the failed query before the worker is revived, is that right?
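As an illustration of the suggested cooldown, a restart wrapper around the worker process could look roughly like this. This is a sketch only; the launcher path and the 60-second delay are assumptions, not something specified in this thread.

```python
import subprocess
import time

COOLDOWN_SECONDS = 60  # assumed value; "something like waiting 1 min" per the suggestion above
LAUNCHER = "/usr/lib/trino/bin/launcher"  # hypothetical install path

def run_worker_with_cooldown():
    """Run the Trino worker and, if it dies, wait before restarting it so the
    coordinator has time to fail and purge the crashed worker's tasks."""
    while True:
        exit_code = subprocess.call([LAUNCHER, "run"])
        print(f"worker exited with code {exit_code}; sleeping {COOLDOWN_SECONDS}s before restart")
        time.sleep(COOLDOWN_SECONDS)

if __name__ == "__main__":
    run_worker_with_cooldown()
```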
@gafeol do you use FTE? Do you see other requests or errors apart from
technically, |
What is your
Yes, if you could try that and it helps, then it would give us some hints. I'm marking this issue as a bug since
To clarify, I believe we were not using FTE when first reporting this incident. That is, the requests for
But the latest report I shared in #18329 (comment) did happen on an FTE cluster.
We're using the default of 15min
Sure, will test this and share any information I get, though this has been quite a rare issue as we're working to avoid any OOM crashes.
cc @losipiuk for the FTE graceful shutdown issue
@gafeol do you still see the issue happening?
I think I know what's happening:
#21744 could solve this, but this is currently on hold
I'm using Trino 406, configured with 30s as the shutdown.grace-period. One of our nodes crashed with a "java.lang.OutOfMemoryError: Java heap space" error.
We restarted it and continued using this worker normally on the cluster until, a couple of hours later, we decided to terminate it.
We sent a graceful shutdown request to that node and confirmed it had switched into the SHUTTING_DOWN state, but even after 7 hours it never terminated the Trino process, so we were unable to shut it down gracefully.
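The exact command isn't preserved here, but a graceful shutdown request of this kind is a PUT of the state SHUTTING_DOWN to the worker's /v1/info/state endpoint; a minimal sketch, with host, port, and user as placeholders:

```python
import requests

WORKER = "http://worker-host:8080"  # placeholder worker address

# Ask the worker to shut down gracefully by putting it into SHUTTING_DOWN state.
resp = requests.put(
    f"{WORKER}/v1/info/state",
    json="SHUTTING_DOWN",               # body is the JSON string "SHUTTING_DOWN"
    headers={"X-Trino-User": "admin"},  # placeholder user
    timeout=30,
)
resp.raise_for_status()
```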
From the worker I was able to confirm that it was still in the SHUTTING_DOWN state and that no queries were actively running on it:
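The raw output isn't preserved in this report; the state check was along these lines (same placeholder host as above):

```python
import requests

WORKER = "http://worker-host:8080"  # placeholder worker address

# The worker reports its lifecycle state at /v1/info/state
# (e.g. "ACTIVE" or "SHUTTING_DOWN").
state = requests.get(f"{WORKER}/v1/info/state", timeout=30).json()
print(state)
```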
However, in this worker's logs I could see a single kind of HTTP request being repeated non-stop:
The request contains the query ID of a query that failed when this node crashed due to the OOM issue, and it seems to be blocking the node's shutdown even though the query is no longer running.