
Graceful shutdown may hang indefinitely if node crashed #18329

Closed · gafeol opened this issue Jul 18, 2023 · 18 comments · Fixed by #21965
Labels: bug (Something isn't working)

Comments

gafeol commented Jul 18, 2023

I'm using Trino 406, configured with 30s as the shutdown.grace-period.

One of our nodes crashed with a "java.lang.OutOfMemoryError: Java heap space" error.
We restarted it and continued using this worker normally on the cluster until, a couple hours later, we decided to terminate it.
We sent a graceful shutdown request to that node and confirmed it had switched into the SHUTTING_DOWN state, but after 7 hours it never actually terminated the Trino process, so we were unable to terminate it gracefully.

From the worker I was able to confirm that it was still in the SHUTTING_DOWN state and that it was not running any active queries:

$ curl localhost:8080/v1/info/state
"SHUTTING_DOWN"

$ curl localhost:8080/v1/status
{"nodeId":"presto320","nodeVersion":{"version":"406"},"environment":"production","coordinator":false,"uptime":"10.55h","externalAddress":"x","internalAddress":"x","memoryInfo":{"availableProcessors":64,"pool":{"maxBytes":345744867328,"reservedBytes":0,"reservedRevocableBytes":0,"queryMemoryReservations":{},"queryMemoryAllocations":{},"queryMemoryRevocableReservations":{},"taskMemoryReservations":{},"taskMemoryRevocableReservations":{},"freeBytes":345744867328}},"processors":64,"processCpuLoad":0.011531810534497226,"systemCpuLoad":0.011366032123771604,"heapUsed":243704836488,"heapAvailable":493921239040,"nonHeapUsed":755459576}

However, in this worker's HTTP request logs I could see a single kind of request being repeated many times, non-stop:

2023-07-18T14:18:46.145Z        <coordinator IP>    GET     /v1/task/20230718_032419_12112_it5df.57.2.0/dynamicfilters      null    null    200     0       40      1       e0f265b13b4e4ed58fd575b060a75b4a0006d3f42d      HTTP/1.1        0       0       0       null
2023-07-18T14:18:46.146Z        <coordinator IP>    GET     /v1/task/20230718_032419_12112_it5df.57.2.0/dynamicfilters      null    null    200     0       40      0       e0f265b13b4e4ed58fd575b060a75b4a0006d3f42e      HTTP/1.1        0       0       0       null
2023-07-18T14:18:46.146Z        <coordinator IP>    GET     /v1/task/20230718_032419_12112_it5df.57.2.0/dynamicfilters      null    null    200     0       40      0       e0f265b13b4e4ed58fd575b060a75b4a0006d3f42f      HTTP/1.1        0       0       0       null

The path contains the query id of a query that failed when this node crashed with the OOM error, and these requests seem to be blocking the node's shutdown, even though the query is not running anymore.
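
If I understand graceful shutdown correctly, the worker only exits once it has no remaining tasks, so any task that stays registered (even an idle, phantom one) keeps the shutdown waiting forever. Below is a minimal sketch of that wait pattern, using a made-up TaskTracker interface rather than Trino's actual shutdown handler:

import java.time.Duration;

// Hypothetical stand-in for the worker-side task registry; not a real Trino class.
interface TaskTracker
{
    int activeTaskCount();
}

final class GracefulShutdownSketch
{
    // Simplified wait loop: shutdown only proceeds once no tasks remain.
    // A task that keeps being (re)created by incoming coordinator requests
    // never lets this loop exit, which would match the hang described above.
    static void awaitNoTasks(TaskTracker tracker, Duration pollInterval)
            throws InterruptedException
    {
        while (tracker.activeTaskCount() > 0) {
            Thread.sleep(pollInterval.toMillis());
        }
        // only at this point would the worker stop its HTTP server and exit the JVM
    }
}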

hashhar commented Jul 19, 2023

cc: @raunaqmorarka

@raunaqmorarka

cc: @sopel39

sopel39 commented Jul 21, 2023

Would it be possible to call /v1/task on a worker that got stuck with SHUTTING_DOWN?

We sent a graceful shutdown request to that node and confirmed it had switched into the SHUTTING_DOWN state, but after 7 hours it never actually terminated the Trino process, so we were unable to terminate it gracefully.

Is it the instance of the worker that failed with java.lang.OutOfMemoryError: Java heap space or a new worker?

It could be that the query is still active on the coordinator. Since SqlTask sqlTask = tasks.getUnchecked(taskId); will create a new task on the worker, there might be a race between the query being pruned from the coordinator and the new worker starting up. Once the new worker starts, the coordinator might ping it for the failed task's status and thus implicitly create a new task on that worker. @dain @electrum we might need to attach a worker id to task requests to prevent spamming new tasks onto a new worker when they were meant to go to the old worker.
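
To illustrate the pattern (a rough sketch with made-up TaskId and Task types, not the actual SqlTaskManager code): with a loading cache keyed by task id, any lookup for an unknown id silently creates a fresh task, so a single status or dynamicfilters request from the coordinator is enough to resurrect a task on a freshly restarted worker.

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

final class TaskRegistrySketch
{
    // Placeholder types for the sketch only
    record TaskId(String id) {}
    static final class Task {}

    // Lookups create the task if it does not exist, mirroring the
    // "SqlTask sqlTask = tasks.getUnchecked(taskId);" pattern mentioned above.
    private static final CacheLoader<TaskId, Task> LOADER = CacheLoader.from(taskId -> new Task());

    private final LoadingCache<TaskId, Task> tasks = CacheBuilder.newBuilder().build(LOADER);

    Task getOrCreateTask(TaskId taskId)
    {
        // A GET /v1/task/{taskId}/... from the coordinator ends up in a lookup
        // like this, so a restarted worker re-creates tasks it never actually ran.
        return tasks.getUnchecked(taskId);
    }
}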

As a workaround I think it's enough to have some cooldown period between old worker being killed and starting new worker.

gafeol commented Jul 21, 2023

Would it be possible to call /v1/task on a worker that got stuck with SHUTTING_DOWN?

Will share the output of it if I'm able to reproduce the issue.

We sent a graceful shutdown request to that node and confirmed it had switched into the SHUTTING_DOWN state, but after 7 hours it never actually terminated the Trino process, so we were unable to terminate it gracefully.

Is it the instance of the worker that failed with java.lang.OutOfMemoryError: Java heap space or a new worker?

It is the same worker that failed with OOM that got stuck waiting for that failed query to finish. Soon after it crashed, the worker was restarted and continued running new queries. Hours later, when we tried to gracefully shut it down, we found it stuck in this state.

gafeol closed this as completed Nov 13, 2023
gafeol reopened this Nov 13, 2023

gafeol commented Nov 13, 2023

Saw the same issue again, now on Trino 422: one worker crashed due to OOM and later never left the SHUTTING_DOWN state.

> curl localhost:8080/v1/status
{"nodeId":"presto180","nodeVersion":{"version":"422"},"environment":"production","coordinator":false,"uptime":"2.18d","externalAddress":"x","internalAddress":"x","memoryInfo":{"availableProcessors":64,"pool":{"maxBytes":327491256320,"reservedBytes":0,"reservedRevocableBytes":0,"queryMemoryReservations":{},"queryMemoryAllocations":{},"queryMemoryRevocableReservations":{},"taskMemoryReservations":{},"taskMemoryRevocableReservations":{},"freeBytes":327491256320}},"processors":64,"processCpuLoad":0.0,"systemCpuLoad":0.058823529411764705,"heapUsed":1350440200,"heapAvailable":467077693440,"nonHeapUsed":539366608}

In this worker's HTTP request logs I'm seeing a request being made to a task from a query from 2 days ago, which failed when the node crashed:

2023-11-13T17:22:13.229Z        <coord IP>      GET     /v1/task/20231111_122121_17965_22q2q.41.10.1?summarize  null    null    200     0       2038    1       b664770797e14d169505a31ad319e275001f7f79f0      HTTP/1.1        0       0       0       null

I only got 403 Forbidden when trying to access data from /v1/task as suggested. I tried curl from the stuck worker, curl from the coordinator, and the UI at /ui/api/worker/presto180/task/20231111_122121_17965_22q2q.41.10.1.

sopel39 commented Nov 14, 2023

@gafeol

As a workaround I think it's enough to have some cooldown period between old worker being killed and starting new worker.

Did you implement a cooldown period?

gafeol commented Nov 16, 2023

As a workaround I think it's enough to have some cooldown period between old worker being killed and starting new worker.

Sorry, I don't understand what the cooldown period would mean. Do you mean implementing a longer wait time before restarting a node that crashed?
Right now we have a monitoring service that restarts Trino within at most 5 minutes of a crash.

Just to clarify, the "stuck graceful shutdown" issue happened on the old worker after it was restarted from the OOM crash. There was no issue with new workers started before or after the crash.

sopel39 commented Nov 16, 2023

Sorry, I don't understand what the cooldown period would mean. Do you mean implementing a longer wait time before restarting a node that crashed?
Right now we have a monitoring service that restarts Trino within at most 5 minutes of a crash.

Yes, something like waiting 1 min before restarting the worker. Is it k8s? Does the new worker get the same IP?

gafeol commented Nov 17, 2023

I see, so the wait time would help guarantee that the coordinator purges the tasks from the failed query before the worker is revived, is that right?
It's not k8s, our workers run on plain EC2 instances. The instance itself didn't change; it's just the Trino process that crashed and restarted, so the IP of the revived worker doesn't change.

sopel39 commented Nov 20, 2023

@gafeol do you use FTE? Do you see other requests or errors apart from

/v1/task/20230718_032419_12112_it5df.57.2.0/dynamicfilters  

Technically, the dynamicfilters request shouldn't happen continuously. I wonder if this isn't somehow related to b3038d9? cc @raunaqmorarka

sopel39 commented Nov 20, 2023

What is your task.info.max-age?

sopel39 commented Nov 20, 2023

I see, so the wait time would help guarantee that the coordinator purges the tasks from the failed query before the worker is revived, is that right?

Yes, if you could try that and it helps, then it would give us some hints. I'm marking this issue as a bug since dynamicfilters shouldn't be pinged continuously.

sopel39 added the bug label Nov 20, 2023

gafeol commented Nov 21, 2023

@gafeol do you use FTE? Do you see other requests or errors apart from

To clarify, I believe we were not using FTE when first reporting this incident. That is, the dynamicfilters requests came from a non-FTE cluster.

But the latest report I shared (#18329 (comment)) did happen on an FTE cluster.

What is your task.info.max-age?

We're using the default of 15min

Yes, if you could try that and it helps then it would give us some hints. I'm marking this issue as bug since dynamicfilters shouldn't be pinged continuously.

Sure, will test this and share any information I get, though this has been quite a rare issue as we're working to avoid any OOM crashes.

sopel39 commented Nov 22, 2023

But the latest report I shared (#18329 (comment)) did happen on an FTE cluster.

cc @losipiuk for FTE graceful shutdown issue

sopel39 commented May 13, 2024

@gafeol do you still see the issue happening?

sopel39 commented May 13, 2024

I think I know what's happening:

  1. the coordinator receives some information about a new version of dynamic filters (DF) on the worker
  2. the worker fails with OOM
  3. the worker is restarted
  4. DynamicFiltersCollector keeps pinging the new worker for DF on the dead task, keeping that task neither done nor abandoned (see the sketch below)
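
A rough sketch of step 4 (simplified long polling, not the real DynamicFiltersFetcher/DynamicFiltersCollector code): the coordinator keeps asking the worker for a dynamic-filter version newer than the one it last saw, and because the restarted worker implicitly re-creates the task and never produces a final version, the polling, and the phantom task it keeps alive, never ends.

import java.util.OptionalLong;

// Hypothetical client view of the worker's /v1/task/{taskId}/dynamicfilters endpoint.
interface DynamicFilterClient
{
    // Returns the current DF version, or empty once the task has reached a final state.
    OptionalLong fetchVersionNewerThan(long knownVersion) throws InterruptedException;
}

final class DynamicFilterPollingSketch
{
    // Simplified polling loop: it only stops once the worker reports the task as done.
    // After an OOM restart the implicitly re-created task never reaches a final state,
    // so this loop runs forever and every request keeps the phantom task alive.
    static void pollUntilDone(DynamicFilterClient client) throws InterruptedException
    {
        long knownVersion = 0;
        while (true) {
            OptionalLong version = client.fetchVersionNewerThan(knownVersion);
            if (version.isEmpty()) {
                return; // task finished; stop polling
            }
            knownVersion = version.getAsLong();
        }
    }
}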

findepi commented May 13, 2024

4. DynamicFiltersCollector keeps pinging the new worker for DF on the dead task, keeping that task neither done nor abandoned

#21744 could solve this, but this is currently on hold

sopel39 commented May 13, 2024

DynamicFiltersFetcher needs to be stopped as part of HttpRemoteTask cleanup to solve this.
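
As a sketch of what that could look like (hypothetical names, not the actual HttpRemoteTask or DynamicFiltersFetcher API): the remote task keeps a handle to its fetcher's background polling and cancels it whenever the task is cleaned up, so no dynamic-filter request can outlive the task and resurrect it on a restarted worker.

import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: tie the dynamic-filter polling's lifecycle to the remote task.
final class RemoteTaskCleanupSketch
{
    private final Future<?> dynamicFilterPolling; // the fetcher's background request loop
    private final AtomicBoolean cleanedUp = new AtomicBoolean();

    RemoteTaskCleanupSketch(Future<?> dynamicFilterPolling)
    {
        this.dynamicFilterPolling = dynamicFilterPolling;
    }

    // Called when the coordinator finishes or abandons the task; after this,
    // no further /dynamicfilters requests should be sent for it.
    void cleanUp()
    {
        if (cleanedUp.compareAndSet(false, true)) {
            dynamicFilterPolling.cancel(true);
        }
    }
}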
