
Graceful shutdown may hang indefinitely if node crashed #18329

Closed · gafeol opened this issue Jul 18, 2023 · 18 comments · Fixed by #21965
Labels: bug (Something isn't working)

Comments

gafeol commented Jul 18, 2023

I'm using Trino 406, configured with 30s as the shutdown.grace-period.

One of our nodes crashed with a "java.lang.OutOfMemoryError: Java heap space" error.
We restarted it and continued using this worker normally on the cluster until, a couple hours later, we decided to terminate it.
We sent a graceful shutdown request to that node and confirmed it had switched into the SHUTTING_DOWN state, but after 7 hours it never actually terminated the Trino process, so we were unable to terminate it gracefully.

From the worker I was able to confirm that it was still in the SHUTTING_DOWN state and that it was not running any active queries:

$ curl localhost:8080/v1/info/state
"SHUTTING_DOWN"

$ curl localhost:8080/v1/status
{"nodeId":"presto320","nodeVersion":{"version":"406"},"environment":"production","coordinator":false,"uptime":"10.55h","externalAddress":"x","internalAddress":"x","memoryInfo":{"availableProcessors":64,"pool":{"maxBytes":345744867328,"reservedBytes":0,"reservedRevocableBytes":0,"queryMemoryReservations":{},"queryMemoryAllocations":{},"queryMemoryRevocableReservations":{},"taskMemoryReservations":{},"taskMemoryRevocableReservations":{},"freeBytes":345744867328}},"processors":64,"processCpuLoad":0.011531810534497226,"systemCpuLoad":0.011366032123771604,"heapUsed":243704836488,"heapAvailable":493921239040,"nonHeapUsed":755459576}

However, in this worker's HTTP request logs I could see a single kind of request being repeated many times, non-stop:

2023-07-18T14:18:46.145Z        <coordinator IP>    GET     /v1/task/20230718_032419_12112_it5df.57.2.0/dynamicfilters      null    null    200     0       40      1       e0f265b13b4e4ed58fd575b060a75b4a0006d3f42d      HTTP/1.1        0       0       0       null
2023-07-18T14:18:46.146Z        <coordinator IP>    GET     /v1/task/20230718_032419_12112_it5df.57.2.0/dynamicfilters      null    null    200     0       40      0       e0f265b13b4e4ed58fd575b060a75b4a0006d3f42e      HTTP/1.1        0       0       0       null
2023-07-18T14:18:46.146Z        <coordinator IP>    GET     /v1/task/20230718_032419_12112_it5df.57.2.0/dynamicfilters      null    null    200     0       40      0       e0f265b13b4e4ed58fd575b060a75b4a0006d3f42f      HTTP/1.1        0       0       0       null

The path contains the query id of a query that failed when this node crashed with the OOM error, and these requests seem to be blocking the node's shutdown, even though the query is not running anymore.
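
If I understand graceful shutdown correctly, the worker only exits once it has no remaining tasks, so any task that stays registered (even an idle, phantom one) keeps the shutdown waiting forever. Below is a minimal sketch of that wait pattern, using a made-up TaskTracker interface rather than Trino's actual shutdown handler:

import java.time.Duration;

// Hypothetical stand-in for the worker-side task registry; not a real Trino class.
interface TaskTracker
{
    int activeTaskCount();
}

final class GracefulShutdownSketch
{
    // Simplified wait loop: shutdown only proceeds once no tasks remain.
    // A task that keeps being (re)created by incoming coordinator requests
    // never lets this loop exit, which would match the hang described above.
    static void awaitNoTasks(TaskTracker tracker, Duration pollInterval)
            throws InterruptedException
    {
        while (tracker.activeTaskCount() > 0) {
            Thread.sleep(pollInterval.toMillis());
        }
        // only at this point would the worker stop its HTTP server and exit the JVM
    }
}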

hashhar commented Jul 19, 2023

cc: @raunaqmorarka

@raunaqmorarka

cc: @sopel39

sopel39 commented Jul 21, 2023

Would it be possible to call /v1/task on a worker that got stuck with SHUTTING_DOWN?

We sent a graceful shutdown request to that node and confirmed it had switched into the SHUTTING_DOWN state, but after 7 hours it never actually terminated the Trino process, so we were unable to terminate it gracefully.

Is it the instance of the worker that failed with java.lang.OutOfMemoryError: Java heap space or a new worker?

It could be that the query is still active on the coordinator. Since SqlTask sqlTask = tasks.getUnchecked(taskId); will create a new task on the worker, there might be a race between the query being pruned from the coordinator and the new worker starting up. Once the new worker starts, the coordinator might ping it for the failed task's status and thus implicitly create a new task on that worker. @dain @electrum we might need to attach a worker id to task requests to prevent spamming new tasks onto a new worker when they were meant to go to the old worker.
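
To illustrate the pattern (a rough sketch with made-up TaskId and Task types, not the actual SqlTaskManager code): with a loading cache keyed by task id, any lookup for an unknown id silently creates a fresh task, so a single status or dynamicfilters request from the coordinator is enough to resurrect a task on a freshly restarted worker.

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

final class TaskRegistrySketch
{
    // Placeholder types for the sketch only
    record TaskId(String id) {}
    static final class Task {}

    // Lookups create the task if it does not exist, mirroring the
    // "SqlTask sqlTask = tasks.getUnchecked(taskId);" pattern mentioned above.
    private static final CacheLoader<TaskId, Task> LOADER = CacheLoader.from(taskId -> new Task());

    private final LoadingCache<TaskId, Task> tasks = CacheBuilder.newBuilder().build(LOADER);

    Task getOrCreateTask(TaskId taskId)
    {
        // A GET /v1/task/{taskId}/... from the coordinator ends up in a lookup
        // like this, so a restarted worker re-creates tasks it never actually ran.
        return tasks.getUnchecked(taskId);
    }
}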

As a workaround I think it's enough to have some cooldown period between old worker being killed and starting new worker.

gafeol commented Jul 21, 2023

Would it be possible to call /v1/task on a worker that got stuck with SHUTTING_DOWN?

Will share the output of it if I'm able to reproduce the issue.

We sent a graceful shutdown request to that node and confirmed it had switched into the SHUTTING_DOWN state, but after 7 hours it never actually terminated the Trino process, so we were unable to terminate it gracefully.

Is it the instance of the worker that failed with java.lang.OutOfMemoryError: Java heap space or a new worker?

It is the same worker that failed with OOM that got stuck waiting for that failed query to finish. Soon after it crashed, the worker was restarted and continued running new queries. Hours later, when we tried to gracefully shut it down, we found it stuck in this state.

gafeol closed this as completed Nov 13, 2023
gafeol reopened this Nov 13, 2023

gafeol commented Nov 13, 2023

Saw the same issue again, now on Trino 422: one worker crashed due to OOM and later never left the SHUTTING_DOWN state.

> curl localhost:8080/v1/status
{"nodeId":"presto180","nodeVersion":{"version":"422"},"environment":"production","coordinator":false,"uptime":"2.18d","externalAddress":"x","internalAddress":"x","memoryInfo":{"availableProcessors":64,"pool":{"maxBytes":327491256320,"reservedBytes":0,"reservedRevocableBytes":0,"queryMemoryReservations":{},"queryMemoryAllocations":{},"queryMemoryRevocableReservations":{},"taskMemoryReservations":{},"taskMemoryRevocableReservations":{},"freeBytes":327491256320}},"processors":64,"processCpuLoad":0.0,"systemCpuLoad":0.058823529411764705,"heapUsed":1350440200,"heapAvailable":467077693440,"nonHeapUsed":539366608}

In this worker's HTTP request logs I'm seeing a request being made to a task from a query from 2 days ago, which failed when the node crashed:

2023-11-13T17:22:13.229Z        <coord IP>      GET     /v1/task/20231111_122121_17965_22q2q.41.10.1?summarize  null    null    200     0       2038    1       b664770797e14d169505a31ad319e275001f7f79f0      HTTP/1.1        0       0       0       null

I only got 403 Forbidden when trying to access data from /v1/task as suggested. I tried curl from the stuck worker, curl from the coordinator, and the UI at /ui/api/worker/presto180/task/20231111_122121_17965_22q2q.41.10.1.

sopel39 commented Nov 14, 2023

@gafeol

As a workaround I think it's enough to have some cooldown period between old worker being killed and starting new worker.

Did you implement a cooldown period?

gafeol commented Nov 16, 2023

As a workaround I think it's enough to have some cooldown period between old worker being killed and starting new worker.

Sorry, I don't understand what the cooldown period would mean. Do you mean implementing a longer wait time before restarting a node that crashed?
Right now we have a monitoring service that restarts Trino within at most 5 minutes of a crash.

Just to clarify, the "stuck graceful shutdown" issue happened on the old worker after it was restarted from the OOM crash. There was no issue with new workers started before or after the crash.

sopel39 commented Nov 16, 2023

Sorry, I don't understand what the cooldown period would mean. Do you mean implementing a longer wait time before restarting a node that crashed?
Right now we have a monitoring service that restarts Trino within at most 5 minutes of a crash.

Yes, something like waiting 1 min before restarting the worker. Is it k8s? Does the new worker get the same IP?

gafeol commented Nov 17, 2023

I see, so the wait time would help guarantee that the coordinator purges the tasks from the failed query before the worker is revived, is that right?
It's not k8s, our workers run on plain EC2 instances. The instance itself didn't change; it's just the Trino process that crashed and restarted, so the IP of the revived worker doesn't change.

sopel39 commented Nov 20, 2023

@gafeol do you use FTE? Do you see other requests or errors apart from

/v1/task/20230718_032419_12112_it5df.57.2.0/dynamicfilters  

Technically, the dynamicfilters request shouldn't happen continuously. I wonder if this isn't somehow related to b3038d9? cc @raunaqmorarka

sopel39 commented Nov 20, 2023

What is your task.info.max-age?

sopel39 commented Nov 20, 2023

I see, so the wait time would help guarantee that the coordinator purges the tasks from the failed query before the worker is revived, is that right?

Yes, if you could try that and it helps, then it would give us some hints. I'm marking this issue as a bug since dynamicfilters shouldn't be pinged continuously.

sopel39 added the bug label Nov 20, 2023

gafeol commented Nov 21, 2023

@gafeol do you use FTE? Do you see other requests or errors apart from

To clarify, I believe we were not using FTE when first reporting this incident. That is, the dynamicfilters requests came from a non-FTE cluster.

But the latest report I shared (#18329 (comment)) did happen on an FTE cluster.

What is your task.info.max-age?

We're using the default of 15min

Yes, if you could try that and it helps then it would give us some hints. I'm marking this issue as bug since dynamicfilters shouldn't be pinged continuously.

Sure, will test this and share any information I get, though this has been quite a rare issue as we're working to avoid any OOM crashes.

sopel39 commented Nov 22, 2023

But the latest report I shared (#18329 (comment)) did happen on an FTE cluster.

cc @losipiuk for FTE graceful shutdown issue

sopel39 commented May 13, 2024

@gafeol do you still see the issue happening?

sopel39 commented May 13, 2024

I think I know what's happening:

  1. the coordinator receives some information about a new version of dynamic filters (DF) on the worker
  2. the worker fails with OOM
  3. the worker is restarted
  4. DynamicFiltersCollector keeps pinging the new worker for DF on the dead task, keeping that task neither done nor abandoned (see the sketch below)
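
A rough sketch of step 4 (simplified long polling, not the real DynamicFiltersFetcher/DynamicFiltersCollector code): the coordinator keeps asking the worker for a dynamic-filter version newer than the one it last saw, and because the restarted worker implicitly re-creates the task and never produces a final version, the polling, and the phantom task it keeps alive, never ends.

import java.util.OptionalLong;

// Hypothetical client view of the worker's /v1/task/{taskId}/dynamicfilters endpoint.
interface DynamicFilterClient
{
    // Returns the current DF version, or empty once the task has reached a final state.
    OptionalLong fetchVersionNewerThan(long knownVersion) throws InterruptedException;
}

final class DynamicFilterPollingSketch
{
    // Simplified polling loop: it only stops once the worker reports the task as done.
    // After an OOM restart the implicitly re-created task never reaches a final state,
    // so this loop runs forever and every request keeps the phantom task alive.
    static void pollUntilDone(DynamicFilterClient client) throws InterruptedException
    {
        long knownVersion = 0;
        while (true) {
            OptionalLong version = client.fetchVersionNewerThan(knownVersion);
            if (version.isEmpty()) {
                return; // task finished; stop polling
            }
            knownVersion = version.getAsLong();
        }
    }
}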

findepi commented May 13, 2024

4. DynamicFiltersCollector keeps pinging the new worker for DF on the dead task, keeping that task neither done nor abandoned

#21744 could solve this, but this is currently on hold

sopel39 commented May 13, 2024

DynamicFiltersFetcher needs to be stopped as part of HttpRemoteTask cleanup to solve this.
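
As a sketch of what that could look like (hypothetical names, not the actual HttpRemoteTask or DynamicFiltersFetcher API): the remote task keeps a handle to its fetcher's background polling and cancels it whenever the task is cleaned up, so no dynamic-filter request can outlive the task and resurrect it on a restarted worker.

import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: tie the dynamic-filter polling's lifecycle to the remote task.
final class RemoteTaskCleanupSketch
{
    private final Future<?> dynamicFilterPolling; // the fetcher's background request loop
    private final AtomicBoolean cleanedUp = new AtomicBoolean();

    RemoteTaskCleanupSketch(Future<?> dynamicFilterPolling)
    {
        this.dynamicFilterPolling = dynamicFilterPolling;
    }

    // Called when the coordinator finishes or abandons the task; after this,
    // no further /dynamicfilters requests should be sent for it.
    void cleanUp()
    {
        if (cleanedUp.compareAndSet(false, true)) {
            dynamicFilterPolling.cancel(true);
        }
    }
}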
