Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPA not closing gracefully #6676

Open
bdumpp opened this issue Apr 4, 2024 · 6 comments
Open

OPA not closing gracefully #6676

bdumpp opened this issue Apr 4, 2024 · 6 comments
Labels

Comments

@bdumpp
Copy link

bdumpp commented Apr 4, 2024

We are experiencing OPA pods logging errors during shutdown, even when it should be a graceful shutdown because a new pod is being deployed in an old pod's place.
The error message we consistenly get when restarting pods is

{"level":"error","msg":"Status update failed: Post \"[http://slp-name/v1/status\](http://slp/v1/status/)": context canceled.","plugin":"status","time":"2024-04-04T11:38:45+02:00"}

We have noticed this in versions 0.60.0 and 0.62.1.

@bdumpp bdumpp added the bug label Apr 4, 2024
@ashutosh-narkar
Copy link
Member

When a graceful shutdown time is provided, it's taken into account for stopping both the server and plugins. This error is coming from the status plugin while it must be uploading a status update.

Copy link

stale bot commented May 5, 2024

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use-case this issue attempts to address, the value provided by completing it or possible solutions to resolve it would help to prioritize the issue.

@stale stale bot added the inactive label May 5, 2024
@bdjgs
Copy link
Contributor

bdjgs commented Jun 4, 2024

When running without setting the --shutdown-wait-period flag, the OPA will log the following when being shutdown.

{"level":"info","msg":"Status update sent successfully in response to decision log update.","plugin":"status","time":"2024-06-04T11:56:58Z"}
{"client_addr":"100.78.24.1:53480","level":"info","msg":"Received request.","req_id":9,"req_method":"GET","req_path":"/health","time":"2024-06-04T11:56:59Z"}
{"client_addr":"100.78.24.1:53480","level":"info","msg":"Sent response.","req_id":9,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.431618,"resp_status":200,"time":"2024-06-04T11:56:59Z"}
{"client_addr":"100.78.24.1:53496","level":"info","msg":"Received request.","req_id":10,"req_method":"GET","req_path":"/health","time":"2024-06-04T11:57:01Z"}
{"client_addr":"100.78.24.1:53496","level":"info","msg":"Sent response.","req_id":10,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.348637,"resp_status":200,"time":"2024-06-04T11:57:01Z"}
{"level":"info","msg":"Shutting down...","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Server shutdown.","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Stopping bundle loader.","name":"systems/b9ecfbbd532a470fb1fdb3fefab1f5f7","plugin":"bundle","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Stopping decision logger.","plugin":"decision_logs","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Flushing decision logs.","plugin":"decision_logs","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"All decisions in buffer uploaded.","plugin":"decision_logs","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Stopping status reporter.","plugin":"status","time":"2024-06-04T11:57:03Z"}
{"level":"error","msg":"Status update failed: Post \"http://styrasystem-slp/v1/status\": context canceled.","plugin":"status","time":"2024-06-04T11:57:03Z"}
{"level":"error","msg":"Status update failed: Post \"http://styrasystem-slp/v1/status\": context canceled.","plugin":"status","time":"2024-06-04T11:57:03Z"}

Setting the flag to e.g. 10, it will log the same as above, but with the extra Waiting 10s before initiating shutdown... line. This indicates that 10 seconds is not enough time to shutdown.

However, If I set the --shutdown-wait-period to something high like 50 seconds, the OPA will not log anything related to shutting down, but will just stop running:

{"client_addr":"100.78.24.1:59184","level":"info","msg":"Received request.","req_id":36,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:09Z"}
{"client_addr":"100.78.24.1:59184","level":"info","msg":"Sent response.","req_id":36,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.389676,"resp_status":200,"time":"2024-06-04T12:28:09Z"}
{"level":"info","msg":"Status update sent successfully in response to bundle update.","plugin":"status","time":"2024-06-04T12:28:09Z"}
{"level":"info","msg":"Waiting 50s before initiating shutdown...","time":"2024-06-04T12:28:10Z"}
{"client_addr":"100.78.24.1:59188","level":"info","msg":"Received request.","req_id":37,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:11Z"}
{"client_addr":"100.78.24.1:59188","level":"info","msg":"Sent response.","req_id":37,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.787576,"resp_status":200,"time":"2024-06-04T12:28:11Z"}
{"level":"info","msg":"Status update sent successfully in response to decision log update.","plugin":"status","time":"2024-06-04T12:28:12Z"}
{"client_addr":"100.78.24.1:59196","level":"info","msg":"Received request.","req_id":38,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:13Z"}
....
{"client_addr":"100.78.24.1:57560","level":"info","msg":"Received request.","req_id":48,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:33Z"}
{"client_addr":"100.78.24.1:57560","level":"info","msg":"Sent response.","req_id":48,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.378612,"resp_status":200,"time":"2024-06-04T12:28:33Z"}
{"level":"info","msg":"Status update sent successfully in response to bundle update.","plugin":"status","time":"2024-06-04T12:28:34Z"}
{"client_addr":"100.78.24.1:39568","level":"info","msg":"Received request.","req_id":49,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:35Z"}
{"client_addr":"100.78.24.1:39568","level":"info","msg":"Sent response.","req_id":49,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.413382,"resp_status":200,"time":"2024-06-04T12:28:35Z"}
{"client_addr":"100.78.24.1:39570","level":"info","msg":"Received request.","req_id":50,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:37Z"}
{"client_addr":"100.78.24.1:39570","level":"info","msg":"Sent response.","req_id":50,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":1.012793,"resp_status":200,"time":"2024-06-04T12:28:37Z"}
{"client_addr":"100.78.24.1:39580","level":"info","msg":"Received request.","req_id":51,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:39Z"}
{"client_addr":"100.78.24.1:39580","level":"info","msg":"Sent response.","req_id":51,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.483066,"resp_status":200,"time":"2024-06-04T12:28:39Z"}
{"level":"info","msg":"Status update sent successfully in response to decision log update.","plugin":"status","time":"2024-06-04T12:28:39Z"}

@stale stale bot removed the inactive label Jun 4, 2024
@ashutosh-narkar
Copy link
Member

Do you set any value for --shutdown-grace-period?

@bdjgs
Copy link
Contributor

bdjgs commented Jun 6, 2024

No, I couldn't really see any difference when setting the flag. For example, if I set it to 50 and deleted the OPA pod, it would just close down as if the flag wasn't set.

@ashutosh-narkar
Copy link
Member

The plugin manager will create a new context with the timeout value set to the Graceful Shutdown Period (--shutdown-grace-period). A quick look at the code in the status plugin and it looks like it may be not be taking that timeout into account iiuc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants