Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow annotation to specify shutdown pause #4313

Closed
trondhindenes opened this issue Mar 1, 2022 · 19 comments · Fixed by #7268
Closed

allow annotation to specify shutdown pause #4313

trondhindenes opened this issue Mar 1, 2022 · 19 comments · Fixed by #7268

Comments

@trondhindenes
Copy link
Contributor

In what area(s)?

/area runtime

Describe the feature

We're struggling with making dapr lifecycle robust, especially during shutdown. We have some pods that need a few seconds (up to 30) to shutdown, and during those 30 seconds, they require the dapr sidecar to be available so that messages can be sent. However, since these pods are in the "draining" state, we don't want them to receive more traffic

I propose that a new annotation is added: dapr.io/shutdown-wait-secs. If specified, the sidecar is kept alive for the specified number of seconds, but only for outbound traffic (e.g. it will not pick up new messages etc).
This annotation could be used to setup a preStop hook on the injected container and do something like
command: ["/bin/sh", "-c", "/bin/dapr -c prepShutdown && sleep 30 && /bin/dapr -c doShutdown"]
The point is that there needs to be a way to instruct dapr to go into "passive" mode where it will still forward messages to the queue etc, but not receive anything inbound. Without this it's very difficult to build a robust messaging system using dapr.

@trondhindenes
Copy link
Contributor Author

I tried to use the dapr service instead of the sidecar address in the meantime, but at least the health endpoint (curl http://<app-id>-dapr:80/v1.0/healthz) just gives a http 400 back :-(

@dapr-bot
Copy link
Collaborator

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@dapr-bot dapr-bot added the stale Issues and PRs without response label Apr 30, 2022
@dapr-bot
Copy link
Collaborator

dapr-bot commented May 7, 2022

This issue has been automatically closed because it has not had activity in the last 67 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

@dapr-bot dapr-bot closed this as completed May 7, 2022
@yaron2 yaron2 reopened this May 25, 2022
@dapr-bot dapr-bot removed the stale Issues and PRs without response label May 25, 2022
@artursouza
Copy link
Member

Summarizing the approach:

Stops new incoming messages right away for PubSub, Binding and Service Invocation.
Continue allowing outgoing API calls during grace period for all APIs.
Allow this to be configurable via K8s annotation.

@artursouza artursouza added this to the v1.8 milestone Jun 6, 2022
@CrazyHZM
Copy link
Member

CrazyHZM commented Jun 9, 2022

@artursouza
PR #4624 has done something, we still need a k8s annotation.

@artursouza artursouza modified the milestones: v1.8, v1.9 Jun 22, 2022
@CrazyHZM
Copy link
Member

I found dapr.io/graceful-shutdown-seconds to meet this requirement, can you use dapr.io/graceful-shutdown-seconds to try to solve your problem? @trondhindenes

@artursouza artursouza modified the milestones: v1.9, v1.10 Sep 29, 2022
@yaron2
Copy link
Member

yaron2 commented Nov 14, 2022

cc @trondhindenes

@trondhindenes
Copy link
Contributor Author

sorry, I was completely out of the loop here. Yes I believe it does.
There is a situation which I believel we aren't fully covering:

Imagine a pubsub scenario where a messagehandler processes messages, and for each processed message, it needs to send a message to another queue (confirmation message). Each message might take between 10 seconds and 10 minutes to complete (this is real-life btw, we used dapr to respond to uploaded audio files that would vary a lot in file size and thus processing them also varied a lot).

In this case I would set dapr.io/graceful-shutdown-seconds to the "worst-case setting" of 10 minutes to keep dapr running while the messagehandler finishes its work and is able to send the confirmation message. This means (if I'm understanding things correctly) that dapr will keep the pod up for 10 whether it's needed or not.

This is why I believe it's crucial to have the ability for dapr to somehow ask the "main app" if it's safe to stop. Dapr should exit as soon as the app container has exited - it doesn't make sense to have dapr around after the app container has quit. (This is what I experimented with in DAPS (https://github.com/trondhindenes/daps) and it worked quite well. Would ofc be much better to have that functionality in dapr itself)

@yaron2
Copy link
Member

yaron2 commented Nov 18, 2022

@trondhindenes
Copy link
Contributor Author

ah, nice. I guess the combination of graceful-shutdown-seconds and health checks will solve this. I'll do some testing! Thanks!

@akhilac1
Copy link
Contributor

@trondhindenes - Please check the draft PR #5562. It is in draft so that I can test thoroughly

@yaron2 yaron2 modified the milestones: v1.10, v1.11 Feb 1, 2023
@artursouza
Copy link
Member

@akhilac1 Is this still required after #5562?

@artursouza artursouza removed this from the v1.11 milestone Mar 14, 2023
@trondhindenes
Copy link
Contributor Author

Hi, sorry for the late reply. I belive this issue is fixed now. I haven't verified it against the described use-case since I don't work on that project any more, but let's close this issue. I'll do some more testing when I have time, and if I discover anything iffy related to this I'll create a new issue (link to this one for history).

@trondhindenes
Copy link
Contributor Author

okay-to-close

@JoshVanL
Copy link
Contributor

JoshVanL commented Nov 21, 2023

I don't think graceful-shutdown-seconds covers the use case as described in the feature request as they are subtly but importantly different. Reoppening this issue as there are requests for it to be implemented.

I propose adding a block-shutdown-seconds option (default disabled) that will block the entire shutdown procedure until either 1. the app reports as unhealthy, or 2. the number of seconds has been reached on the block.

@JoshVanL JoshVanL reopened this Nov 21, 2023
@husseineAtBay
Copy link

husseineAtBay commented Dec 3, 2023

Hi @JoshVanL

Following the discussion in 7211, There is an interesting case for pubsub during shutdown which we found that Dapr doesn't permit the subscriber application to finish the last requests sent to the subscriber application, and the requests contexts are cancelled immediately, which the expectation was to enable the last requests to be processed and not to interrupt them (maybe worth defining a dedicated timeout mechanism for these last requests.
Will this be addressed in either graceful-shutdown-seconds or block-shutdown-seconds?

@JoshVanL
Copy link
Contributor

JoshVanL commented Dec 3, 2023

Hi @husseineAtBay, indeed the intention is to cover this case (with my proposal) through block-shutdown-seconds to allow the Dapr APIs to remain functional until the seconds limit has been reached or the app reports unhealthy.

@husseineAtBay
Copy link

Hi @JoshVanL

just to verify I was clear, we are talking about the requests sent from dapr side car to the subscriber app to invoke our application in pub/sub use case, you described the other direction (subscriber app calls to dapr side car during shutdown).

@JoshVanL
Copy link
Contributor

JoshVanL commented Dec 4, 2023

@husseineAtBay the other direction should also be covered, we make no distinction at this level- so long as your application is coded to continue to operate after a TERM signal has been received from Kubernetes up until it is ready to exit.

JoshVanL added a commit to JoshVanL/dapr-docs that referenced this issue Dec 4, 2023
Part of dapr/dapr#4313

Signed-off-by: joshvanl <me@joshvanl.dev>
JoshVanL added a commit to JoshVanL/dapr that referenced this issue Dec 4, 2023
Closes dapr#4313
Docs: dapr/docs#3893

PR adds the `--block-shutdown-seconds` CLI flag and corresponding
`dapr.io/block-shutdown-seconds` Kubernetes annotation which configures
Daprd to block the graceful shutdown procedure until _either_, the
block shutdown seconds has elapsed _or_ the application has become
unhealthy, as according to the normal app health status.

By default, this option is unset, and therefore there is no effect to
the current behaviour of graceful shutdown. When set, Daprd will
block the interrupt signal cascading into runtime until the above
requirements have been met.

The framework process `Cleanup` order has been reversed to mimic
`t.Cleanup` and allow the `logline` process to function correctly.

Signed-off-by: joshvanl <me@joshvanl.dev>
yaron2 pushed a commit that referenced this issue Dec 6, 2023
* Adds Daprd option `--block-shutdown-seconds`

Closes #4313
Docs: dapr/docs#3893

PR adds the `--block-shutdown-seconds` CLI flag and corresponding
`dapr.io/block-shutdown-seconds` Kubernetes annotation which configures
Daprd to block the graceful shutdown procedure until _either_, the
block shutdown seconds has elapsed _or_ the application has become
unhealthy, as according to the normal app health status.

By default, this option is unset, and therefore there is no effect to
the current behaviour of graceful shutdown. When set, Daprd will
block the interrupt signal cascading into runtime until the above
requirements have been met.

The framework process `Cleanup` order has been reversed to mimic
`t.Cleanup` and allow the `logline` process to function correctly.

Signed-off-by: joshvanl <me@joshvanl.dev>

* Revert if check on killing process exec proc cleanup

Signed-off-by: joshvanl <me@joshvanl.dev>

* Revert error ignore of processes already killed in unix

Signed-off-by: joshvanl <me@joshvanl.dev>

* Skip shutdown/graceful/block/healthy on windows.

* Skip shutdown/block/unhealthy test on windows.

* Linting

Signed-off-by: joshvanl <me@joshvanl.dev>

* Updates `dapr-block-shutdown-seconds` to `dapr-block-shutdown-duration`

Signed-off-by: joshvanl <me@joshvanl.dev>

---------

Signed-off-by: joshvanl <me@joshvanl.dev>
Co-authored-by: Loong Dai <long.dai@intel.com>
DeepanshuA pushed a commit to DeepanshuA/dapr that referenced this issue Dec 11, 2023
* Adds Daprd option `--block-shutdown-seconds`

Closes dapr#4313
Docs: dapr/docs#3893

PR adds the `--block-shutdown-seconds` CLI flag and corresponding
`dapr.io/block-shutdown-seconds` Kubernetes annotation which configures
Daprd to block the graceful shutdown procedure until _either_, the
block shutdown seconds has elapsed _or_ the application has become
unhealthy, as according to the normal app health status.

By default, this option is unset, and therefore there is no effect to
the current behaviour of graceful shutdown. When set, Daprd will
block the interrupt signal cascading into runtime until the above
requirements have been met.

The framework process `Cleanup` order has been reversed to mimic
`t.Cleanup` and allow the `logline` process to function correctly.

Signed-off-by: joshvanl <me@joshvanl.dev>

* Revert if check on killing process exec proc cleanup

Signed-off-by: joshvanl <me@joshvanl.dev>

* Revert error ignore of processes already killed in unix

Signed-off-by: joshvanl <me@joshvanl.dev>

* Skip shutdown/graceful/block/healthy on windows.

* Skip shutdown/block/unhealthy test on windows.

* Linting

Signed-off-by: joshvanl <me@joshvanl.dev>

* Updates `dapr-block-shutdown-seconds` to `dapr-block-shutdown-duration`

Signed-off-by: joshvanl <me@joshvanl.dev>

---------

Signed-off-by: joshvanl <me@joshvanl.dev>
Co-authored-by: Loong Dai <long.dai@intel.com>
JoshVanL added a commit to JoshVanL/dapr that referenced this issue Dec 14, 2023
* Adds Daprd option `--block-shutdown-seconds`

Closes dapr#4313
Docs: dapr/docs#3893

PR adds the `--block-shutdown-seconds` CLI flag and corresponding
`dapr.io/block-shutdown-seconds` Kubernetes annotation which configures
Daprd to block the graceful shutdown procedure until _either_, the
block shutdown seconds has elapsed _or_ the application has become
unhealthy, as according to the normal app health status.

By default, this option is unset, and therefore there is no effect to
the current behaviour of graceful shutdown. When set, Daprd will
block the interrupt signal cascading into runtime until the above
requirements have been met.

The framework process `Cleanup` order has been reversed to mimic
`t.Cleanup` and allow the `logline` process to function correctly.

Signed-off-by: joshvanl <me@joshvanl.dev>

* Revert if check on killing process exec proc cleanup

Signed-off-by: joshvanl <me@joshvanl.dev>

* Revert error ignore of processes already killed in unix

Signed-off-by: joshvanl <me@joshvanl.dev>

* Skip shutdown/graceful/block/healthy on windows.

* Skip shutdown/block/unhealthy test on windows.

* Linting

Signed-off-by: joshvanl <me@joshvanl.dev>

* Updates `dapr-block-shutdown-seconds` to `dapr-block-shutdown-duration`

Signed-off-by: joshvanl <me@joshvanl.dev>

---------

Signed-off-by: joshvanl <me@joshvanl.dev>
Co-authored-by: Loong Dai <long.dai@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants