New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Actor State: Ignore Components which hot reload the actor state #7441
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #7441 +/- ##
=========================================
Coverage ? 62.40%
=========================================
Files ? 244
Lines ? 22157
Branches ? 0
=========================================
Hits ? 13827
Misses ? 7192
Partials ? 1138 ☔ View full report in Codecov by Sentry. |
Can you explain what "exit with error" means? Panic? |
@ItalyPaleAle gracefully shutting down, logging an error, and exiting with a non-zero exit code. |
That would be a breaking change. Please, let's not force daprd to restart unexpectedly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't make daprd exit if hot reloading for that component type isn't supported. Just show a message
@ItalyPaleAle it is not a breaking change because HotReloading is an opt-in feature. |
It's a preview feature today, but it's just a preview flag so eventually will be removed :) |
We discussed this in the OSS endgame meeting- this is the safest thing to do as not doing anything will put daprd in both a corrupted state locally, as well as a replica set having inconsistent configuration when scaled up after a config change. Why would we not want to exit here?- actors will likely not work anymore anyway when the config changes. What would be your preferred behaviour? |
I remember discussing that the actor state store should not be hot-reloaded. I assumed that meant that we would continue running daprd without changing the component. So essentially changes to the actor state store component would be ignored. |
@JoshVanL Let me explain a bit more why crashing (or even just gracefully shutting down) is a problem. Components like actor state stores are generally applied to a large number of apps - basically every app that can be an actor host. Now, imagine that hot reloading is enabled (today that requires a feature flag, in the future it won't). You deploy a new Component in the cluster that updates the actor state store. All of a sudden, EVERY actor host in your cluster will shut down. There's no orchestration, so every replica will shut down at the same time. Even though K8s will restart the apps, this is almost certainly going to cause a downtime, as all replicas are restarted at the same time. If you do want to restart apps, it should be a staged rollout, but that's not possible unless you can interact with the K8s APIs. |
Updates the hot reloading reconciler so that Daprd will exit error when a actor state store enabled Component is hot reloaded. This is chosen because today, the actors subsystem is not written with any closing or dynamic support. Doing so will cause panics/corruption in its current state. Exiting error is the safest option as this ensures consistency across a replica set and ensures there is no surprise for the user that behaviour does not match given configuration. See also dapr#7433 Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
8aaf919
to
40f564b
Compare
@ItalyPaleAle I understand the concern however I think the same argument can be made for most/all Kubernetes resources in that editing them can cause catastrophic results to uptime. For example, editing an ingress rule, network policy or secret can equally cause downtime for a service. While we should aim to reduce the number of footguns for our users that needlessly make running Dapr at scale less reliable, the user should ultimately be in charge or their own destiny and they are in charge of orchestrating a proper roll out of config change. I also think it is more damaging to allow Dapr to ignore some config change and not others as this results in the user not being able to trust whether a deployed config is actually being used or not by the software. Besides this, the statestore itself may or may not actually exist anymore meaning the actor subsystem will be in a corrupted or undefined state. Exiting error is always the safest procedure to perform in this case. |
Signed-off-by: Artur Souza <asouza.pro@gmail.com>
I agree with @ItalyPaleAle on this one while also agreeing with @JoshVanL that actors runtime will get corrupted in hot reloading and needs a fix. I think availability beats usability. I see 2 alternatives to proceed with this:
Both are temporary solutions until we do support hot reloading for actor runtime. |
Hot reloading is a preview feature. It's perfectly fine to emit a warning and document this as being unsupported for now. So I agree with option 1 |
I don't think adding a random delay is a good option. It could cause multiple apps to restart at the same time, and doesn't take into account that some apps may have a long(er) startup time. If the actor state store changes, we should allow the administrators to restart the apps, so they can choose the best strategy for their situation. In some cases, a rollout 1-by-1 may be best: this could be the case for example if the password is just being rotated and the old one is still working temporarily (and avoids downtimes). In others, where the old actor state store is currently not working, then restarting all apps at once may be best. Dapr doesn't know, and cannot know, what the best approach is. So I think we should err on the side of caution and simply not restart apps and let administrators do that. As for whether hot-reloading an actor state store will ever be possible, I have my doubts. Because we don't know if the state store is currently broken or we're just rotating a password, as mentioned above, we cannot do that without causing downtime. We would, at the very least, need to halt all running actors, which is going to cause a downtime (even if less than restarting apps). |
Can you clarity how you propose we proceed? I see that you mentioned to simply emit a warning and do nothing - wouldn't run with a corrupt actor runtime be more problematic. Maybe I misundertood your suggestion. |
My proposal is not hot-reload the actor state store. There's no way to do that without either a downtime (which may not be necessary) or without possibly having a corrupt state. Josh's assumption is that when the actor state store component changes, it's because the old one doesn't work. I don't think that covers all cases. For example, one could be changing the actor state store just to rotate credentials (so the old ones are still valid at least for now). As long as there are active actors, the actor state store can't be changed from underneath them as actors assume the state is consistent on the state store. So, to hot-reload the actor state store we would need to stop all active actors, and that causes a downtime. The safest option would be to not do any hot-reloading for the actor state store, and let administrators decide if, how, and when restart the apps. This way, administrators can restart apps with a progressive rollout in the case of the credentials having changed, or could restart all apps at once if the current state store is broken. |
I think it should be a long term goal for Dapr to enable hot reloading for all subsystems in a graceful manor. This property is how users expect resources to behave in Kubernetes. We should avoid the case where desired state is never reconciled with the current state with no programmatic resolution beyond a fire and pray
This is not the case- the issue is that a hot reloaded component can be reconciled for a number of reasons including the Component type being changed. We cannot ignore a single component being reconciled while others are, it might be the case that a component with a different type now occupies that name. The entire hot reloading reconciler would need to be halted. If we want to protect users from taking down an entire replica set at once we can add a delay with jitter when shutting down. |
I don't think this is a solution either.
If you do want to do this "right", perhaps we should integrate with K8s and tell K8s to perform a rollout? And then add a field to the component indicating if the rollout should be progressive or not. |
in order to unblock the release and prevent releasing in the current state where it is possible to corrupt actors, I suggest we merge this PR as is and come up with a plan to improve this behaviour in the next release? Hot reloading is a preview feature so we are OK to change behaviour in this area in a future release. |
Correct. That was my suggestion too (option 1) - to not hot reload a component that is an actor state store. |
Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
Updates the hot reloading reconciler so that Daprd will log an error then ignore Components which are acting as the actor state store when they are hot reloaded.
See also #7433