Actor State: Ignore Components which hot reload the actor state #7441

JoshVanL · 2024-01-24T11:46:31Z

Updates the hot reloading reconciler so that Daprd will log an error then ignore Components which are acting as the actor state store when they are hot reloaded.
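
As a rough illustration of the behaviour described above (not the code in this PR), the sketch below filters a batch of updated Components and skips any that act as the actor state store, logging an error instead of reloading them. The `Component` type, `isActorStateStore`, and `filterReloadable` are hypothetical names for this example; the only Dapr convention assumed is the `actorStateStore: "true"` metadata item on a state store Component.

```go
package main

import (
	"fmt"
	"strings"
)

// Component is a hypothetical, trimmed-down stand-in for a Dapr Component
// resource. The real type carries many more fields.
type Component struct {
	Name     string
	Type     string
	Metadata map[string]string
}

// isActorStateStore reports whether c is a state store Component flagged
// with the actorStateStore metadata item.
func isActorStateStore(c Component) bool {
	if !strings.HasPrefix(c.Type, "state.") {
		return false
	}
	return strings.EqualFold(c.Metadata["actorStateStore"], "true")
}

// filterReloadable splits updated Components into those that may be hot
// reloaded and those that are ignored because they are the actor state store.
func filterReloadable(updated []Component) (reload, ignored []Component) {
	for _, c := range updated {
		if isActorStateStore(c) {
			// Assumed behaviour for this sketch: log an error and keep the old config.
			fmt.Printf("error: component %q is an actor state store; ignoring hot reload\n", c.Name)
			ignored = append(ignored, c)
			continue
		}
		reload = append(reload, c)
	}
	return reload, ignored
}

func main() {
	updated := []Component{
		{Name: "statestore", Type: "state.redis", Metadata: map[string]string{"actorStateStore": "true"}},
		{Name: "pubsub", Type: "pubsub.redis"},
	}
	reload, ignored := filterReloadable(updated)
	fmt.Println(len(reload), "to reload,", len(ignored), "ignored")
}
```

A real reconciler would presumably also need to consider the currently loaded Component, not just the incoming one, since either side could be the actor state store.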

Codecov Report

Attention: 42 lines in your changes are missing coverage. Please review.

❗ No coverage uploaded for pull request base (master@2c09401).

Files	Patch %	Lines
pkg/runtime/processor/state/state.go	41.46%	19 Missing and 5 partials ⚠️
pkg/runtime/hotreload/reconciler/component.go	0.00%	17 Missing ⚠️
pkg/runtime/hotreload/reconciler/reconciler.go	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master    #7441   +/-   ##
=========================================
  Coverage          ?   62.40%           
=========================================
  Files             ?      244           
  Lines             ?    22157           
  Branches          ?        0           
=========================================
  Hits              ?    13827           
  Misses            ?     7192           
  Partials          ?     1138


ItalyPaleAle · 2024-01-24T16:20:33Z

Can you explain what "exit with error" means? Panic?

JoshVanL · 2024-01-24T16:37:53Z

> Can you explain what "exit with error" means? Panic?

@ItalyPaleAle gracefully shutting down, logging an error, and exiting with a non-zero exit code.

ItalyPaleAle · 2024-01-24T16:39:33Z

> > Can you explain what "exit with error" means? Panic?
>
> @ItalyPaleAle gracefully shutting down, logging an error, and exiting with a non-zero exit code.

That would be a breaking change. Please, let's not force daprd to restart unexpectedly.

ItalyPaleAle

Please don't make daprd exit if hot reloading for that component type isn't supported. Just show a message

JoshVanL · 2024-01-24T16:40:43Z

@ItalyPaleAle it is not a breaking change because HotReloading is an opt-in feature.

ItalyPaleAle · 2024-01-24T16:42:19Z

> @ItalyPaleAle it is not a breaking change because HotReloading is an opt-in feature.

It's a preview feature today, but it's just a preview flag so eventually will be removed :)

JoshVanL · 2024-01-24T16:44:52Z

We discussed this in the OSS endgame meeting: this is the safest thing to do, as doing nothing would leave daprd in a corrupted state locally and leave a replica set with inconsistent configuration when it is scaled up after a config change.

Why would we not want to exit here? Actors will likely not work anymore anyway when the config changes. What would be your preferred behaviour?

ItalyPaleAle · 2024-01-24T17:11:10Z

I remember discussing that the actor state store should not be hot-reloaded.

I assumed that meant that we would continue running daprd without changing the component. So essentially changes to the actor state store component would be ignored.

ItalyPaleAle · 2024-01-25T04:35:13Z

@JoshVanL Let me explain a bit more why crashing (or even just gracefully shutting down) is a problem.

Components like actor state stores are generally applied to a large number of apps - basically every app that can be an actor host.

Now, imagine that hot reloading is enabled (today that requires a feature flag, in the future it won't). You deploy a new Component in the cluster that updates the actor state store.

All of a sudden, EVERY actor host in your cluster will shut down. There's no orchestration, so every replica will shut down at the same time. Even though K8s will restart the apps, this is almost certainly going to cause a downtime, as all replicas are restarted at the same time.

If you do want to restart apps, it should be a staged rollout, but that's not possible unless you can interact with the K8s APIs.

Updates the hot reloading reconciler so that Daprd will exit error when an actor state store enabled Component is hot reloaded.

This is chosen because today, the actors subsystem is not written with any closing or dynamic support. Doing so will cause panics/corruption in its current state.

Exiting error is the safest option as this ensures consistency across a replica set and ensures there is no surprise for the user that behaviour does not match given configuration.

See also dapr#7433

Signed-off-by: joshvanl <me@joshvanl.dev>

JoshVanL · 2024-01-25T21:03:23Z

@ItalyPaleAle I understand the concern, however I think the same argument can be made for most/all Kubernetes resources, in that editing them can cause catastrophic results to uptime. For example, editing an ingress rule, network policy or secret can equally cause downtime for a service. While we should aim to reduce the number of footguns for our users that needlessly make running Dapr at scale less reliable, the user should ultimately be in charge of their own destiny, and they are in charge of orchestrating a proper roll out of config change. I also think it is more damaging to allow Dapr to ignore some config changes and not others, as this results in the user not being able to trust whether a deployed config is actually being used by the software.

Besides this, the statestore itself may or may not actually exist anymore meaning the actor subsystem will be in a corrupted or undefined state. Exiting error is always the safest procedure to perform in this case.

Signed-off-by: Artur Souza <asouza.pro@gmail.com>

artursouza · 2024-01-26T18:06:26Z

I agree with @ItalyPaleAle on this one while also agreeing with @JoshVanL that actors runtime will get corrupted in hot reloading and needs a fix. I think availability beats usability. I see 2 alternatives to proceed with this:

1. Detect that the component being reloaded is an actor state store and NOT hot reload it - emit a warning and continue with the old config.
2. OR, add an arbitrary delay between 0s and 1min (with a warning) and shutdown the sidecar. This way, not all sidecars will be restarted at the same time.

Both are temporary solutions until we do support hot reloading for actor runtime.

yaron2 · 2024-01-26T18:22:43Z

> I agree with @ItalyPaleAle on this one while also agreeing with @JoshVanL that actors runtime will get corrupted in hot reloading and needs a fix. I think availability beats usability. I see 2 alternatives to proceed with this:
>
> 1. Detect that the component being reloaded is an actor state store and NOT hot reload it - emit a warning and continue with the old config.
> 2. OR, add an arbitrary delay between 0s and 1min (with a warning) and shutdown the sidecar. This way, not all sidecars will be restarted at the same time.
>
> Both are temporary solutions until we do support hot reloading for actor runtime.

Hot reloading is a preview feature. It's perfectly fine to emit a warning and document this as being unsupported for now. So I agree with option 1

ItalyPaleAle · 2024-01-26T18:22:46Z

I don't think adding a random delay is a good option. It could cause multiple apps to restart at the same time, and doesn't take into account that some apps may have a long(er) startup time.

If the actor state store changes, we should allow the administrators to restart the apps, so they can choose the best strategy for their situation. In some cases, a rollout 1-by-1 may be best: this could be the case for example if the password is just being rotated and the old one is still working temporarily (and avoids downtimes). In others, where the old actor state store is currently not working, then restarting all apps at once may be best.

Dapr doesn't know, and cannot know, what the best approach is. So I think we should err on the side of caution and simply not restart apps and let administrators do that.

As for whether hot-reloading an actor state store will ever be possible, I have my doubts. Because we don't know if the state store is currently broken or we're just rotating a password, as mentioned above, we cannot do that without causing downtime. We would, at the very least, need to halt all running actors, which is going to cause a downtime (even if less than restarting apps).

artursouza · 2024-01-26T18:29:38Z

> I don't think adding a random delay is a good option. It could cause multiple apps to restart at the same time, and doesn't take into account that some apps may have a long(er) startup time.
>
> If the actor state store changes, we should allow the administrators to restart the apps, so they can choose the best strategy for their situation. In some cases, a rollout 1-by-1 may be best: this could be the case for example if the password is just being rotated and the old one is still working temporarily (and avoids downtimes). In others, where the old actor state store is currently not working, then restarting all apps at once may be best.
>
> Dapr doesn't know, and cannot know, what the best approach is. So I think we should err on the side of caution and simply not restart apps and let administrators do that.
>
> As for whether hot-reloading an actor state store will ever be possible, I have my doubts. Because we don't know if the state store is currently broken or we're just rotating a password, as mentioned above, we cannot do that without causing downtime. We would, at the very least, need to halt all running actors, which is going to cause a downtime (even if less than restarting apps).

Can you clarify how you propose we proceed? I see that you mentioned simply emitting a warning and doing nothing. Wouldn't running with a corrupt actor runtime be more problematic? Maybe I misunderstood your suggestion.

ItalyPaleAle · 2024-01-26T18:34:29Z

My proposal is to not hot-reload the actor state store. There's no way to do that without either a downtime (which may not be necessary) or without possibly having a corrupt state.

Josh's assumption is that when the actor state store component changes, it's because the old one doesn't work. I don't think that covers all cases. For example, one could be changing the actor state store just to rotate credentials (so the old ones are still valid at least for now).

As long as there are active actors, the actor state store can't be changed from underneath them as actors assume the state is consistent on the state store. So, to hot-reload the actor state store we would need to stop all active actors, and that causes a downtime.

The safest option would be to not do any hot-reloading for the actor state store, and let administrators decide if, how, and when to restart the apps. This way, administrators can restart apps with a progressive rollout in the case of the credentials having changed, or could restart all apps at once if the current state store is broken.

JoshVanL · 2024-01-26T19:17:16Z

> My proposal is to not hot-reload the actor state store. There's no way to do that without either a downtime (which may not be necessary) or without possibly having a corrupt state.

I think it should be a long term goal for Dapr to enable hot reloading for all subsystems in a graceful manner. This property is how users expect resources to behave in Kubernetes. We should avoid the case where desired state is never reconciled with the current state, with no programmatic resolution beyond a fire-and-pray kubectl rollout after a time.Sleep. Killing pods is also expensive.

> Josh's assumption is that when the actor state store component changes, it's because the old one doesn't work. I don't think that covers all cases. For example, one could be changing the actor state store just to rotate credentials (so the old ones are still valid at least for now).

This is not the case. The issue is that a hot reloaded component can be reconciled for a number of reasons, including the Component type being changed. We cannot ignore a single component being reconciled while others are; it might be the case that a component with a different type now occupies that name. The entire hot reloading reconciler would need to be halted.
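
To make the "different type now occupies that name" case concrete, here is a minimal sketch that compares the loaded set of Components against the desired set by name and reports names whose type differs. The `Component` struct and `typeChanges` function are hypothetical and exist only for this illustration; they do not reflect how the reconciler in the PR is actually structured.

```go
package main

import "fmt"

// Component is a hypothetical, minimal view of a Component: just its
// name and its type (for example "state.redis").
type Component struct {
	Name string
	Type string
}

// typeChanges returns the names present in both sets whose type differs
// between the loaded and the desired Components, in the order they appear
// in desired.
func typeChanges(loaded, desired []Component) []string {
	current := make(map[string]string, len(loaded))
	for _, c := range loaded {
		current[c.Name] = c.Type
	}

	var changed []string
	for _, c := range desired {
		if t, ok := current[c.Name]; ok && t != c.Type {
			changed = append(changed, c.Name)
		}
	}
	return changed
}

func main() {
	loaded := []Component{
		{Name: "statestore", Type: "state.redis"},
		{Name: "pubsub", Type: "pubsub.redis"},
	}
	desired := []Component{
		{Name: "statestore", Type: "pubsub.kafka"}, // same name, different type
		{Name: "pubsub", Type: "pubsub.redis"},
	}
	fmt.Println(typeChanges(loaded, desired))
}
```

In this example the name `statestore` is reused by a Component of another type, which is the situation where silently ignoring the update would leave the loaded and desired configuration out of sync.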

If we want to protect users from taking down an entire replica set at once we can add a delay with jitter when shutting down.

ItalyPaleAle · 2024-01-26T19:22:41Z

> If we want to protect users from taking down an entire replica set at once we can add a delay with jitter when shutting down.

I don't think this is a solution either.

If the existing component is "broken" and an immediate rollout is needed, you are prolonging a downtime.
Apps could take a long time to start up, sometimes multiple seconds. Kubernetes can know that due to readiness probes, but we can't, so we could still cause longer than needed downtimes if a progressive rollout was more appropriate.

If you do want to do this "right", perhaps we should integrate with K8s and tell K8s to perform a rollout? And then add a field to the component indicating if the rollout should be progressive or not.

JoshVanL · 2024-01-29T12:33:25Z

In order to unblock the release and prevent releasing in the current state, where it is possible to corrupt actors, I suggest we merge this PR as is and come up with a plan to improve this behaviour in the next release? Hot reloading is a preview feature, so we are OK to change behaviour in this area in a future release.

artursouza · 2024-01-29T15:45:54Z

> My proposal is to not hot-reload the actor state store. There's no way to do that without either a downtime (which may not be necessary) or without possibly having a corrupt state.
>
> Josh's assumption is that when the actor state store component changes, it's because the old one doesn't work. I don't think that covers all cases. For example, one could be changing the actor state store just to rotate credentials (so the old ones are still valid at least for now).
>
> As long as there are active actors, the actor state store can't be changed from underneath them as actors assume the state is consistent on the state store. So, to hot-reload the actor state store we would need to stop all active actors, and that causes a downtime.
>
> The safest option would be to not do any hot-reloading for the actor state store, and let administrators decide if, how, and when to restart the apps. This way, administrators can restart apps with a progressive rollout in the case of the credentials having changed, or could restart all apps at once if the current state store is broken.

Correct. That was my suggestion too (option 1) - to not hot reload a component that is an actor state store.

Signed-off-by: joshvanl <me@joshvanl.dev>

JoshVanL requested review from a team as code owners January 24, 2024 11:46

ItalyPaleAle requested changes Jan 24, 2024

JoshVanL added 2 commits January 25, 2024 20:36

Remove expected workflow.dapr component from actorastate tests

40f564b

Signed-off-by: joshvanl <me@joshvanl.dev>

JoshVanL force-pushed the hotreloading-actorstate branch from 8aaf919 to 40f564b Compare January 25, 2024 20:53

Merge branch 'master' into hotreloading-actorstate

86eaeed

Signed-off-by: Artur Souza <asouza.pro@gmail.com>

JoshVanL mentioned this pull request Jan 26, 2024

v1.13 Endgame #7410

Open

JoshVanL added 2 commits January 29, 2024 17:29

Increase assert Eventually timeout

68d7151

Signed-off-by: joshvanl <me@joshvanl.dev>

Only log error and don't exit error when reocnciling actor state store

be33a93

Signed-off-by: joshvanl <me@joshvanl.dev>

JoshVanL changed the title ~~Exit error if actor state store hot reloaded~~ Actor State: Ignore Components which hot reload the actor state Jan 29, 2024

Merge branch 'master' into hotreloading-actorstate

0af64ea

artursouza approved these changes Jan 29, 2024

artursouza requested a review from ItalyPaleAle January 29, 2024 23:44

ItalyPaleAle approved these changes Jan 30, 2024

yaron2 merged commit 4d44561 into dapr:master Jan 30, 2024
21 of 22 checks passed

JoshVanL added this to the v1.13 milestone Feb 12, 2024
