Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Actor State: Ignore Components which hot reload the actor state #7441

Merged
merged 6 commits into from Jan 30, 2024

Conversation

JoshVanL
Copy link
Contributor

@JoshVanL JoshVanL commented Jan 24, 2024

Updates the hot reloading reconciler so that Daprd will log an error then ignore Components which are acting as the actor state store when they are hot reloaded.

See also #7433

@JoshVanL JoshVanL requested review from a team as code owners January 24, 2024 11:46
Copy link

codecov bot commented Jan 24, 2024

Codecov Report

Attention: 42 lines in your changes are missing coverage. Please review.

❗ No coverage uploaded for pull request base (master@2c09401). Click here to learn what that means.

Files Patch % Lines
pkg/runtime/processor/state/state.go 41.46% 19 Missing and 5 partials ⚠️
pkg/runtime/hotreload/reconciler/component.go 0.00% 17 Missing ⚠️
pkg/runtime/hotreload/reconciler/reconciler.go 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #7441   +/-   ##
=========================================
  Coverage          ?   62.40%           
=========================================
  Files             ?      244           
  Lines             ?    22157           
  Branches          ?        0           
=========================================
  Hits              ?    13827           
  Misses            ?     7192           
  Partials          ?     1138           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ItalyPaleAle
Copy link
Contributor

Can you explain what "exit with error" means? Panic?

@JoshVanL
Copy link
Contributor Author

Can you explain what "exit with error" means? Panic?

@ItalyPaleAle gracefully shutting down, logging an error, and exiting with a non-zero exit code.

@ItalyPaleAle
Copy link
Contributor

Can you explain what "exit with error" means? Panic?

@ItalyPaleAle gracefully shutting down, logging an error, and exiting with a non-zero exit code.

That would be a breaking change. Please, let's not force daprd to restart unexpectedly.

Copy link
Contributor

@ItalyPaleAle ItalyPaleAle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't make daprd exit if hot reloading for that component type isn't supported. Just show a message

@JoshVanL
Copy link
Contributor Author

@ItalyPaleAle it is not a breaking change because HotReloading is an opt-in feature.

@ItalyPaleAle
Copy link
Contributor

@ItalyPaleAle it is not a breaking change because HotReloading is an opt-in feature.

It's a preview feature today, but it's just a preview flag so eventually will be removed :)

@JoshVanL
Copy link
Contributor Author

We discussed this in the OSS endgame meeting- this is the safest thing to do as not doing anything will put daprd in both a corrupted state locally, as well as a replica set having inconsistent configuration when scaled up after a config change.

Why would we not want to exit here?- actors will likely not work anymore anyway when the config changes. What would be your preferred behaviour?

@ItalyPaleAle
Copy link
Contributor

I remember discussing that the actor state store should not be hot-reloaded.

I assumed that meant that we would continue running daprd without changing the component. So essentially changes to the actor state store component would be ignored.

@ItalyPaleAle
Copy link
Contributor

@JoshVanL Let me explain a bit more why crashing (or even just gracefully shutting down) is a problem.

Components like actor state stores are generally applied to a large number of apps - basically every app that can be an actor host.

Now, imagine that hot reloading is enabled (today that requires a feature flag, in the future it won't). You deploy a new Component in the cluster that updates the actor state store.

All of a sudden, EVERY actor host in your cluster will shut down. There's no orchestration, so every replica will shut down at the same time. Even though K8s will restart the apps, this is almost certainly going to cause a downtime, as all replicas are restarted at the same time.

If you do want to restart apps, it should be a staged rollout, but that's not possible unless you can interact with the K8s APIs.

Updates the hot reloading reconciler so that Daprd will exit error when
a actor state store enabled Component is hot reloaded. This is chosen
because today, the actors subsystem is not written with any closing or
dynamic support. Doing so will cause panics/corruption in its current
state.

Exiting error is the safest option as this ensures consistency across a
replica set and ensures there is no surprise for the user that behaviour
does not match given configuration.

See also dapr#7433

Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
@JoshVanL
Copy link
Contributor Author

JoshVanL commented Jan 25, 2024

@ItalyPaleAle I understand the concern however I think the same argument can be made for most/all Kubernetes resources in that editing them can cause catastrophic results to uptime. For example, editing an ingress rule, network policy or secret can equally cause downtime for a service. While we should aim to reduce the number of footguns for our users that needlessly make running Dapr at scale less reliable, the user should ultimately be in charge or their own destiny and they are in charge of orchestrating a proper roll out of config change. I also think it is more damaging to allow Dapr to ignore some config change and not others as this results in the user not being able to trust whether a deployed config is actually being used or not by the software.

Besides this, the statestore itself may or may not actually exist anymore meaning the actor subsystem will be in a corrupted or undefined state. Exiting error is always the safest procedure to perform in this case.

Signed-off-by: Artur Souza <asouza.pro@gmail.com>
@artursouza
Copy link
Member

I agree with @ItalyPaleAle on this one while also agreeing with @JoshVanL that actors runtime will get corrupted in hot reloading and needs a fix. I think availability beats usability. I see 2 alternatives to proceed with this:

  1. Detect that the component being reloaded is an actor state store and NOT hot reload it - emit a warning and continue with the old config.
  2. OR, add an arbitrary delay between 0s and 1min (with a warning) and shutdown the sidecar. This way, not all sidecars will be restarted at the same time.

Both are temporary solutions until we do support hot reloading for actor runtime.

@yaron2
Copy link
Member

yaron2 commented Jan 26, 2024

I agree with @ItalyPaleAle on this one while also agreeing with @JoshVanL that actors runtime will get corrupted in hot reloading and needs a fix. I think availability beats usability. I see 2 alternatives to proceed with this:

  1. Detect that the component being reloaded is an actor state store and NOT hot reload it - emit a warning and continue with the old config.
  2. OR, add an arbitrary delay between 0s and 1min (with a warning) and shutdown the sidecar. This way, not all sidecars will be restarted at the same time.

Both are temporary solutions until we do support hot reloading for actor runtime.

Hot reloading is a preview feature. It's perfectly fine to emit a warning and document this as being unsupported for now. So I agree with option 1

@ItalyPaleAle
Copy link
Contributor

I don't think adding a random delay is a good option. It could cause multiple apps to restart at the same time, and doesn't take into account that some apps may have a long(er) startup time.

If the actor state store changes, we should allow the administrators to restart the apps, so they can choose the best strategy for their situation. In some cases, a rollout 1-by-1 may be best: this could be the case for example if the password is just being rotated and the old one is still working temporarily (and avoids downtimes). In others, where the old actor state store is currently not working, then restarting all apps at once may be best.

Dapr doesn't know, and cannot know, what the best approach is. So I think we should err on the side of caution and simply not restart apps and let administrators do that.

As for whether hot-reloading an actor state store will ever be possible, I have my doubts. Because we don't know if the state store is currently broken or we're just rotating a password, as mentioned above, we cannot do that without causing downtime. We would, at the very least, need to halt all running actors, which is going to cause a downtime (even if less than restarting apps).

@artursouza
Copy link
Member

I don't think adding a random delay is a good option. It could cause multiple apps to restart at the same time, and doesn't take into account that some apps may have a long(er) startup time.

If the actor state store changes, we should allow the administrators to restart the apps, so they can choose the best strategy for their situation. In some cases, a rollout 1-by-1 may be best: this could be the case for example if the password is just being rotated and the old one is still working temporarily (and avoids downtimes). In others, where the old actor state store is currently not working, then restarting all apps at once may be best.

Dapr doesn't know, and cannot know, what the best approach is. So I think we should err on the side of caution and simply not restart apps and let administrators do that.

As for whether hot-reloading an actor state store will ever be possible, I have my doubts. Because we don't know if the state store is currently broken or we're just rotating a password, as mentioned above, we cannot do that without causing downtime. We would, at the very least, need to halt all running actors, which is going to cause a downtime (even if less than restarting apps).

Can you clarity how you propose we proceed? I see that you mentioned to simply emit a warning and do nothing - wouldn't run with a corrupt actor runtime be more problematic. Maybe I misundertood your suggestion.

@JoshVanL JoshVanL mentioned this pull request Jan 26, 2024
@ItalyPaleAle
Copy link
Contributor

My proposal is not hot-reload the actor state store. There's no way to do that without either a downtime (which may not be necessary) or without possibly having a corrupt state.

Josh's assumption is that when the actor state store component changes, it's because the old one doesn't work. I don't think that covers all cases. For example, one could be changing the actor state store just to rotate credentials (so the old ones are still valid at least for now).

As long as there are active actors, the actor state store can't be changed from underneath them as actors assume the state is consistent on the state store. So, to hot-reload the actor state store we would need to stop all active actors, and that causes a downtime.

The safest option would be to not do any hot-reloading for the actor state store, and let administrators decide if, how, and when restart the apps. This way, administrators can restart apps with a progressive rollout in the case of the credentials having changed, or could restart all apps at once if the current state store is broken.

@JoshVanL
Copy link
Contributor Author

My proposal is not hot-reload the actor state store. There's no way to do that without either a downtime (which may not be necessary) or without possibly having a corrupt state.

I think it should be a long term goal for Dapr to enable hot reloading for all subsystems in a graceful manor. This property is how users expect resources to behave in Kubernetes. We should avoid the case where desired state is never reconciled with the current state with no programmatic resolution beyond a fire and pray kubectl rollout after a time.Sleep. Killing pods is also expensive.

Josh's assumption is that when the actor state store component changes, it's because the old one doesn't work. I don't think that covers all cases. For example, one could be changing the actor state store just to rotate credentials (so the old ones are still valid at least for now).

This is not the case- the issue is that a hot reloaded component can be reconciled for a number of reasons including the Component type being changed. We cannot ignore a single component being reconciled while others are, it might be the case that a component with a different type now occupies that name. The entire hot reloading reconciler would need to be halted.

If we want to protect users from taking down an entire replica set at once we can add a delay with jitter when shutting down.

@ItalyPaleAle
Copy link
Contributor

If we want to protect users from taking down an entire replica set at once we can add a delay with jitter when shutting down.

I don't think this is a solution either.

  1. If the existing component is "broken" and an immediate rollout is needed, you are prolonging a downtime.
  2. Apps could take a long time to start up, sometimes multiple seconds. Kubernetes can know that due to readiness probes, but we can't, so we could still cause longer than needed downtimes if a progressive rollout was more appropriate.

If you do want to do this "right", perhaps we should integrate with K8s and tell K8s to perform a rollout? And then add a field to the component indicating if the rollout should be progressive or not.

@JoshVanL
Copy link
Contributor Author

in order to unblock the release and prevent releasing in the current state where it is possible to corrupt actors, I suggest we merge this PR as is and come up with a plan to improve this behaviour in the next release? Hot reloading is a preview feature so we are OK to change behaviour in this area in a future release.

@artursouza
Copy link
Member

My proposal is not hot-reload the actor state store. There's no way to do that without either a downtime (which may not be necessary) or without possibly having a corrupt state.

Josh's assumption is that when the actor state store component changes, it's because the old one doesn't work. I don't think that covers all cases. For example, one could be changing the actor state store just to rotate credentials (so the old ones are still valid at least for now).

As long as there are active actors, the actor state store can't be changed from underneath them as actors assume the state is consistent on the state store. So, to hot-reload the actor state store we would need to stop all active actors, and that causes a downtime.

The safest option would be to not do any hot-reloading for the actor state store, and let administrators decide if, how, and when restart the apps. This way, administrators can restart apps with a progressive rollout in the case of the credentials having changed, or could restart all apps at once if the current state store is broken.

Correct. That was my suggestion too (option 1) - to not hot reload a component that is an actor state store.

Signed-off-by: joshvanl <me@joshvanl.dev>
@JoshVanL JoshVanL changed the title Exit error if actor state store hot reloaded Actor State: Ignore Components which hot reload the actor state Jan 29, 2024
@yaron2 yaron2 merged commit 4d44561 into dapr:master Jan 30, 2024
21 of 22 checks passed
@JoshVanL JoshVanL added this to the v1.13 milestone Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants