getReleaseState may sometimes cause an unwanted rollback #227

Open
kovayur opened this issue Jul 31, 2023 · 0 comments

Problem

If a chart contains a CR and I remove its definition (the CRD) from the cluster, the operator may end up in a broken state:

  1. If I have a single revision, the operator constantly prints the error:
     rollback failed: release: not found: original upgrade error: unable to build kubernetes objects from current release manifest: [resource mapping not found for name: "stackrox-central" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "stackrox-scanner" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1" ensure CRDs are installed first]
  2. If I have more than one revision and the previous revision does NOT contain a CR, I get the endless reconcile loop described in "Failed upgrade may lead to an endless loop of rollbacks" #224.
  3. If I have more than one revision and the previous revision also contains a CR, the rollback fails and the release gets stuck in the pending-rollback state. The change "Allow marking releases stuck in a pending state as failed" #116 recovers the release.
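The three cases can be summarized as a small decision function. This is only a sketch of the observed behaviour, not operator code: the function name `outcome` and its parameters are hypothetical and exist purely for illustration.

```go
package main

import "fmt"

// outcome is a hypothetical helper that maps the release history to the
// behaviour observed in the cases listed in this issue.
// revisions is the number of stored release revisions;
// previousHasRemovedKind reports whether the previous revision also
// contains a resource whose definition was removed from the cluster.
func outcome(revisions int, previousHasRemovedKind bool) string {
	switch {
	case revisions <= 1:
		// Nothing to roll back to, so the rollback itself fails.
		return "rollback failed: release: not found"
	case !previousHasRemovedKind:
		// The rollback succeeds, then the next upgrade fails again.
		return "endless reconcile loop (#224)"
	default:
		// The rollback cannot build the previous manifest either.
		return "stuck in pending-rollback (recoverable via #116)"
	}
}

func main() {
	fmt.Println(outcome(1, false))
	fmt.Println(outcome(2, false))
	fmt.Println(outcome(2, true))
}
```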

We discovered this issue on clusters that were upgraded to Kubernetes 1.25 and had their PSPs (PodSecurityPolicy) removed. However, the same applies to any CRD.

Root cause

getReleaseState calls actionClient.Upgrade with the DryRun flag. This function tries to infer whether the release was changed in storage based on the return value of Upgrade.Run. From the comment, it seems to me that the code assumes a non-nil returned release means the release was recorded in storage, but apparently that does not hold with DryRun (at least with Helm v3.12.1):

// As of Helm 2.13, if Upgrade returns a non-nil release, that
// means the release was also recorded in the release store.
// Therefore, we should perform the rollback when we have a non-nil
// release. Any rollback error here would be unexpected, so always
// log both the update and rollback errors.

Thus, when the dry-run upgrade fails, the action client performs a non-dry-run rollback to the previous revision.
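To make the failure mode concrete, here is a minimal simulation using only the standard library. It does not call Helm: `Release`, `store`, `upgrade`, and `shouldRollback` are hypothetical stand-ins, and the order of checks inside `upgrade` is an assumption modelled on the behaviour reported in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// Release is a hypothetical stand-in for a Helm release object.
type Release struct {
	Name    string
	Version int
}

// store is a hypothetical stand-in for the release storage.
type store struct {
	revisions int
}

// upgrade models the reported behaviour (an assumption, not Helm code):
// if the current manifest cannot be built, a non-nil release is returned
// together with the error, before the dry-run flag is even considered.
func (s *store) upgrade(dryRun, manifestBuildable bool) (*Release, error) {
	upgraded := &Release{Name: "example", Version: s.revisions + 1}
	if !manifestBuildable {
		return upgraded, errors.New("unable to build kubernetes objects from current release manifest")
	}
	if dryRun {
		return upgraded, nil // nothing is recorded in a dry run
	}
	s.revisions++ // a real upgrade records a new revision
	return upgraded, nil
}

// shouldRollback mirrors the inference in the quoted comment:
// a non-nil release on error is taken to mean "recorded in the store".
func shouldRollback(rel *Release, err error) bool {
	return err != nil && rel != nil
}

func main() {
	s := &store{revisions: 1}
	rel, err := s.upgrade(true, false) // dry run against an unbuildable manifest
	fmt.Println("revisions in store:", s.revisions)
	fmt.Println("rollback triggered:", shouldRollback(rel, err))
}
```

In this model the store still holds a single revision after the dry run, yet the rollback condition is satisfied, which matches the unwanted rollback described above.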

From the Helm upgrade source code:
https://github.com/helm/helm/blob/main/pkg/action/upgrade.go#L293-L298
