Upgrade from Flux 2.1.x to 2.2.2 leaves most HelmReleases in a broken state #4524

Closed
1 task done
wilmardo opened this issue Jan 3, 2024 · 12 comments

Comments

@wilmardo

wilmardo commented Jan 3, 2024

Describe the bug

It seems that after the upgrade some HelmReleases are migrated to the newer API object but others aren't. This doesn't resolve itself unless the HelmRelease is removed or reconcile --force is used.
The most obvious symptom is that the message in flux get hr doesn't show the new information. The more serious problem is that dependencies aren't considered ready even when the dependency is Ready and its message shows Helm upgrade succeeded (see ingress-nginx and cert-manager in the output below, for example).

This seems pretty similar to what this PR is trying to solve:
fluxcd/helm-controller#850

That fix should be in v2.2.2, where I am still having this issue.

flux get hr output right after the upgrade:

# flux get hr
NAME                    REVISION                SUSPENDED       READY   MESSAGE
azure-workload-identity 1.1.0                   False           True    Helm upgrade succeeded
cert-exporter           3.4.1                   False           True    Helm upgrade succeeded for release guida-system/cert-exporter.v4 with chart cert-exporter@3.4.1
cert-manager            v1.13.1                 False           True    Helm upgrade succeeded
flux                    2.12.2                  False           True    Helm upgrade succeeded for release flux-system/flux.v5 with chart flux2@2.12.2
helm-exporter           1.2.11+7a3ebb3          False           True    Helm upgrade succeeded
ingress-nginx           4.8.3                   False           False   dependency 'flux-system/cert-manager' is not ready
kyverno                 3.1.1                   False           True    Helm upgrade succeeded
prometheus-operator     51.2.0                  False           True    Helm upgrade succeeded for release guida-system/prometheus-operator.v3 with chart kube-prometheus-stack@51.2.0
rbac-manager            1.17.6                  False           True    Helm upgrade succeeded
sealed-secrets          2.13.0                  False           True    Helm upgrade succeeded
velero                  5.0.2                   False           True    Helm upgrade succeeded

None of the releases showing Helm upgrade succeeded or dependency 'flux-system/xxx' is not ready will switch to the new message without a --force or a deletion.
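To pick out the affected releases from a longer listing, a small filter can help. This is only a sketch: the function name list_legacy_status is made up for illustration, and it assumes the old-style messages look exactly like the two forms quoted above (a line ending in "Helm upgrade succeeded" or in "dependency ... is not ready").

```shell
#!/bin/sh
# list_legacy_status (hypothetical helper): read `flux get hr` output on stdin
# and print the NAME column of every row whose MESSAGE is still in the short,
# pre-migration form. Rows with the longer "for release ... with chart ..."
# message do not end in "succeeded", so they are skipped.
list_legacy_status() {
  awk '/Helm upgrade succeeded[ \t]*$/ || /dependency .* is not ready[ \t]*$/ { print $1 }'
}

# Usage against a live cluster (assumes the flux CLI is on PATH):
#   flux get hr -n flux-system | list_legacy_status
# With -A the first column is the namespace, so the filter would need adjusting.
```

Run against the listing above, this should print every release except cert-exporter, flux and prometheus-operator.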

I tried:

apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  annotations:
    fluxcd.io/upgradeTo: v2beta2
spec:
  driftDetection:
    ignore:
    - paths:
      - /spec/replicas
      target:
        kind: Deployment
    mode: enabled

It would be extremely nice if the upgrade were autonomous and did not require human intervention to run reconcile --force on all HelmReleases. The --force will also break things on some occasions (AWS with an NLB on a Service, for example).

Steps to reproduce

  1. Have Flux 2.1.1 running on the cluster with several HelmReleases
  2. Upgrade Flux to 2.2.2
  3. See the output of flux get hr with different messages and stuck dependencies

Expected behavior

All HelmReleases show the new message and are accepted as ready.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v2.2.2

Flux check

► checking prerequisites
✔ Kubernetes 1.27.5+k3s1 >=1.26.0-0
► checking version in cluster
✔ distribution: flux-2.2.2
✔ bootstrapped: false
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.37.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.2.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.2.3
► checking crds
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta2
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

Reconcile log of a 'stuck' HelmRelease:

{"level":"info","ts":"2024-01-03T15:34:41.074Z","msg":"HelmChart/flux-system/flux-system-azure-workload-identity with SourceRef 'HelmRepository/flux-system/guida-mirror' is in-sync","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"azure-workload-identity","namespace":"flux-system"},"namespace":"flux-system","name":"azure-workload-identity","reconcileID":"fc1561fa-9ef1-42b9-8578-21e9585b9ff4"}
{"level":"info","ts":"2024-01-03T15:34:41.299Z","msg":"release in-sync with desired state","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"azure-workload-identity","namespace":"flux-system"},"namespace":"flux-system","name":"azure-workload-identity","reconcileID":"fc1561fa-9ef1-42b9-8578-21e9585b9ff4"}

Reconcile log of an updated HelmRelease:

{"level":"info","ts":"2024-01-03T15:36:07.133Z","msg":"HelmChart/flux-system/flux-system-cert-exporter with SourceRef 'HelmRepository/flux-system/guida-mirror' is in-sync","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"cert-exporter","namespace":"flux-system"},"namespace":"flux-system","name":"cert-exporter","reconcileID":"5677d347-ebf3-48a5-8c98-3caa26b24dc9"}
{"level":"info","ts":"2024-01-03T15:36:07.348Z","msg":"release in-sync with desired state","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"cert-exporter","namespace":"flux-system"},"namespace":"flux-system","name":"cert-exporter","reconcileID":"5677d347-ebf3-48a5-8c98-3caa26b24dc9"}

All seems happy in the helm-controller to me :)

Code of Conduct

  • I agree to follow this project's Code of Conduct
@wilmardo
Author

wilmardo commented Jan 3, 2024

Let me know if I can provide more information. I can consistently recreate this behavior on my local cluster.

@siegenthalerroger

I can confirm that this also happened to me even with the 2.2.2 release.

@wilmardo
Author

wilmardo commented Jan 9, 2024

This might be related:
#4529

@razvanphp

razvanphp commented Jan 14, 2024

This is probably related:

# flux reconcile helmrelease rabbitmq
✗ failed to get API group resources: unable to retrieve the complete list of server APIs: helm.toolkit.fluxcd.io/v2beta2: the server could not find the requested resource

The thing is, my HelmRelease has apiVersion v2beta1, not v2beta2, and my check command does not even show v2beta2:

# flux check
► checking prerequisites
✔ Kubernetes 1.26.6+k3s-e18037a7-dirty >=1.26.0-0
► checking version in cluster
✔ distribution: flux-v2.1.0
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.36.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.1.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.1.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.1.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

Not sure why it says distribution: flux-v2.1.0, because I have:

# flux --version
flux version 2.2.2

After reverting to flux 2.1.0, everything works again.

@wilmardo
Author

wilmardo commented Jan 15, 2024

@razvanphp Your issue isn't related per se. The error you see occurs because the controllers in your cluster are still at 2.1.0 while your CLI has been updated to 2.2.2.
The 2.2.x CLI isn't backwards compatible with the 2.1.x release:

Yes, the CLI ensures backwards compatibility only for GA APIs, for beta versions you need a CLI that matches the cluster version.
#4490 (comment)
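
A quick way to spot this kind of CLI/cluster skew before running other commands is to compare the two version strings. The sketch below is an illustration only: the helper names (cli_version, cluster_version, same_minor) are made up, and the parsing assumes the output formats shown in this thread ("flux version 2.2.2" and "✔ distribution: flux-v2.1.0").

```shell
#!/bin/sh
# cli_version (hypothetical helper): extract "2.2.2" from "flux version 2.2.2".
cli_version() {
  awk '/^flux version / { print $3 }'
}

# cluster_version (hypothetical helper): extract "2.1.0" from a `flux check`
# line such as "✔ distribution: flux-v2.1.0" (the "v" is optional).
cluster_version() {
  awk '/distribution:/ { v = $NF; sub(/^flux-v?/, "", v); print v }'
}

# same_minor A B: succeed only when both versions share the same major.minor.
same_minor() {
  [ "$(printf '%s\n' "$1" | cut -d. -f1,2)" = "$(printf '%s\n' "$2" | cut -d. -f1,2)" ]
}

# Usage (assumes the flux CLI is on PATH; 2>&1 is there in case the check
# output is written to stderr):
#   cli=$(flux --version | cli_version)
#   cluster=$(flux check 2>&1 | cluster_version)
#   same_minor "$cli" "$cluster" || echo "CLI $cli does not match cluster $cluster"
```

With the values from the comment above (CLI 2.2.2, cluster 2.1.0) the check fails, which matches the explanation that beta APIs need a CLI matching the cluster version.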

@stefanprodan
Member

Can you please post here the output of kubectl get hr -o yaml --show-managed-fields for cert-manager or any of the dependent HRs?

@razvanphp

@razvanphp Your issue isn't related per se. The error you see occurs because the controllers in your cluster are still at 2.1.0 while your CLI has been updated to 2.2.2. The 2.2.x CLI isn't backwards compatible with the 2.1.x release:

Yes, the CLI ensures backwards compatibility only for GA APIs, for beta versions you need a CLI that matches the cluster version.
#4490 (comment)

Indeed, thank you for your answer! Sorry for the noob question...

@darkowlzz
Contributor

@wilmardo can you please provide some detailed instructions to reproduce this issue? Based on your issue description, I tried a few things but couldn't reproduce it. Some detailed steps with example configuration or even a test repository with just the necessary configurations to help reproduce it would be very helpful.

@wilmardo
Author

Yes! Will get back to this; I'm busy with other things at the moment and we have postponed this update for now.
Hopefully at the beginning of next week I will have more time to gather info and reproduce the issue again.

@darkowlzz Will try to get something together, but it might be something very specific to our in-house stuff that is triggering this.

@wilmardo
Author

wilmardo commented Jan 30, 2024

This might be related, although the issue is a bit vague:
fluxcd/helm-controller#891

@darkowlzz
Contributor

darkowlzz commented Feb 3, 2024

Hi, we got another report of a similar issue today on Slack, and it revealed some helpful hints about the issue. I put together a theory for what's causing this and some potential solutions. Refer to fluxcd/helm-controller#884 and fluxcd/helm-controller#885 for details.

I can briefly explain the observations here too. The "dependency is not ready" message may not be the actual issue. More likely, the reconciliation failed once with this error, and a subsequent reconciliation got past the dependency check while the old Ready status persisted on the object. Reconciliation then entered a drift detection and correction loop because some other controller or entity in the cluster reverted or modified the configuration applied by the HelmRelease. fluxcd/helm-controller#855 is an example of this situation and of how it can be handled using drift detection ignore rules. Refer to https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection for detailed docs. Another way to verify the issue is to look at the events and logs associated with the HelmRelease, which should mention the drift. Debug level logs must be enabled to see the details of the detected drift, as described in the docs.
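
As a rough aid for that log check, the filter below narrows helm-controller's JSON log lines to one HelmRelease and keeps only the lines that mention drift. It is a sketch under assumptions: drift_lines is a made-up helper name, the "name" key layout is taken from the log lines quoted earlier in this issue, and matching on the word "drift" is a guess at the wording rather than the controller's exact message.

```shell
#!/bin/sh
# drift_lines NAME (hypothetical helper): read helm-controller JSON log lines
# on stdin, keep those that belong to the HelmRelease called NAME, and of
# those keep only the lines that mention drift (case-insensitive).
drift_lines() {
  grep -F "\"name\":\"$1\"" | grep -i 'drift'
}

# Usage (assumes kubectl access to the flux-system namespace and that debug
# logging has been enabled on helm-controller as the docs describe):
#   kubectl -n flux-system logs deployment/helm-controller | drift_lines cert-manager
```

If the filter prints nothing, it only means no line matched; it does not prove the release is free of drift.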

I've shared some more details about my attempts to reproduce this issue in fluxcd/helm-controller#885 (comment). Based on that, I think the changes in fluxcd/helm-controller#885 should make the situation better and surface the actual issue. It would be great if people who are facing this issue could try the preview image of that PR:

ghcr.io/fluxcd/helm-controller:preview-ac9e62ad@sha256:b3d9cc5e440f0b8ed83c1d5832c6f49a7e648f70e8093e85595902cd4891b9b3

It's an official preview image built using the Flux release infrastructure; see https://github.com/fluxcd/helm-controller/actions/runs/7762775568/job/21173786393.

The preview image can help surface the actual underlying issue. Once the drift issue is resolved, the helm-controller can be reverted to the previous version, as that works fine; it was just the status reporting that made things confusing.

@darkowlzz
Contributor

Hi, Flux v2.2.3 has been released with fluxcd/helm-controller#884 to help with the issue reported here. Instead of the test image I shared in the last comment, please upgrade to Flux v2.2.3 and see if it helps surface the potential drift detection and correction issue described in detail above. The status won't mention drift explicitly yet, but it will show that the HelmRelease is being processed rather than in a failed state. Please check the events and logs of the particular HelmRelease, as documented in https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection, to see if a conflict in drift correction is preventing the release from completing successfully. In a future release, we may add an explicit message about drift correction, as described in fluxcd/helm-controller#885.
