Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Ready condition during drift correction #885

Merged
merged 1 commit into from Apr 29, 2024

Conversation

darkowlzz
Copy link
Contributor

@darkowlzz darkowlzz commented Feb 2, 2024

Related to the issue described in fluxcd/flux2#4524 and observations from #855. And another instance of the issue discussed on slack that revealed that stale Ready=False value results in such scenarios (described in #884 and more below).

This change builds on top of #884, to solve the issue at different levels. Please read #884 for more details about the issue.

Update the Ready condition during drift correction to reflect the current state of reconciliation. Without this, any previous Ready condition value continues to persist on the object. If there was a previous failure due to which Ready=False condition is present on the object, the same value continues to persist if the atomic release reconciliation enters a drift detection and correction loop. Resulting in the status to show inaccurate state of the reconciliation.

Examples of two different scenarios that arise due to this issue:

  • If a release without any dependency is installed, the status shows Ready=True for InstallSucceeded reason. But right after the installation, if a drift is detected the status continues to show the same Ready=True value. There's no indication that a drift correction is going on in the status. The events and logs do show that drift correction is taking place. But it can be confusing to see positive Ready value. Also, since the Ready condition message is copied for Reconciling condition, Reconciling=True with a "Helm install succeeded..." is observed, refer
    if remediation := req.Object.GetActiveRemediation(); remediation == nil || !remediation.RetriesExhausted(req.Object) {
    conditions.MarkReconciling(req.Object, meta.ProgressingWithRetryReason, conditions.GetMessage(req.Object, meta.ReadyCondition))
    return ErrMustRequeue
    }
    . This scenario was reported in Metallb installation with driftDetection: mode: enabled failed to apply revision #855 and it results in the following status:
status:
  conditions:
  - lastTransitionTime: "2023-12-19T11:22:43Z"
    message: Helm install succeeded for release metallb-system/metallb.v1 with chart
      metallb@0.13.12
    observedGeneration: 4
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2023-12-18T20:11:17Z"
    message: Helm install succeeded for release metallb-system/metallb.v1 with chart
      metallb@0.13.12
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-12-18T20:11:17Z"
    message: Helm install succeeded for release metallb-system/metallb.v1 with chart
      metallb@0.13.12
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Released
  • If a release depends on another release, and reconciliation results in dependency not ready error at first, Ready=False condition is added on the object. On subsequent runs, even when the dependencies are ready, the Ready=False condition isn't updated, resulting in stale Ready value until atomic release reconciliation completes. But if the atomic reconciliation enters a drift detection and correction loop, the Ready=False with dependency error persists in the status. This gives the impression that something is wrong with dependency check but based on the logs and events, the controller could be stuck in drift detection and correction loop. This scenario was reported on slack and may also be the cause of the issue reported in Upgrade from Flux 2.1.x to 2.2.2 leaves most HelmReleases in a broken state flux2#4524. This results in the following status:
Status:
  Conditions:
    Last Transition Time:  2024-02-02T13:35:45Z
    Message:               dependency 'platform-flux-system/cilium' is not ready
    Observed Generation:   8
    Reason:                ProgressingWithRetry
    Status:                True
    Type:                  Reconciling
    Last Transition Time:  2024-02-01T15:31:45Z
    Message:               dependency 'platform-flux-system/cilium' is not ready
    Observed Generation:   6
    Reason:                DependencyNotReady
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-01-24T22:22:25Z
    Message:               Helm upgrade succeeded for release platform-external-secrets/external-secrets.v2 with chart external-secrets@0.9.11
    Observed Generation:   3
    Reason:                UpgradeSucceeded
    Status:                True
    Type:                  Released

Updating the Ready condition during drift detection shows the current state of reconciliation, avoiding the confusing scenarios described above. Following status is observed with the fix:

status:
  conditions:
  - lastTransitionTime: "2024-02-02T23:01:02Z"
    message: correcting cluster drift
    observedGeneration: 1
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2024-02-02T23:00:37Z"
    message: correcting cluster drift
    observedGeneration: 1
    reason: Progressing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-02-02T22:49:52Z"
    message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Released

The Ready=Unknown and Reconciling=True with their messages show correctly about the current state of reconciliation.

@darkowlzz
Copy link
Contributor Author

darkowlzz commented Feb 3, 2024

Some more details based on more testing. I used the metallb example configurations to reproduce fluxcd/flux2#4524 based on the understanding of the issue so far. Since metallb has admission webhooks which cause a drift by injecting CA bundle in CRD, it seems to be a good example to reproduce this easily. The report talks about upgrade which means the release already existed. With helm-controller v0.37.2, I installed metallb helmrelease first in a good state without any dependsOn field and proper ignore rules to prevent drift detection.

To introduce the dependency failure, I updated the helmrelease to add a dependsOn reference to a non-existing release and removed the drift detection ignore rules. This resulted in the following status:

status:
  conditions:
  - lastTransitionTime: "2024-02-02T23:52:31Z"
    message: 'unable to get ''default/foo'' dependency: HelmRelease.helm.toolkit.fluxcd.io
      "foo" not found'
    observedGeneration: 4
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2024-02-02T23:50:59Z"
    message: 'unable to get ''default/foo'' dependency: HelmRelease.helm.toolkit.fluxcd.io
      "foo" not found'
    observedGeneration: 3
    reason: DependencyNotReady
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-02-02T23:47:49Z"
    message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Released

Ready=False with DependencyNotReady reason. The message is not the same as in the report but this should be a close enough scenario. Now if I update the release to remove dependsOn field, so as to let the release continue and enter a drift detection-correction loop, the same status was observed, which matches the theory of stale Ready value and drift correction loop.

Now if I update helm-controller with the changes in this PR (which includes #884) using ghcr.io/fluxcd/helm-controller:preview-ac9e62ad@sha256:b3d9cc5e440f0b8ed83c1d5832c6f49a7e648f70e8093e85595902cd4891b9b3 (preview image for the changes in this PR) , the status shows the following:

status:
  conditions:
  - lastTransitionTime: "2024-02-02T23:53:49Z"
    message: correcting cluster drift
    observedGeneration: 4
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2024-02-02T23:53:46Z"
    message: correcting cluster drift
    observedGeneration: 4
    reason: Progressing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-02-02T23:47:49Z"
    message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Released

The status accurately shows what's going on.
Will ask the issue reporters to try this and see what they get.

@darkowlzz darkowlzz force-pushed the update-stale-ready-condition branch 4 times, most recently from b919354 to 59c577a Compare February 5, 2024 08:01
Base automatically changed from update-stale-ready-condition to main February 5, 2024 08:17
Update the Ready condition during drift correction to reflect the
current state of reconciliation. Without this, any previous Ready
condition value continues to persist on the object. If there was a
previous failure due to which Ready=False condition is present on the
object, the same value continues to persist if the atomic release
reconciliation enters a drift detection and correction loop. Resulting
in the status to show inaccurate state of the reconciliation.

Examples of two different scenarios that arise due to this issue:
- If a release without any dependency is installed, the status shows
  Ready=True for InstallSucceeded reason. But right after the
  installation, if a drift is detected the status continues to show the
  same Ready=True value. There's no indication that a drift correction
  is going on in the status. The events and logs do show that drift
  correction is taking place. But it can be confusing to see positive
  Ready value. Also, since the Ready condition message is copied for
  Reconciling condition, Reconciling=True with a "Helm install
  succeeded..." is seen.
- If a release depends on another release, and reconciliation results in
  dependency not ready error at first, Ready=False condition is added on
  the object. On subsequent runs, even when the dependencies are ready,
  the Ready=False condition isn't updated, resulting in stale Ready
  value until atomic release reconciliation completes. But if the atomic
  reconciliation enters a drift detection and correction loop, the
  Ready=False with dependency error persists in the status. This gives
  the impression that something is wrong with dependency check but based
  on the logs and events, the controller could be stuck in drift
  detection and correction loop.

Updating the Ready condition during drift detection shows the current
state of reconciliation, avoiding the confusing scenarios described
above.

Signed-off-by: Sunny <github@darkowlzz.space>
@darkowlzz
Copy link
Contributor Author

I just did another manual testing for this to see what happens to the ready condition with this change once the drift gets corrected successfully, since the focus before was on getting the condition correct during the failure.

Status during the drift correction failure:

  status:
    conditions:
    - lastTransitionTime: "2024-04-29T10:44:21Z"
      message: correcting cluster drift
      observedGeneration: 1
      reason: ProgressingWithRetry
      status: "True"
      type: Reconciling
    - lastTransitionTime: "2024-04-29T10:43:54Z"
      message: correcting cluster drift
      observedGeneration: 1
      reason: Progressing
      status: Unknown
      type: Ready
    - lastTransitionTime: "2024-04-29T10:43:54Z"
      message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
      observedGeneration: 1
      reason: InstallSucceeded
      status: "True"
      type: Released
    helmChart: default/default-metallb
    history:
    - chartName: metallb
      chartVersion: 0.13.12
      configDigest: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
      digest: sha256:92dd7f2e21d9481ee317110bd887fd218cc6972cb3717b52a32e448ce6204f4b
      firstDeployed: "2024-04-29T10:43:20Z"
      lastDeployed: "2024-04-29T10:43:20Z"
      name: metallb
      namespace: default
      status: deployed
      version: 1
    lastAttemptedConfigDigest: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    lastAttemptedGeneration: 1
    lastAttemptedReleaseAction: install
    lastAttemptedRevision: 0.13.12
    observedGeneration: -1
    storageNamespace: default

Ready condition status is Unknown with message correcting cluster drift.

After fixing the configuration to allow drift correction:

  status:
    conditions:
    - lastTransitionTime: "2024-04-29T10:46:39Z"
      message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
      observedGeneration: 2
      reason: InstallSucceeded
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-04-29T10:43:54Z"
      message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
      observedGeneration: 1
      reason: InstallSucceeded
      status: "True"
      type: Released
    helmChart: default/default-metallb
    history:
    - chartName: metallb
      chartVersion: 0.13.12
      configDigest: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
      digest: sha256:92dd7f2e21d9481ee317110bd887fd218cc6972cb3717b52a32e448ce6204f4b
      firstDeployed: "2024-04-29T10:43:20Z"
      lastDeployed: "2024-04-29T10:43:20Z"
      name: metallb
      namespace: default
      status: deployed
      version: 1
    lastAppliedRevision: 0.13.12
    lastAttemptedConfigDigest: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    lastAttemptedGeneration: 2
    lastAttemptedReleaseAction: install
    lastAttemptedRevision: 0.13.12
    observedGeneration: 2
    storageNamespace: default

Ready status condition gets updated with correct values, status is True and message is Helm install succeeded for release....

@darkowlzz darkowlzz marked this pull request as ready for review April 29, 2024 10:54
@darkowlzz darkowlzz force-pushed the drift-correction-ready-condition branch from 9def86e to 56478cf Compare April 29, 2024 11:13
@stefanprodan stefanprodan added the bug Something isn't working label Apr 29, 2024
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @darkowlzz 🏅

@stefanprodan stefanprodan merged commit b31701e into main Apr 29, 2024
7 checks passed
@stefanprodan stefanprodan deleted the drift-correction-ready-condition branch April 29, 2024 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants