
fix: bundledeployment install failures should eventually settle #596

Conversation

@joelanford (Member) commented Nov 4, 2022

If a BD managed by the plain provisioner fails to install or upgrade, the plain provisioner performs a rollback to restore the state that existed before the install/upgrade was attempted.

When these errors occur, the controller enters an exponential backoff, and subsequent install/upgrade attempts are made. Initial backoff attempts can start quickly enough that the rollback from the previous failed attempt has not yet fully completed in the background. For example, a namespace that was deleted in a previous rollback may still be terminating by the time the next reconcile attempt happens.

In these cases, we expect the exponential backoff to eventually wait long enough for these cleanup issues to have resolved themselves such that continued attempts stop interfering with the cleanup from the previous attempt.
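For context, here is a minimal sketch of why the earliest retries fire almost immediately, assuming the controller uses controller-runtime's default workqueue rate limiter (a per-item exponential backoff starting at 5ms and capped at 1000s); the item key below is illustrative:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The same per-item backoff that controller-runtime's default rate
	// limiter builds on: base delay 5ms, doubling up to a 1000s cap.
	rl := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	for i := 0; i < 6; i++ {
		// Prints 5ms, 10ms, 20ms, 40ms, 80ms, 160ms: the early requeues
		// land long before a terminating namespace can finish deleting.
		fmt.Println(rl.When("my-bundledeployment"))
	}
}
```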

We report these errors in the BD "Installed" condition. However, this presents a problem: when the status changes (from real failure to ephemeral failure), a controller handler requeues the BD for immediate reconciliation. This partially defeats the exponential backoff. The result is three reconcile attempts per backoff interval (rather than the expected one):

  • Attempt 1: fails for the original reason; the provisioner rolls back the failure and updates the BD status to report the failure reason.
  • Attempt 2: (happens immediately due to the status update) fails again, this time due to a transient issue with the cleanup from attempt 1; the provisioner rolls back the failure and again updates the BD status, this time to report the transient error.
  • Attempt 3: (again happens immediately due to another status update) fails again for the same reason as attempt 2, but this time the failure message is identical, so there is no status update and no immediate reconcile. After attempt 3, the backoff timer is once again in effect.

This is problematic because the BD spends most of its time with its status set to the result of attempt 3, which gives users no actionable information about how to resolve the failure.
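For context only: a common way to stop a controller from requeueing itself on status-only updates is a generation-change predicate, since status writes do not bump metadata.generation. This PR takes a different approach, described below; the reconciler type and rukpak import path in this sketch are assumptions:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	rukpakv1alpha1 "github.com/operator-framework/rukpak/api/v1alpha1" // assumed path
)

func (r *BundleDeploymentReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// GenerationChangedPredicate drops Update events whose
		// metadata.generation is unchanged (i.e. status-only updates),
		// so the controller would no longer requeue on its own writes.
		For(&rukpakv1alpha1.BundleDeployment{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}
```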

To solve this, this commit introduces waits (with 30-second timeouts) to allow the failed install or upgrade to be cleaned up before the reconcile function returns, which means that subsequent reconcile attempts will generally (unless the cleanup timed out) start from a clean slate.
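A minimal sketch of the kind of wait described above, using apimachinery's polling helper; the helper name and the namespace-termination example are hypothetical, not the PR's actual code:

```go
package controllers

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForNamespaceGone polls until a namespace deleted during rollback has
// finished terminating, giving up after 30 seconds so the reconcile cannot
// block forever.
func waitForNamespaceGone(ctx context.Context, c client.Client, name string) error {
	return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		err := c.Get(ctx, client.ObjectKey{Name: name}, &corev1.Namespace{})
		if apierrors.IsNotFound(err) {
			return true, nil // cleanup finished; the next attempt starts clean
		}
		// err == nil means the namespace still exists (terminating): poll
		// again. Any other error aborts the wait and is surfaced.
		return false, err
	})
}
```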

The introduction of these waits also means we need to run multiple controller workers, so that a blocked wait is less likely to starve the controller of progress on its reconcile queue.
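A sketch of what that looks like with controller-runtime's MaxConcurrentReconciles option; the worker count and the reconciler/import names are illustrative assumptions:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	rukpakv1alpha1 "github.com/operator-framework/rukpak/api/v1alpha1" // assumed path
)

func (r *BundleDeploymentReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rukpakv1alpha1.BundleDeployment{}).
		// With several workers, one reconcile blocked in a 30s cleanup
		// wait no longer stalls every other BundleDeployment; the
		// workqueue still guarantees a given object is handled by only
		// one worker at a time.
		WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
		Complete(r)
}
```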

@joelanford joelanford requested a review from a team as a code owner November 4, 2022 21:33
@joelanford joelanford force-pushed the avoid-bd-failure-requeue-cycle branch from b92249e to 9722148 on November 4, 2022 22:53
@akihikokuroda (Member)

The changes look good to me. Could I have the example bundle to reproduce this issue?

@timflannagan (Contributor)

We talked about this in Slack. See https://kubernetes.slack.com/archives/C038B7MF75M/p1667597668853099 for more information.

@joelanford (Member, Author)

/hold

I think I'm going to refactor this to use operator-framework/helm-operator-plugins#196 when it merges.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 11, 2022
@joelanford (Member, Author)

/hold cancel

I've refactored this to make use of the updates that landed in operator-framework/helm-operator-plugins#196.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2022
@akihikokuroda (Member)

Is it safe to run multiple controller threads? I don't have any specific concerns, but should we go through the code to make sure it's OK? We should probably add a statement to the contributor docs that the BD controller must be thread-safe.

@github-actions (bot)

This PR has become stale because it has been open for 30 days with no activity. Please update this PR or remove the lifecycle/stale label before it is automatically closed in 30 days. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2022
@github-actions (bot)

This PR has been closed as no updates were detected after 30 days of being stale. Please feel free to reopen this PR if necessary.

@github-actions github-actions bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 23, 2023
@github-actions github-actions bot closed this Jan 23, 2023
@joelanford joelanford reopened this Mar 29, 2023
@joelanford joelanford added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 29, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2023
@openshift-merge-robot (bot)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@joelanford (Member, Author)

In a recent discussion, I suggested that we may not want to attempt an uninstall/rollback when an install/upgrade fails. If we stopped doing that, I think this PR would become obsolete, because the problem it fixes is caused by the uninstall/rollback/retry loop.

@joelanford joelanford closed this May 21, 2024