
fix: bundledeployment install failures should eventually settle #596

Conversation

@joelanford (Member) commented Nov 4, 2022

If a BD managed by the plain provisioner fails to install or upgrade, the plain provisioner performs a rollback to restore the state that existed before the install/upgrade was attempted.

When these errors occur, the controller enters an exponential backoff, and subsequent install/upgrade attempts are made. Initial backoff attempts can start quickly enough that the rollback from the previous failed attempt has not yet fully completed in the background. For example, a namespace that was deleted in a previous rollback may still be terminating by the time the next reconcile attempt happens.

In these cases, we expect the exponential backoff to eventually wait long enough for these cleanup issues to have resolved themselves such that continued attempts stop interfering with the cleanup from the previous attempt.
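For context, here is a minimal sketch of why the earliest retries fire almost immediately, assuming the controller uses controller-runtime's default workqueue rate limiter (a per-item exponential backoff starting at 5ms and capped at 1000s); the item key below is illustrative:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The same per-item backoff that controller-runtime's default rate
	// limiter builds on: base delay 5ms, doubling up to a 1000s cap.
	rl := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	for i := 0; i < 6; i++ {
		// Prints 5ms, 10ms, 20ms, 40ms, 80ms, 160ms: the early requeues
		// land long before a terminating namespace can finish deleting.
		fmt.Println(rl.When("my-bundledeployment"))
	}
}
```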

We report these errors in the BD "Installed" condition. However, this presents a problem: when the status changes (from real failure to ephemeral failure), a controller handler requeues the BD for immediate reconciliation. This partially defeats the exponential backoff. The result is three reconcile attempts per backoff interval (rather than the expected one):

  • Attempt 1: fails for the original reason; the provisioner rolls back the failure and updates the BD status to report the failure reason.
  • Attempt 2: (happens immediately due to the status update) fails again, this time due to a transient issue with the cleanup from attempt 1; the provisioner rolls back the failure and again updates the BD status, this time to report the transient error.
  • Attempt 3: (again happens immediately due to another status update) fails again for the same reason as attempt 2, but this time the failure message is identical, so there is no status update and no immediate reconcile. After attempt 3, the backoff timer is once again in effect.

This is problematic because the BD spends most of its time with its status set to the result of attempt 3, which gives users no actionable information about how to resolve the failure.
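For context only: a common way to stop a controller from requeueing itself on status-only updates is a generation-change predicate, since status writes do not bump metadata.generation. This PR takes a different approach, described below; the reconciler type and rukpak import path in this sketch are assumptions:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	rukpakv1alpha1 "github.com/operator-framework/rukpak/api/v1alpha1" // assumed path
)

func (r *BundleDeploymentReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// GenerationChangedPredicate drops Update events whose
		// metadata.generation is unchanged (i.e. status-only updates),
		// so the controller would no longer requeue on its own writes.
		For(&rukpakv1alpha1.BundleDeployment{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}
```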

To solve this, this commit introduces waits (with 30-second timeouts) to allow the failed install or upgrade to be cleaned up before the reconcile function returns, which means that subsequent reconcile attempts will generally (unless the cleanup timed out) start from a clean slate.
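A minimal sketch of the kind of wait described above, using apimachinery's polling helper; the helper name and the namespace-termination example are hypothetical, not the PR's actual code:

```go
package controllers

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForNamespaceGone polls until a namespace deleted during rollback has
// finished terminating, giving up after 30 seconds so the reconcile cannot
// block forever.
func waitForNamespaceGone(ctx context.Context, c client.Client, name string) error {
	return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		err := c.Get(ctx, client.ObjectKey{Name: name}, &corev1.Namespace{})
		if apierrors.IsNotFound(err) {
			return true, nil // cleanup finished; the next attempt starts clean
		}
		// err == nil means the namespace still exists (terminating): poll
		// again. Any other error aborts the wait and is surfaced.
		return false, err
	})
}
```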

The introduction of these waits also means we need to run multiple controller workers, so that a blocked wait is less likely to starve the controller of progress on its reconcile queue.
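A sketch of what that looks like with controller-runtime's MaxConcurrentReconciles option; the worker count and the reconciler/import names are illustrative assumptions:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	rukpakv1alpha1 "github.com/operator-framework/rukpak/api/v1alpha1" // assumed path
)

func (r *BundleDeploymentReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rukpakv1alpha1.BundleDeployment{}).
		// With several workers, one reconcile blocked in a 30s cleanup
		// wait no longer stalls every other BundleDeployment; the
		// workqueue still guarantees a given object is handled by only
		// one worker at a time.
		WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
		Complete(r)
}
```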

@joelanford joelanford requested a review from a team as a code owner November 4, 2022 21:33
@joelanford joelanford force-pushed the avoid-bd-failure-requeue-cycle branch from b92249e to 9722148 on November 4, 2022 22:53
@akihikokuroda (Member)

The changes look good to me. Could I have the example bundle to reproduce this issue?

@timflannagan (Contributor)

We talked about this in Slack. See https://kubernetes.slack.com/archives/C038B7MF75M/p1667597668853099 for more information.

@joelanford (Member, Author)

/hold

I think I'm going to refactor this to use operator-framework/helm-operator-plugins#196 when it merges.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 11, 2022
@joelanford (Member, Author)

/hold cancel

I've refactored this to make use of the updates that landed in operator-framework/helm-operator-plugins#196.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2022
@akihikokuroda (Member)

Is it safe to run multiple controller threads? I don't have any specific concerns, but should we go through the code to make sure it's OK? We should probably add a statement to the contributor docs that the BD controller must be thread-safe.

@github-actions (bot)

This PR has become stale because it has been open for 30 days with no activity. Please update this PR or remove the lifecycle/stale label before it is automatically closed in 30 days. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2022
@github-actions (bot)

This PR has been closed as no updates were detected after 30 days of being stale. Please feel free to reopen this PR if necessary.

@github-actions github-actions bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 23, 2023
@github-actions github-actions bot closed this Jan 23, 2023
@joelanford joelanford reopened this Mar 29, 2023
@joelanford joelanford added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 29, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2023
@openshift-merge-robot (bot)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@joelanford (Member, Author)

In a recent discussion, I suggested that we may not want to attempt an uninstall/rollback when an install/upgrade fails. If we stopped doing that, I think this PR would become obsolete, because the problem it fixes is caused by the uninstall/rollback/retry loop.

@joelanford joelanford closed this May 21, 2024