fix: bundledeployment install failures should eventually settle #596
Conversation
Force-pushed from b92249e to 9722148
Changes look good to me. Can I have the example bundle so I can see this issue?
Talked about this in Slack. See https://kubernetes.slack.com/archives/C038B7MF75M/p1667597668853099 for more information.
/hold I think I'm going to refactor this to use operator-framework/helm-operator-plugins#196 when it merges.
…clean up to complete

If a BD managed by the plain provisioner fails to install or upgrade, the plain provisioner performs a rollback to restore the original state from before the install/upgrade was attempted.

When these errors occur, the controller enters an exponential backoff, and subsequent install/upgrade attempts are made. Early backoff attempts can start quickly enough that the rollback from the previous failed attempt has not yet fully completed in the background. For example, a namespace that was deleted in a previous rollback may still be terminating by the time the next reconcile attempt happens. In these cases, we expect the exponential backoff to eventually wait long enough for these cleanup issues to resolve themselves, such that continued attempts stop interfering with the cleanup from the previous attempt.

We report these errors in the BD "Installed" condition. However, this presents a problem. When the status changes (from a real failure to an ephemeral failure), a controller handler requeues the BD for immediate reconciliation. This partially foils the exponential backoff. The result is 3 reconcile attempts per backoff period (rather than the expected 1):

- Attempt 1: fails for the original reason; the provisioner rolls back the failure and updates the BD status to report the failure reason.
- Attempt 2: (happens immediately due to the status update) fails again, this time due to a transient issue with the cleanup from attempt 1; the provisioner rolls back the failure and again updates the BD status, this time to report the transient error.
- Attempt 3: (again happens immediately due to another status update) fails again for the same reason as attempt 2, but this time the failure message is the same, so there is no status update and no immediate reconcile.

After attempt 3, the backoff timer is once again in effect.

This is problematic because the BD spends most of its time with its status set to the result of attempt 3, which gives users no actionable information about how to resolve the failure.

To solve this, this commit introduces waits (with 30 second timeouts) that allow the failed install or upgrade to be cleaned up before the reconcile function returns, which means that subsequent reconcile attempts will generally (unless the cleanup timed out) start from a clean slate.

The introduction of these waits also means we need to run multiple controller threads so that the waits are less likely to starve the controller of making progress against its reconcile queue.

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
Force-pushed from 9722148 to 5c848f6
/hold cancel I've refactored this to make use of the updates that landed in operator-framework/helm-operator-plugins#196
Is it safe to run multiple controller threads? I don't have any specific concerns, but should we go through the code to make sure it's OK? We should probably also add a statement in the contributor docs that the BD controller must be thread-safe.
This PR has become stale because it has been open for 30 days with no activity. Please update this PR or remove the
This PR has been closed as no updates were detected after 30 days of being stale. Please feel free to reopen this PR if necessary.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In a recent discussion, I suggested that we may not want to attempt uninstall/rollback at all when an install/upgrade fails. If we stopped doing that, I think this PR would become obsolete, because the problem it fixes is caused by the uninstall/rollback/retry loop.