osd: Increase wait timeout for osd prepare cleanup #9116
Merged
Description of your changes:
When a reconcile starts for OSDs, the prepare jobs from the previous reconcile are deleted first. The timeout for this deletion was only 40s. When that timeout expired, the reconcile went on to wait for the prepare pod anyway, but that wait could never complete because the OSD prepare was not running in the first place, so the reconcile waited indefinitely. In the reported issue, the osd prepare jobs were in fact deleted successfully; the timeout just wasn't long enough. Pods need at least a minute to be forcefully deleted, so this change increases the timeout to 90s to leave some extra buffer.
At the same time, if the osd prepare job deletion fails, we should treat it as a reconcile error instead of ignoring it.
As described in #8558, the operator log shows that the 40s timeout wasn't enough:
Then the operator waited forever with this message:
Which issue is resolved by this Pull Request:
Resolves #8558
Checklist:
`make codegen` has been run to update object specifications, if necessary.