
osd: Increase wait timeout for osd prepare cleanup #9116

Merged: 1 commit merged into rook:master on Nov 5, 2021

Conversation

@travisn (Member) commented on Nov 5, 2021

Description of your changes:
When a reconcile is started for OSDs, the prepare jobs left over from a previous reconcile are first deleted. The timeout for the OSD prepare job deletion was only 40s. After that timeout expires, the reconcile gives up on the deletion and instead waits for the old job to complete, which of course never happens since the OSD prepare job is no longer running, causing the reconcile to wait indefinitely. In the reported issue, the OSD prepare jobs were actually deleted successfully; the timeout just wasn't long enough. Pods need at least a minute to be forcefully deleted, so we increase the timeout to 90s to give it some extra buffer.

At the same time, if the osd prepare job deletion fails, we should treat it as a reconcile error instead of ignoring it.

As described in #8558, the operator log shows that the 40s timeout wasn't enough:

2021-10-14 11:30:56.979921 I | op-k8sutil: batch job rook-ceph-osd-prepare-817e216fd497e6c6ffe59bff22905fd7 still exists
2021-10-14 11:30:58.980334 W | op-k8sutil: gave up waiting for batch job rook-ceph-osd-prepare-817e216fd497e6c6ffe59bff22905fd7 to be deleted
2021-10-14 11:30:59.049129 I | op-osd: letting preexisting OSD provisioning job run to completion for node "kaas-node-7157e7a5-2749-4564-bd2e-274ac669b9f7"

Then the operator waited forever with this message:

2021-10-14 11:31:59.049399 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
2021-10-14 11:32:59.050336 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
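
The fix bumps the deletion wait to 30 retries with a 3-second sleep (90s total). The sketch below is illustrative only, not the actual Rook code: the helper name waitForJobDeletion and the client-go wiring are assumptions, but it shows the intended behavior, including returning an error (to be treated as a reconcile error) when the job still exists after the timeout.

package osd // hypothetical package, for illustration only

import (
	"context"
	"fmt"
	"time"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForJobDeletion polls until the named batch job is gone.
// 30 retries * 3s = 90s, which covers the minute or more a pod can
// take to be forcefully deleted, plus some buffer.
func waitForJobDeletion(ctx context.Context, clientset kubernetes.Interface, namespace, name string) error {
	retries := 30
	sleepInterval := 3 * time.Second
	for i := 0; i < retries; i++ {
		_, err := clientset.BatchV1().Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
		if kerrors.IsNotFound(err) {
			// The previous prepare job is fully gone; safe to start a new one.
			return nil
		}
		if err != nil {
			return fmt.Errorf("failed to check batch job %q: %v", name, err)
		}
		time.Sleep(sleepInterval)
	}
	// Surface the timeout as an error so the reconcile fails and retries,
	// instead of silently waiting for a job that will never complete.
	return fmt.Errorf("gave up waiting for batch job %q to be deleted", name)
}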

Which issue is resolved by this Pull Request:
Resolves #8558

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

Comment on lines +106 to +109
retries := 30
sleepInterval := 3 * time.Second
A reviewer (Member) commented:

The only thing I could suggest is adding a comment here that explains this case so we don't change this in the future.

I'm not sure if it would be truly useful to add unit/integration tests for this given that it is really about the behavior of kubernetes and the code's understanding of it.
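
For example, a comment along the lines the reviewer suggests might read as follows (illustrative wording only, not the committed change):

// Pods can take at least a minute to be forcefully deleted, so wait up to
// 90s (30 retries * 3s) for the previous osd prepare job to be removed.
// Do not lower this timeout; see issue #8558 for the failure it caused.
retries := 30
sleepInterval := 3 * time.Second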

When a reconcile is started for OSDs, the prepare jobs are first
deleted from a previous reconcile. The timeout for the osd prepare
job deletion was only 40s. After that timeout, the reconcile attempts
to continue waiting for the pod, but of course will never complete
since the OSD prepare was not running in the first place, causing the
reconcile to wait indefinitely. In the reported issue, the osd prepare
jobs were actually deleted successfully, the timeout just wasn't long
enough. Pods need at least a minute to be forcefully deleted,
so we increase the timeout to 90s to give it some extra buffer.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
@travisn merged commit 6f4a68c into rook:master on Nov 5, 2021
mergify bot added a commit that referenced this pull request Nov 6, 2021
osd: Increase wait timeout for osd prepare cleanup (backport #9116)
@travisn deleted the osd-prepare-cleanup-timeout branch on February 24, 2022
Successfully merging this pull request may close this issue: rook operator fails to properly manage one osd-prepare-job (#8558)