
osd: Increase wait timeout for osd prepare cleanup #9116

Merged: 1 commit merged into rook:master on Nov 5, 2021

Conversation

@travisn (Member) commented on Nov 5, 2021

Description of your changes:
When a reconcile is started for OSDs, the prepare jobs left over from a previous reconcile are first deleted. The timeout for the OSD prepare job deletion was only 40s. After that timeout expires, the reconcile gives up on the deletion and instead waits for the old job to complete, which of course never happens since the OSD prepare job is no longer running, causing the reconcile to wait indefinitely. In the reported issue, the OSD prepare jobs were actually deleted successfully; the timeout just wasn't long enough. Pods need at least a minute to be forcefully deleted, so we increase the timeout to 90s to give it some extra buffer.

At the same time, if the osd prepare job deletion fails, we should treat it as a reconcile error instead of ignoring it.

As described in #8558, the operator log shows that the 40s timeout wasn't enough:

2021-10-14 11:30:56.979921 I | op-k8sutil: batch job rook-ceph-osd-prepare-817e216fd497e6c6ffe59bff22905fd7 still exists
2021-10-14 11:30:58.980334 W | op-k8sutil: gave up waiting for batch job rook-ceph-osd-prepare-817e216fd497e6c6ffe59bff22905fd7 to be deleted
2021-10-14 11:30:59.049129 I | op-osd: letting preexisting OSD provisioning job run to completion for node "kaas-node-7157e7a5-2749-4564-bd2e-274ac669b9f7"

Then the operator waited forever with this message:

2021-10-14 11:31:59.049399 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
2021-10-14 11:32:59.050336 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
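
The fix bumps the deletion wait to 30 retries with a 3-second sleep (90s total). The sketch below is illustrative only, not the actual Rook code: the helper name waitForJobDeletion and the client-go wiring are assumptions, but it shows the intended behavior, including returning an error (to be treated as a reconcile error) when the job still exists after the timeout.

package osd // hypothetical package, for illustration only

import (
	"context"
	"fmt"
	"time"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForJobDeletion polls until the named batch job is gone.
// 30 retries * 3s = 90s, which covers the minute or more a pod can
// take to be forcefully deleted, plus some buffer.
func waitForJobDeletion(ctx context.Context, clientset kubernetes.Interface, namespace, name string) error {
	retries := 30
	sleepInterval := 3 * time.Second
	for i := 0; i < retries; i++ {
		_, err := clientset.BatchV1().Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
		if kerrors.IsNotFound(err) {
			// The previous prepare job is fully gone; safe to start a new one.
			return nil
		}
		if err != nil {
			return fmt.Errorf("failed to check batch job %q: %v", name, err)
		}
		time.Sleep(sleepInterval)
	}
	// Surface the timeout as an error so the reconcile fails and retries,
	// instead of silently waiting for a job that will never complete.
	return fmt.Errorf("gave up waiting for batch job %q to be deleted", name)
}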

Which issue is resolved by this Pull Request:
Resolves #8558

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

Comment on lines +106 to +109
retries := 30
sleepInterval := 3 * time.Second
A reviewer (Member) commented:

The only thing I could suggest is adding a comment here that explains this case so we don't change this in the future.

I'm not sure if it would be truly useful to add unit/integration tests for this given that it is really about the behavior of kubernetes and the code's understanding of it.
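
For example, a comment along the lines the reviewer suggests might read as follows (illustrative wording only, not the committed change):

// Pods can take at least a minute to be forcefully deleted, so wait up to
// 90s (30 retries * 3s) for the previous osd prepare job to be removed.
// Do not lower this timeout; see issue #8558 for the failure it caused.
retries := 30
sleepInterval := 3 * time.Second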

When a reconcile is started for OSDs, the prepare jobs are first
deleted from a previous reconcile. The timeout for the osd prepare
job deletion was only 40s. After that timeout, the reconcile attempts
to continue waiting for the pod, but of course will never complete
since the OSD prepare was not running in the first place, causing the
reconcile to wait indefinitely. In the reported issue, the osd prepare
jobs were actually deleted successfully, the timeout just wasn't long
enough. Pods need at least a minute to be forcefully deleted,
so we increase the timeout to 90s to give it some extra buffer.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
@travisn merged commit 6f4a68c into rook:master on Nov 5, 2021
mergify bot added a commit that referenced this pull request Nov 6, 2021
osd: Increase wait timeout for osd prepare cleanup (backport #9116)
@travisn deleted the osd-prepare-cleanup-timeout branch on February 24, 2022
Successfully merging this pull request may close this issue: rook operator fails to properly manage one osd-prepare-job (#8558)