osd: Legacy LVM-based OSDs on PVCs crash on resize init container (backport #14100) #14103
OSDs on LVM-mode PVCs are failing to come up and are crashing in the expand-bluefs init container. The workaround found to avoid the crash and allow the OSDs to start was simply to remove that init container. We now disable the OSD resize for this case so that others do not hit the same crash during upgrade.
I am not able to repro this issue with the currently available types of OSDs. All new OSDs on PVCs are created in raw mode, even when they are encrypted or have a metadata device. However, this could affect old OSDs that were created long ago (as far back as Rook v1.1) and have been upgraded since.
An error is first seen in the "osd init" init container, where an argument is missing:
However, this does not fail the container, and the other containers are allowed to continue starting. The expand container then fails with the error below because the ceph config was never initialized, a consequence of the earlier init container issue:
This seems related to the removal of some variables that were thought to be obsolete in #11331. However, since we can't find a repro and confirm that adding those back actually fixes the issue, the most reliable and low-risk solution seems to be to remove the resize init container completely for these OSDs, and then encourage users to replace these legacy OSDs.
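For reviewers, the intent of the change can be sketched roughly as below. This is a minimal illustration, not the actual Rook code: the `osdInfo` type, its fields, and `shouldAddExpandBluefs` are hypothetical names, and the sketch assumes that resize only applies to PVC-backed OSDs and that each OSD's mode ("raw" or "lvm") is already known.

```go
package main

import "fmt"

// osdInfo is a hypothetical, simplified stand-in for the OSD properties
// relevant to this change. It is NOT Rook's real type.
type osdInfo struct {
	ID     int
	OnPVC  bool   // true if the OSD is backed by a PVC
	CVMode string // assumed to be "raw" or "lvm"
}

// shouldAddExpandBluefs reports whether the expand-bluefs (resize) init
// container would be added to the OSD pod in this sketch.
func shouldAddExpandBluefs(osd osdInfo) bool {
	if !osd.OnPVC {
		// Assumption: resize is only relevant for PVC-backed OSDs.
		return false
	}
	// Legacy LVM-mode OSDs on PVCs crash in expand-bluefs, so the
	// resize init container is skipped for them entirely.
	return osd.CVMode != "lvm"
}

func main() {
	osds := []osdInfo{
		{ID: 0, OnPVC: true, CVMode: "raw"},
		{ID: 1, OnPVC: true, CVMode: "lvm"},
	}
	for _, osd := range osds {
		fmt.Printf("osd.%d (mode=%s): add expand-bluefs = %v\n",
			osd.ID, osd.CVMode, shouldAddExpandBluefs(osd))
	}
}
```

The trade-off is that legacy LVM-mode OSDs on PVCs lose the ability to be resized, which seems acceptable given that users are encouraged to replace them.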
Checklist:
This is an automatic backport of pull request #14100 done by [Mergify](https://mergify.com).