osd: Legacy LVM-based OSDs on PVCs crash on resize init container (backport #14100) #14103

Closed

mergify[bot]
@mergify mergify bot commented Apr 22, 2024

OSDs on LVM-mode PVCs are failing to come up and crashing in the expand-bluefs init container. Removing that init container was found to work around the crash and allow the OSDs to start. This change disables the OSD resize for this case so that others do not hit the same crash during upgrade.

I am not able to repro this issue with the currently available types of OSDs. All new OSDs on PVCs are created in raw mode, even when they are encrypted or have a metadata device. But this could affect old OSDs that were created long ago (as far back as Rook v1.1) and have been upgraded since.

An error is first seen in the "osd init" init container, where an argument is missing:

Error: ceph-username is required for osd
rook error: ceph-username is required for osd
Usage:
  rook ceph osd init [flags]

But this does not fail the container, since other containers are allowed to continue starting. The expand container then fails with the error below, because the previous init container issue left the ceph config uninitialized:

inferring bluefs devices from bluestore path
unable to read label for /var/lib/ceph/osd/ceph-1: (2) No such file or directory
2024-04-04T13:22:38.461+0000 7f41cddbf900 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (2) No such file or directory

This seems related to the removal of some variables that were thought to be obsolete in #11331. However, since we can't find a repro and confirm that adding those back actually fixes the issue, the most reliable and lowest-risk solution seems to be to remove the resize init container completely for these OSDs, and then encourage users to replace these legacy OSDs.
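As a rough illustration of the approach described above, the decision could be sketched as below. This is not the actual Rook change: the `osdInfo` type, its `OnPVC` and `CVMode` fields, and the `needsExpandInit` and `withExpandInit` helpers are hypothetical names made up for this sketch, and it assumes the resize init container only applies to PVC-backed OSDs. Only the `expand-bluefs` container name comes from the description above.

```go
package main

// osdInfo is a hypothetical, simplified stand-in for the metadata kept per OSD.
type osdInfo struct {
	ID     int
	OnPVC  bool   // assumption: true when the OSD is backed by a PVC
	CVMode string // assumption: "raw" for current OSDs, "lvm" for legacy ones
}

// needsExpandInit reports whether the resize init container should be added.
// Legacy LVM-mode OSDs on PVCs are skipped so they no longer crash there.
func needsExpandInit(osd osdInfo) bool {
	if !osd.OnPVC {
		return false // assumption: resize is only relevant for PVC-backed OSDs
	}
	return osd.CVMode != "lvm"
}

// withExpandInit returns a copy of base with "expand-bluefs" appended
// only when the OSD qualifies for resize.
func withExpandInit(base []string, osd osdInfo) []string {
	out := append([]string{}, base...)
	if needsExpandInit(osd) {
		out = append(out, "expand-bluefs")
	}
	return out
}
```

With this shape, a legacy LVM-mode OSD simply gets the same init containers it had before minus the resize step, which matches the manual workaround of removing that container.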

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

This is an automatic backport of pull request #14100 done by [Mergify](https://mergify.com).

OSDs on LVM-mode PVCs are failing to come up and crashing
in the expand-bluefs init container. To avoid the crash
and allow the OSDs to start, a workaround was found to
simply remove that init container. Now we disable the
OSD resize for this case to avoid others hitting this
during upgrade as well.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
(cherry picked from commit acd7b4f)
@satoru-takeuchi
Member

Both TestSmokeSuite and TestCephHelmSuite consistently failed with the following error.

Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: error validating "": error validating data: ValidationError(CephCluster.spec): unknown field "csi" in io.rook.ceph.v1.CephCluster.spec: exit status 1

This problem seems to be unrelated to this PR.

I confirmed that the previous backport PR to the 1.12 release branch was merged without this kind of problem.

@travisn Do you have any thoughts?

@travisn
Member

travisn commented Apr 22, 2024

I see the smoke and helm suites are failing in the release-1.12 history. Agreed, it is unrelated. But we don't really plan on another 1.12 release, so I'm going to close this PR as I don't see it as critical for 1.12. If we need a 1.12 release in the future, we will need to investigate the CI.

@travisn travisn closed this Apr 22, 2024
@mergify mergify bot deleted the mergify/bp/release-1.12/pr-14100 branch April 22, 2024 16:53