
osd: handle removal of encrypted osd deployment #9434

Merged: 1 commit into rook:master on Dec 21, 2021

Conversation

@leseb (Member) commented Dec 15, 2021

This handles a tricky scenario where the OSD deployment is manually
removed and the OSD never recovers. This is unlikely to happen, but the
OSD should still be able to run after that action. Essentially, after a
manual deletion we need to run the prepare job again to re-hydrate the
OSD information so that the OSD deployment can be deployed.
With encryption it is a little trickier: running ceph-volume list
against the main block device won't return anything, so we need to
target the encrypted block instead.

There is another case this PR does not handle: the OSD deployment is
removed and then the node is restarted, which means the encrypted
container is no longer open. Opening it again requires more work, like
writing the key to the filesystem (if it is not coming from a
Kubernetes secret, e.g., it lives in a KMS such as Vault) and then
running luksOpen. This is an extreme corner case, probably not worth
worrying about for now.

Signed-off-by: Sébastien Han seb@redhat.com
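The device-selection detail described above can be sketched as follows. This is a simplified illustration, not Rook's actual code: the types, the dm-crypt mapper naming scheme, and the function names are all assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// osdInfo is a simplified stand-in for the metadata the prepare job
// re-hydrates after a manual deletion; the fields are illustrative.
type osdInfo struct {
	BlockPath string // the main (raw) block device, e.g. /dev/sdb
	Encrypted bool
}

// mapperName derives a dm-crypt mapper name from the block path.
// The naming scheme here is hypothetical, for illustration only.
func mapperName(blockPath string) string {
	return strings.TrimPrefix(strings.ReplaceAll(blockPath, "/", "-"), "-") + "-block-dmcrypt"
}

// deviceToList picks the device a "ceph-volume raw list" style call
// should be pointed at: for an encrypted OSD, listing the main block
// returns nothing, so the opened encrypted (mapper) device is
// targeted instead.
func deviceToList(o osdInfo) string {
	if o.Encrypted {
		return "/dev/mapper/" + mapperName(o.BlockPath)
	}
	return o.BlockPath
}

func main() {
	fmt.Println(deviceToList(osdInfo{BlockPath: "/dev/sdb"}))
	fmt.Println(deviceToList(osdInfo{BlockPath: "/dev/sdc", Encrypted: true}))
}
```

The point of the sketch is only the branch: an unencrypted OSD is listed by its raw device, while an encrypted one must be listed through its opened mapper device.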

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@leseb leseb force-pushed the bz-2032656 branch 5 times, most recently from b5805ee to f2c0384 Compare December 16, 2021 14:03
@leseb leseb marked this pull request as ready for review December 16, 2021 14:37
@BlaineEXE (Member) commented:

As far as the code goes, the logic here is a little hard to follow, but I trust that it's the right thing to do. I'm having a bit of trouble understanding the underlying case and failure mode.

Notably, I'm not quite following the below text from the commit message:

... so that the OSD deployment can be deployed.

logger.Infof("failed to get devices already provisioned by ceph-volume. %v", err)
}
osds = append(osds, lvmOsds...)
} else {
@BlaineEXE (Member) commented Dec 16, 2021:

Why do we need this to be an else, versus just continuing on after the if?

@leseb (Member Author) replied:

I just felt that the else clearly separates the two branches; it makes that large code block easier to read and parse.

@BlaineEXE (Member) commented:

Also, a bit of a complaint: what good is ceph-volume if we have to do so much work to get around it and handle so many special cases?

@leseb (Member Author) commented Dec 17, 2021

As far as the code goes, the logic here is a little hard to follow, but I trust that it's the right thing to do. I'm having a bit of trouble understanding the underlying case and failure mode.

Notably, I'm not quite following the below text from the commit message:

... so that the OSD deployment can be deployed.

Let's say deploy/rook-ceph-osd-0 is removed manually. The operator reconciles the CephCluster on deletion events for the OSD deployment. At this stage, the operator must run the prepare job again to re-initialize the OSD information in order to run the OSD deployment again.
That's what I meant by this sentence. Hope it clarifies.
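That flow can be sketched as a toy reconciler step. To be clear, this is an illustration of the description above, not the operator's real controller code; the type, field, and function names are assumptions.

```go
package main

import "fmt"

// osdState models the facts the reconciler looks at: whether the OSD
// deployment still exists, and whether the OSD information has been
// re-hydrated by a prepare job. Names are illustrative only.
type osdState struct {
	DeploymentName   string
	DeploymentExists bool
	InfoHydrated     bool
}

// nextAction mirrors the described flow: a deleted deployment first
// needs the prepare job to re-hydrate the OSD info, and only then can
// the OSD deployment be created again.
func nextAction(s osdState) string {
	switch {
	case s.DeploymentExists:
		return "nothing to do"
	case !s.InfoHydrated:
		return "run prepare job for " + s.DeploymentName
	default:
		return "recreate " + s.DeploymentName
	}
}

func main() {
	// deploy/rook-ceph-osd-0 was deleted manually, info not yet re-hydrated
	s := osdState{DeploymentName: "rook-ceph-osd-0"}
	fmt.Println(nextAction(s))
}
```

The ordering is the whole point: the prepare job runs first so the deployment creation that follows has OSD information to work with.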

@leseb (Member Author) commented Dec 17, 2021

Also, a bit of a complaint: what is the good of ceph-volume if we have to do utterly so much work to work around it and have so many special cases?

The problem is that the raw/pvc case in c-v has been left behind a little bit and the encryption piece is only used by Rook as far as I can tell. Also, it's hard for c-v to know all the details we do in Rook.

@leseb leseb requested a review from BlaineEXE December 20, 2021 09:22
@travisn (Member) left a comment:

Have we seen this issue happen in production? It seems like it's only a test scenario, where someone intentionally deleted an OSD deployment and we need to go create it again.
A couple of small suggestions, and approving since I'll be out this week...

.github/workflows/canary-integration-test.yml (outdated; resolved)
pkg/operator/ceph/cluster/osd/create.go (resolved)
@leseb (Member Author) commented Dec 21, 2021

Comments have been addressed.

@leseb leseb merged commit 77b18f0 into rook:master Dec 21, 2021
@leseb leseb deleted the bz-2032656 branch December 21, 2021 15:14
mergify bot added a commit that referenced this pull request Dec 21, 2021
osd: handle removal of encrypted osd deployment (backport #9434)
3 participants