
osd: handle removal of encrypted osd deployment #9434

Merged: 1 commit into rook:master on Dec 21, 2021

Conversation

@leseb (Member) commented Dec 15, 2021

This handles a tricky scenario where the OSD deployment is manually
removed and the OSD never recovers. This is unlikely to happen, but the
OSD should still be able to run after that action. Essentially, after a
manual deletion we need to run the prepare job again to re-hydrate the
OSD information so that the OSD deployment can be deployed.
With encryption it is a little trickier: running ceph-volume list
against the main block device won't return anything, so we need to
target the encrypted block instead.

There is another case this PR does not handle: the OSD deployment is
removed and then the node is restarted, which means the encrypted
container is no longer open. Opening it again requires more work, like
writing the key to the filesystem (if it is not coming from a
Kubernetes secret, e.g., it lives in a KMS such as Vault) and then
running luksOpen. This is an extreme corner case, probably not worth
worrying about for now.

Signed-off-by: Sébastien Han seb@redhat.com
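The device-selection detail described above can be sketched as follows. This is a simplified illustration, not Rook's actual code: the types, the dm-crypt mapper naming scheme, and the function names are all assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// osdInfo is a simplified stand-in for the metadata the prepare job
// re-hydrates after a manual deletion; the fields are illustrative.
type osdInfo struct {
	BlockPath string // the main (raw) block device, e.g. /dev/sdb
	Encrypted bool
}

// mapperName derives a dm-crypt mapper name from the block path.
// The naming scheme here is hypothetical, for illustration only.
func mapperName(blockPath string) string {
	return strings.TrimPrefix(strings.ReplaceAll(blockPath, "/", "-"), "-") + "-block-dmcrypt"
}

// deviceToList picks the device a "ceph-volume raw list" style call
// should be pointed at: for an encrypted OSD, listing the main block
// returns nothing, so the opened encrypted (mapper) device is
// targeted instead.
func deviceToList(o osdInfo) string {
	if o.Encrypted {
		return "/dev/mapper/" + mapperName(o.BlockPath)
	}
	return o.BlockPath
}

func main() {
	fmt.Println(deviceToList(osdInfo{BlockPath: "/dev/sdb"}))
	fmt.Println(deviceToList(osdInfo{BlockPath: "/dev/sdc", Encrypted: true}))
}
```

The point of the sketch is only the branch: an unencrypted OSD is listed by its raw device, while an encrypted one must be listed through its opened mapper device.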

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@leseb leseb force-pushed the bz-2032656 branch 5 times, most recently from b5805ee to f2c0384 Compare December 16, 2021 14:03
@leseb leseb marked this pull request as ready for review December 16, 2021 14:37
@BlaineEXE (Member) commented:

As far as the code goes, the logic here is a little hard to follow, but I trust that it's the right thing to do. I'm having a bit of trouble understanding the underlying case and failure mode.

Notably, I'm not quite following the below text from the commit message:

... so that the OSD deployment can be deployed.

logger.Infof("failed to get devices already provisioned by ceph-volume. %v", err)
}
osds = append(osds, lvmOsds...)
} else {
@BlaineEXE (Member) commented Dec 16, 2021:

Why do we need this to be an else, versus just continuing on after the if?

@leseb (Member Author) replied:

I just felt that the else clearly separates the two branches; it makes that large code block easier to read and parse.

@BlaineEXE (Member) commented:

Also, a bit of a complaint: what good is ceph-volume if we have to do so much work to get around it and handle so many special cases?

@leseb (Member Author) commented Dec 17, 2021

As far as the code goes, the logic here is a little hard to follow, but I trust that it's the right thing to do. I'm having a bit of trouble understanding the underlying case and failure mode.

Notably, I'm not quite following the below text from the commit message:

... so that the OSD deployment can be deployed.

Let's say deploy/rook-ceph-osd-0 is removed manually. The operator reconciles the CephCluster on deletion events for the OSD deployment. At this stage, the operator must run the prepare job again to re-initialize the OSD information in order to run the OSD deployment again.
That's what I meant by this sentence. Hope it clarifies.
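That flow can be sketched as a toy reconciler step. To be clear, this is an illustration of the description above, not the operator's real controller code; the type, field, and function names are assumptions.

```go
package main

import "fmt"

// osdState models the facts the reconciler looks at: whether the OSD
// deployment still exists, and whether the OSD information has been
// re-hydrated by a prepare job. Names are illustrative only.
type osdState struct {
	DeploymentName   string
	DeploymentExists bool
	InfoHydrated     bool
}

// nextAction mirrors the described flow: a deleted deployment first
// needs the prepare job to re-hydrate the OSD info, and only then can
// the OSD deployment be created again.
func nextAction(s osdState) string {
	switch {
	case s.DeploymentExists:
		return "nothing to do"
	case !s.InfoHydrated:
		return "run prepare job for " + s.DeploymentName
	default:
		return "recreate " + s.DeploymentName
	}
}

func main() {
	// deploy/rook-ceph-osd-0 was deleted manually, info not yet re-hydrated
	s := osdState{DeploymentName: "rook-ceph-osd-0"}
	fmt.Println(nextAction(s))
}
```

The ordering is the whole point: the prepare job runs first so the deployment creation that follows has OSD information to work with.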

@leseb (Member Author) commented Dec 17, 2021

Also, a bit of a complaint: what is the good of ceph-volume if we have to do utterly so much work to work around it and have so many special cases?

The problem is that the raw/pvc case in c-v has been left behind a little bit and the encryption piece is only used by Rook as far as I can tell. Also, it's hard for c-v to know all the details we do in Rook.

@leseb leseb requested a review from BlaineEXE December 20, 2021 09:22
@travisn (Member) left a comment:

Have we seen this issue happen in production? It seems like it's only a test scenario, where someone intentionally deleted an OSD deployment and we need to go create it again.
A couple of small suggestions, and approving since I'll be out this week...

.github/workflows/canary-integration-test.yml (outdated; resolved)
pkg/operator/ceph/cluster/osd/create.go (resolved)
@leseb (Member Author) commented Dec 21, 2021

Comments have been addressed.

@leseb leseb merged commit 77b18f0 into rook:master Dec 21, 2021
@leseb leseb deleted the bz-2032656 branch December 21, 2021 15:14
mergify bot added a commit that referenced this pull request Dec 21, 2021
osd: handle removal of encrypted osd deployment (backport #9434)
3 participants