osd: re-open encrypted disk during osd-prepare-job if closed #11338
Conversation
Force-pushed from 84d2183 to ecf050d.
Let's include unit tests to make sure the stdin handling works properly. Can we get a string that simulates a LUKS device?
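For reference, a minimal sketch of such a test, assuming Rook's `exectest.MockExecutor` exposes a `MockExecuteCommandWithStdin` hook matching the executor call used here (hypothetical; the mock's real field name and signature may differ):

```go
package osd

import (
	"testing"

	"github.com/rook/rook/pkg/clusterd"
	exectest "github.com/rook/rook/pkg/util/exec/test"
	"github.com/stretchr/testify/assert"
)

func TestOpenLUKSPassesPassphraseOnStdin(t *testing.T) {
	var gotStdin string
	executor := &exectest.MockExecutor{
		// Hypothetical hook: captures what openLUKS sends on stdin.
		MockExecuteCommandWithStdin: func(command string, stdin *string, args ...string) error {
			assert.Equal(t, "cryptsetup", command)
			assert.Equal(t, "luksOpen", args[0])
			if stdin != nil {
				gotStdin = *stdin
			}
			return nil
		},
	}
	ctx := &clusterd.Context{Executor: executor}
	err := openLUKS(ctx, "/dev/xvdbb", "set1-data-0-block-dmcrypt", "my-passphrase")
	assert.NoError(t, err)
	assert.Equal(t, "my-passphrase", gotStdin) // passphrase arrived via stdin
}
```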
pkg/daemon/ceph/osd/volume.go
Outdated
```go
// Let's re-open the block to re-hydrate the OSDInfo.
logger.Debugf("encrypted block device %q is not open, opening it now", block)
passphrase := os.Getenv(oposd.CephVolumeEncryptedKeyEnvVarName)
target := fmt.Sprintf("%s-%s", os.Getenv(oposd.PVCNameEnvVarName), oposd.DmcryptBlockType)
```
I have concerns about this. I'm not sure how strictly guaranteed the behavior we are relying on is. I wonder if there are kubernetes versions or variants where the pvc name is not part of the dm name.
I think we should be able to use the `encryptedBlock` variable here, no? That should contain `-dmcrypt` according to the lines above.
> I have concerns about this. I'm not sure how strictly guaranteed the behavior we are relying on is. I wonder if there are kubernetes versions or variants where the pvc name is not part of the dm name.

This target name can be anything, since we open LUKS with this target explicitly and then call c-v on it.

> I think we should be able to use the `encryptedBlock` variable here, no? That should contain `-dmcrypt` according to the lines above.

c-v prefers the `/dev/mapper/<name>` format; that's why I kept the two variables separate.
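To illustrate the split between the two variables, a self-contained sketch (the constant value is an assumption, inferred from the lsblk output later in this thread):

```go
package main

import "fmt"

const dmcryptBlockType = "block-dmcrypt" // assumed value of oposd.DmcryptBlockType

func main() {
	pvcName := "ocs-deviceset-0-data-0cljr5"
	// target: the bare device-mapper name handed to `cryptsetup luksOpen`.
	target := fmt.Sprintf("%s-%s", pvcName, dmcryptBlockType)
	// encryptedBlock: the /dev/mapper path that ceph-volume prefers.
	encryptedBlock := fmt.Sprintf("/dev/mapper/%s", target)
	fmt.Println(target)         // ocs-deviceset-0-data-0cljr5-block-dmcrypt
	fmt.Println(encryptedBlock) // /dev/mapper/ocs-deviceset-0-data-0cljr5-block-dmcrypt
}
```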
I see. `target` is what `luksOpen` uses for its `name` parameter, and is created by opening the disk. Thanks :)
pkg/daemon/ceph/osd/encryption.go
Outdated
```go
func openLUKS(context *clusterd.Context, disk, target, passphrase string) error {
	args := []string{"luksOpen", "--verbose", "--allow-discards", disk, target}
	err := context.Executor.ExecuteCommandWithStdin(cryptsetupBinary, &passphrase, args...)
	if err != nil {
		logger.Errorf("failed to open LUKS device %q", disk)
		return err
	}

	return nil
}
```
I notice this doesn't have a timeout. For the prepare job, I know we have been more lax on resource requests/timeouts in the past. IMO, it would still be good to have a timeout, just to make sure we can continue with provisioning if a disk is slow to respond or stalls for some other reason. For LUKS, I think 2.5 minutes (150 seconds) should be good, given that a device should normally open within 30 seconds.
@travisn wdyt?
Why go 5 times the 30s? Either way the timeout seems plenty long so I don't have a strong opinion other than it seems like 30s would be sufficient.
> Why go 5 times the 30s? Either way the timeout seems plenty long so I don't have a strong opinion other than it seems like 30s would be sufficient.

Set it to 30, I think that's sufficient.
I suggest a higher value so that we don't get flakiness in our tests because of the value. A lot of times, in resource-constrained environments, what is "supposed" to happen timing-wise takes much longer. I use 5x as a rule of thumb because when I've used 3x, there is often still flakiness. Maybe 120 seconds is good, but I would really suggest using more than 30 and even more than 90.
I did some digging. LUKS uses a static 2 seconds to open a disk with a key; it's strange that it's static, but it apparently is. There can be 8 key slots, which means an unlock attempt that tries every key could take 16 seconds total, plus any overhead for CPU time. I believe the normal case is that the unlock takes less than 3 seconds, but let's not risk creating a flaky system for our CI or for corner-case users by using too aggressive a timeout. I still think having at least a 90-second timeout is reasonable.
The most common case will be that it finishes successfully in ~2 seconds. In the event of some corruption, it will take 1.5 minutes. If I/O to the disk is stalling, it will take 1.5 minutes for each disk to fail. For our CI, I/O stalling seems like it's not uncommon, and there are many users who try to run on R-Pis and other small systems.
On a different angle, if a server with 60 disks attached to a SAS controller has a controller failure such that I/Os stall indefinitely without a noticeable failure, the prepare job won't fail until after 1.5 hours. That is quite a while, but IMO if a user has that many disks on a node, they likely expect installation and upgrades will take a fair amount of time, and 1.5 hours isn't a lot compared to a rolling upgrade that could take a week for large clusters.
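To make that arithmetic concrete, a throwaway sketch (the constant name matches the one added in this PR; the disk count is the hypothetical worst case from above):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const luksOpenCmdTimeOut = 90 * time.Second // value eventually settled on in this PR
	const keySlots = 8                          // LUKS key slots, ~2s each
	fmt.Println("worst-case single unlock:", keySlots*2*time.Second) // 16s

	const disksPerNode = 60 // the SAS-controller example above
	fmt.Println("worst-case prepare job:", disksPerNode*luksOpenCmdTimeOut) // 1h30m0s
}
```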
Changed it to 90.
Force-pushed from 92281df to b22605c.
pkg/daemon/ceph/osd/encryption.go
Outdated
```go
args := []string{"luksOpen", "--verbose", "--allow-discards", disk, target}
err := context.Executor.ExecuteCommandWithStdin(luksOpenCmdTimeOut, cryptsetupBinary, &passphrase, args...)
if err != nil {
	logger.Errorf("failed to open LUKS device %q", disk)
```
Why both log and return err? The error is going to be surfaced eventually so just returning a wrapped error seems sufficient.
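A minimal sketch of the pattern being suggested, using the same signatures as the diff above (the exact final code may differ):

```go
// openLUKS opens an encrypted device, returning a wrapped error instead of
// logging and returning it separately; the caller surfaces it once.
func openLUKS(context *clusterd.Context, disk, target, passphrase string) error {
	args := []string{"luksOpen", "--verbose", "--allow-discards", disk, target}
	if err := context.Executor.ExecuteCommandWithStdin(luksOpenCmdTimeOut, cryptsetupBinary, &passphrase, args...); err != nil {
		return errors.Wrapf(err, "failed to open LUKS device %q", disk)
	}
	return nil
}
```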
pkg/daemon/ceph/osd/volume.go
Outdated
```go
// Let's re-open the block to re-hydrate the OSDInfo.
logger.Debugf("encrypted block device %q is not open, opening it now", block)
passphrase := os.Getenv(oposd.CephVolumeEncryptedKeyEnvVarName)
target := fmt.Sprintf("%s-%s", os.Getenv(oposd.PVCNameEnvVarName), oposd.DmcryptBlockType)
```
Don't we have a helper for this somewhere in the OSD code?
Found it, exported it, and I'm using it now.
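For reference, the helper presumably reduces to something like this sketch (hypothetical body, inferred from the `fmt.Sprintf` call it replaced and the later call site `oposd.EncryptionDMName(pvcName, oposd.DmcryptBlockType)`):

```go
package oposd // hypothetical package placement

import "fmt"

// EncryptionDMName returns the device-mapper name for an encrypted OSD
// block, e.g. "<pvc-name>-block-dmcrypt" (hypothetical reconstruction).
func EncryptionDMName(pvcName, blockType string) string {
	return fmt.Sprintf("%s-%s", pvcName, blockType)
}
```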
Force-pushed from 3270c13 to 9909db9.
Added some tests to simulate stdin and the timeout.

I didn't get what you mean.
Mostly nits.
pkg/daemon/ceph/osd/encryption.go
Outdated
```go
cryptsetupBinary   = "cryptsetup"
dmsetupBinary      = "dmsetup"
luksOpenCmdTimeOut = 30 * time.Second
```
This looks quite long for a timeout; I would think that 10 seconds is more than enough already.
pkg/daemon/ceph/osd/encryption.go
Outdated
```
@@ -160,6 +162,16 @@ func dumpLUKS(context *clusterd.Context, disk string) (string, error) {
	return cryptsetupOut, nil
}

func openLUKS(context *clusterd.Context, disk, target, passphrase string) error {
```
Suggested change:
```diff
-func openLUKS(context *clusterd.Context, disk, target, passphrase string) error {
+func openEncryptedDevice(context *clusterd.Context, disk, target, passphrase string) error {
```
pkg/daemon/ceph/osd/encryption.go
Outdated
```go
args := []string{"luksOpen", "--verbose", "--allow-discards", disk, target}
err := context.Executor.ExecuteCommandWithStdin(luksOpenCmdTimeOut, cryptsetupBinary, &passphrase, args...)
if err != nil {
	return errors.Wrapf(err, "failed to open LUKS device %q", disk)
```
Suggested change:
```diff
-	return errors.Wrapf(err, "failed to open LUKS device %q", disk)
+	return errors.Wrapf(err, "failed to open encrypted device %q", disk)
```
pkg/daemon/ceph/osd/volume.go
Outdated
```go
	return nil, errors.Errorf("failed to find the encrypted block device for %q, not opened?", block)
}
// The encrypted block is not opened.
// The OSD deployment has been removed manually.
```
Why remove the mention of the node reboot? If the deployment is removed manually the encrypted block is left open.
> Why remove the mention of the node reboot? If the deployment is removed manually the encrypted block is left open.

This is not the only case; the encrypted block is also closed when a CSI-provisioned PVC is unmounted due to OSD pod deletion.
This is not clear, please clarify the scenario with CSI. If we have more than the reboot then they should all be listed in the code as comments.
> This is not clear, please clarify the scenario with CSI. If we have more than the reboot then they should all be listed in the code as comments.

I don't think we can assume that an encrypted device opened by a pod remains open after the pod is deleted.

IMO, the case of Rook using disks directly on the node, or an in-tree-driver-provisioned PVC, leaving the encrypted device open when the pod restarts is the corner case. I am thinking we should go ahead with the assumption that the encrypted device is no longer open once the pod, be it osd-prepare-job or OSD, is deleted.

Does this make sense?
> I don't think we can assume that an encrypted device opened by a pod remains open after the pod is deleted.

Why? What would close it then?

> IMO, the case of Rook using disks directly on the node, or an in-tree-driver-provisioned PVC, leaving the encrypted device open when the pod restarts is the corner case. I am thinking we should go ahead with the assumption that the encrypted device is no longer open once the pod, be it osd-prepare-job or OSD, is deleted.

Even in the PVC case, where Rook consumes a PVC for the OSD disk, if that PVC is detached the encrypted block is still open and available, but broken. We've had cases like this where the underlying block device disappears.

> Does this make sense?
Sounds good 👍🏻 I don't mean to argue too much either; if you saw it, then I trust you. Also, as a reference, see this code: https://github.com/leseb/rook/blob/61279e98201d5d80090f57433a21e297b2f5efb5/pkg/operator/ceph/cluster/osd/spec.go#L172-L212. We still have cases where the PV block is detached and the encrypted block is still around.
Okay 👍 One example is as follows.

This is how it looks on a node with a functioning OSD:
```
[rakshith@fedora ~]$ oc debug node/ip-10-0-165-109.us-east-2.compute.internal
Starting pod/ip-10-0-165-109us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.165.109
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop1         7:1    0   100G  0 loop
rbd0        252:0    0    40G  0 disk  /var/lib/kubelet/pods/d3adec98-e695-4ba3-a110-ff654c5a5f5c/volumes/kubernetes.io~csi/pvc-a3508c23-c949-43bd-bbfe-e00b52
nvme0n1     259:0    0   120G  0 disk
|-nvme0n1p1 259:1    0     1M  0 part
|-nvme0n1p2 259:2    0   127M  0 part
|-nvme0n1p3 259:3    0   384M  0 part  /boot
`-nvme0n1p4 259:4    0 119.5G  0 part  /sysroot
nvme1n1     259:5    0    50G  0 disk  /var/lib/kubelet/pods/2e1983d0-25cf-436c-b289-ce7acc6625c4/volumes/kubernetes.io~csi/pvc-0a2fa1bb-6af1-4ca1-a624-12d9f0
nvme2n1     259:6    0   100G  0 disk
`-ocs-deviceset-0-data-0cljr5-block-dmcrypt
            253:0    0   100G  0 crypt
```
This is how it looks after OSD removal is done:
```
[rakshith@fedora ~]$ oc debug node/ip-10-0-152-142.us-east-2.compute.internal
Starting pod/ip-10-0-152-142us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.152.142
If you don't see a command prompt, try pressing enter.
sh-4.4#
sh-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
rbd0        252:0    0    40G  0 disk /host/var/lib/kubelet/pods/31cdabdd-9f4a-4490-b8a2-cd9a9e1f1458/volumes/kubernetes.io~csi/pvc-76c14ea6-8664-4116-b86e-80a
nvme0n1     259:0    0   120G  0 disk
|-nvme0n1p1 259:1    0     1M  0 part
|-nvme0n1p2 259:2    0   127M  0 part
|-nvme0n1p3 259:3    0   384M  0 part /host/boot
`-nvme0n1p4 259:4    0 119.5G  0 part /host/sysroot
nvme1n1     259:5    0    50G  0 disk /host/var/lib/kubelet/pods/46933254-64a2-43a9-a5b5-d4904ccc2025/volumes/kubernetes.io~csi/pvc-95b398fd-3d76-456d-903f-8bc
```
When the osd-prepare job pod starts:
```
sh-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0         7:0    0   100G  0 loop
rbd0        252:0    0    40G  0 disk /var/lib/kubelet/pods/31cdabdd-9f4a-4490-b8a2-cd9a9e1f1458/volumes/kubernetes.io~csi/pvc-76c14ea6-8664-4116-b86e-80abcf33
nvme0n1     259:0    0   120G  0 disk
|-nvme0n1p1 259:1    0     1M  0 part
|-nvme0n1p2 259:2    0   127M  0 part
|-nvme0n1p3 259:3    0   384M  0 part /boot
`-nvme0n1p4 259:4    0 119.5G  0 part /sysroot
nvme1n1     259:5    0    50G  0 disk /var/lib/kubelet/pods/46933254-64a2-43a9-a5b5-d4904ccc2025/volumes/kubernetes.io~csi/pvc-95b398fd-3d76-456d-903f-8bca598f
nvme3n2     259:7    0   100G  0 disk
sh-4.4# exit
```
Ok, but `ocs-deviceset-0-data-0cljr5-block-dmcrypt` is copied to `/var/lib/ceph/osd/ceph-0/block`; this is the file the OSD uses to run, and I believe this file is still there.
> Ok, but `ocs-deviceset-0-data-0cljr5-block-dmcrypt` is copied to `/var/lib/ceph/osd/ceph-0/block`; this is the file the OSD uses to run, and I believe this file is still there.

That is a file inside the container, right? Does it go away with the deletion of the OSD pod?
No, this is a HostPath volume, so the file persists on the node.
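For context, a minimal sketch of how a HostPath volume keeps that file on the node across pod restarts (hypothetical names and path; see the spec.go link above for Rook's actual wiring):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	hostPathType := corev1.HostPathDirectoryOrCreate
	// Because the volume is backed by a directory on the node itself,
	// anything written under it (e.g. the copied block file) survives
	// container and pod deletion.
	vol := corev1.Volume{
		Name: "activate-osd", // hypothetical volume name
		VolumeSource: corev1.VolumeSource{
			HostPath: &corev1.HostPathVolumeSource{
				Path: "/var/lib/rook/rook-ceph/ceph-0", // illustrative path
				Type: &hostPathType,
			},
		},
	}
	fmt.Printf("%+v\n", vol)
}
```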
pkg/daemon/ceph/osd/volume.go
Outdated
```go
target := oposd.EncryptionDMName(pvcName, oposd.DmcryptBlockType)
err = openLUKS(context, block, target, passphrase)
if err != nil {
	return nil, errors.Errorf("failed to open encrypted block device %q on %q", block, target)
```
Use `errors.Wrapf()` and report `err`.
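For illustration, the requested change would look roughly like this (a sketch; the merged wording may differ):

```go
if err != nil {
	// Wrap the underlying error instead of discarding it with errors.Errorf.
	return nil, errors.Wrapf(err, "failed to open encrypted block device %q on %q", block, target)
}
```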
Force-pushed from 09af2ac to d8d6c66, then from d8d6c66 to f988d7f.
LGTM! Thanks!
```go
logger.Debugf("encrypted block device %q is not open, opening it now", block)
passphrase := os.Getenv(oposd.CephVolumeEncryptedKeyEnvVarName)
if passphrase == "" {
	return nil, errors.Errorf("encryption passphrase is empty in env var %q", oposd.CephVolumeEncryptedKeyEnvVarName)
```
To consider in a future PR, we should write the secret to a file instead of using an env var, as mentioned in #10994.
Maybe we should do that in this PR so that we don't risk forgetting to fix it.
That actually requires changes in quite a lot of places. It is better to do it in a separate future PR.
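A rough sketch of the file-based approach for that future PR (hypothetical helper; cryptsetup's `--key-file` flag is real, everything else here is illustrative):

```go
package main

import (
	"fmt"
	"os"
)

// writeKeyFile persists the passphrase to a 0600 temp file so cryptsetup
// can read it via --key-file instead of from an environment variable.
func writeKeyFile(passphrase string) (string, error) {
	f, err := os.CreateTemp("", "luks-key-")
	if err != nil {
		return "", err
	}
	defer f.Close()
	if err := f.Chmod(0o600); err != nil {
		return "", err
	}
	if _, err := f.WriteString(passphrase); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	path, err := writeKeyFile("not-a-real-passphrase")
	if err != nil {
		panic(err)
	}
	defer os.Remove(path)
	// Then: cryptsetup luksOpen --key-file <path> <disk> <target>
	fmt.Println("key file at", path)
}
```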
This commit implements this corner case during the osd-prepare job:

```
The encrypted block is not opened, this is an extreme corner case
The OSD deployment has been removed manually AND the node rebooted
So we need to re-open the block to re-hydrate the OSDInfo.
Handling this case would mean, writing the encryption key on a temporary
file, then call luksOpen to open the encrypted block and then call
ceph-volume to list against the opened encrypted block.
We don't implement this, yet and return an error.
```

When the underlying PVCs for OSDs are CSI-provisioned, the encrypted device is closed when the PVC is unmounted due to the OSD pod being deleted. Therefore, this may occur more frequently and needs to be handled. This commit implements the fix for that.

Signed-off-by: Rakshith R <rar@redhat.com>
Force-pushed from f988d7f to bde286e.
osd: re-open encrypted disk during osd-prepare-job if closed (backport #11338)
Description of your changes:
This commit implements this corner case during the osd-prepare job.

When the underlying PVCs for OSDs are CSI-provisioned, the encrypted device is closed when the PVC is unmounted due to the OSD pod being deleted. Therefore, this is no longer a corner case and needs to be handled. This commit implements the fix for that.

Signed-off-by: Rakshith R <rar@redhat.com>
refer: #9434
Steps:
Before fix:
After fix: