
osd: re-open encrypted disk during osd-prepare-job if closed #11338

Merged
merged 1 commit into from Dec 1, 2022

Conversation

Member

@Rakshith-R Rakshith-R commented Nov 21, 2022

Description of your changes:

This commit handles, during the osd-prepare job, the corner case described by this existing code comment:

The encrypted block is not opened, this is an extreme corner case
The OSD deployment has been removed manually AND the node rebooted
So we need to re-open the block to re-hydrate the OSDInfo.

Handling this case would mean, writing the encryption key on a
temporary file, then call luksOpen to open the encrypted block and
then call ceph-volume to list against the opened encrypted block.
We don't implement this, yet and return an error.

When the underlying PVCs for OSDs are CSI-provisioned, the encrypted device is closed when the PVC is unmounted because the OSD pod was deleted. Therefore, this is no longer a corner case and needs to be handled. This commit implements the fix; a sketch of the resulting flow follows the steps below.

Signed-off-by: Rakshith R rar@redhat.com

refer: #9434


Steps:

  • Deploy an encrypted CephCluster backed by dynamically CSI-provisioned PVCs for the OSDs
  • Delete the OSD deployment
  • Watch the osd-prepare-job logs
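At a high level, the recovery path added here behaves like the sketch below, which mirrors the command sequence visible in the after-fix log: open the LUKS device with the key fed on stdin, run `ceph-volume raw list` against the opened mapping to re-hydrate the OSDInfo, then close the mapping again. This is a simplified, stand-alone illustration; it shells out directly with os/exec instead of going through Rook's Executor, and the env var name is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	disk := "/mnt/set1-data-1s9k9w"                       // PVC-backed block device
	target := "set1-data-1s9k9w-block-dmcrypt"            // dm-crypt mapping name derived from the PVC name
	passphrase := os.Getenv("CEPH_VOLUME_DMCRYPT_SECRET") // env var name is an assumption

	// 1. Re-open the encrypted block, feeding the key on stdin (no temporary key file needed).
	open := exec.Command("cryptsetup", "luksOpen", "--verbose", "--allow-discards", disk, target)
	open.Stdin = strings.NewReader(passphrase)
	if out, err := open.CombinedOutput(); err != nil {
		fmt.Fprintf(os.Stderr, "failed to open encrypted device %q: %v\n%s", disk, err, out)
		os.Exit(1)
	}

	// 2. Re-hydrate the OSDInfo by listing against the opened mapping.
	out, err := exec.Command("ceph-volume", "raw", "list", "/dev/mapper/"+target, "--format", "json").Output()
	if err != nil {
		fmt.Fprintf(os.Stderr, "ceph-volume raw list failed: %v\n", err)
		os.Exit(1)
	}
	fmt.Println(string(out))

	// 3. Close the mapping again; the OSD pod re-opens it when it starts.
	_ = exec.Command("cryptsetup", "--verbose", "luksClose", "/dev/mapper/"+target).Run()
}
```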

Before fix:
[rakshith@fedora rook]$ k logs rook-ceph-osd-prepare-set1-data-1s9k9w-5gqcp 
Defaulted container "provision" out of: provision, copy-bins (init), blkdevmapper (init)
2022-11-21 13:53:11.342069 I | cephcmd: desired devices to configure osds: [{Name:/mnt/set1-data-1s9k9w OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}]
2022-11-21 13:53:11.342608 I | rookcmd: starting Rook v1.10.0-alpha.0.398.gfdc2ab55d with arguments '/rook/rook ceph osd provision'
2022-11-21 13:53:11.342620 I | rookcmd: flag values: --cluster-id=6ec0a6ea-99a5-4281-a70f-322e2ecdd666, --cluster-name=my-cluster, --data-device-filter=, --data-device-path-filter=, --data-devices=[{"id":"/mnt/set1-data-1s9k9w","storeConfig":{"osdsPerDevice":1}}], --encrypted-device=true, --force-format=false, --help=false, --location=, --log-level=DEBUG, --metadata-device=, --node-name=set1-data-1s9k9w, --operator-image=, --osd-crush-device-class=, --osd-crush-initial-weight=, --osd-database-size=0, --osd-wal-size=576, --osds-per-device=1, --pvc-backed-osd=true, --service-account=
2022-11-21 13:53:11.342624 I | op-mon: parsing mon endpoints: a=10.107.169.17:6789
2022-11-21 13:53:11.349240 I | op-osd: CRUSH location=root=default host=set1-data-1s9k9w
2022-11-21 13:53:11.349255 I | cephcmd: crush location of osd: root=default host=set1-data-1s9k9w
2022-11-21 13:53:11.349261 D | cephosd: encryption configuration detecting, populating kek to an env variable
2022-11-21 13:53:11.349274 D | cephosd: cluster-wide encryption is enabled with kubernetes secrets and the kek is attached to the provision env spec
2022-11-21 13:53:11.349341 D | exec: Running command: dmsetup version
2022-11-21 13:53:11.352040 I | cephosd: Library version:   1.02.181-RHEL8 (2021-10-20)
Driver version:    4.43.0
2022-11-21 13:53:11.360727 D | cephclient: No ceph configuration override to merge as "rook-config-override" configmap is empty
2022-11-21 13:53:11.360746 I | cephclient: writing config file /var/lib/rook/enc/enc.config
2022-11-21 13:53:11.360862 I | cephclient: generated admin config in /var/lib/rook/enc
2022-11-21 13:53:11.360988 D | cephclient: config file @ /etc/ceph/ceph.conf:
[global]
fsid                = 2b0dc3de-c683-4ccd-b2c5-4a26c9fb964d
mon initial members = a
mon host            = [v2:10.107.169.17:3300,v1:10.107.169.17:6789]

[client.admin]
keyring = /var/lib/rook/enc/client.admin.keyring
2022-11-21 13:53:11.360993 I | cephosd: discovering hardware
2022-11-21 13:53:11.361003 D | exec: Running command: lsblk /mnt/set1-data-1s9k9w --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2022-11-21 13:53:11.363890 D | sys: lsblk output: "SIZE=\"10737418240\" ROTA=\"0\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/rbd0\" KNAME=\"/dev/rbd0\" MOUNTPOINT=\"\" FSTYPE=\"\""
2022-11-21 13:53:11.363933 D | exec: Running command: sgdisk --print /mnt/set1-data-1s9k9w
2022-11-21 13:53:11.366033 D | exec: Running command: udevadm info --query=property /dev/rbd0
2022-11-21 13:53:11.370839 D | sys: udevadm info output: "DEVNAME=/dev/rbd0\nDEVPATH=/devices/rbd/0/block/rbd0\nDEVTYPE=disk\nMAJOR=251\nMINOR=0\nSUBSYSTEM=block\nTAGS=:systemd:\nUSEC_INITIALIZED=2068042192"
2022-11-21 13:53:11.370862 I | cephosd: creating and starting the osds
2022-11-21 13:53:11.370968 D | cephosd: desiredDevices are [{Name:/mnt/set1-data-1s9k9w OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}]
2022-11-21 13:53:11.370976 D | cephosd: context.Devices are:
2022-11-21 13:53:11.371001 D | cephosd: &{Name:/mnt/set1-data-1s9k9w Parent: HasChildren:false DevLinks: Size:10737418240 UUID:401766c2-db89-40be-a77e-82546497753b Serial: Type:data Rotational:false Readonly:false Partitions:[] Filesystem: Mountpoint: Vendor: Model: WWN: WWNVendorExtension: Empty:false CephVolumeData: RealPath:/dev/rbd0 KernelName:rbd0 Encrypted:false}
2022-11-21 13:53:11.371021 I | cephosd: old lsblk can't detect bluestore signature, so try to detect here
2022-11-21 13:53:11.371079 D | exec: Running command: cryptsetup luksDump /mnt/set1-data-1s9k9w
2022-11-21 13:53:11.392270 D | exec: Running command: lsblk --noheadings --path --list --output NAME /mnt/set1-data-1s9k9w
2022-11-21 13:53:11.394738 I | cephosd: skipping device "/mnt/set1-data-1s9k9w": failed to detect if there is already an osd. failed to find the encrypted block device for "/mnt/set1-data-1s9k9w", not opened?.
2022-11-21 13:53:11.403057 I | cephosd: configuring osd devices: {"Entries":{}}
2022-11-21 13:53:11.403331 I | cephosd: no new devices to configure. returning devices already configured with ceph-volume.
2022-11-21 13:53:11.403443 D | exec: Running command: pvdisplay -C -o lvpath --noheadings /mnt/set1-data-1s9k9w
2022-11-21 13:53:11.482979 W | cephosd: failed to retrieve logical volume path for "/mnt/set1-data-1s9k9w". exit status 5
2022-11-21 13:53:11.483049 D | exec: Running command: lsblk /mnt/set1-data-1s9k9w --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2022-11-21 13:53:11.489011 D | sys: lsblk output: "SIZE=\"10737418240\" ROTA=\"0\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/rbd0\" KNAME=\"/dev/rbd0\" MOUNTPOINT=\"\" FSTYPE=\"\""
2022-11-21 13:53:11.489308 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list  --format json
2022-11-21 13:53:11.901643 D | cephosd: {}
2022-11-21 13:53:11.901695 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
2022-11-21 13:53:11.901705 D | exec: Running command: cryptsetup luksDump /mnt/set1-data-1s9k9w
2022-11-21 13:53:11.921288 D | exec: Running command: lsblk --noheadings --path --list --output NAME /mnt/set1-data-1s9k9w
2022-11-21 13:53:11.935549 C | rookcmd: failed to configure devices: failed to get device already provisioned by ceph-volume raw: failed to find the encrypted block device for "/mnt/set1-data-1s9k9w", not opened?
After fix:
2022-11-21 13:41:14.216619 I | cephcmd: desired devices to configure osds: [{Name:/mnt/set1-data-1s9k9w OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}]
2022-11-21 13:41:14.219440 I | rookcmd: starting Rook v1.8.0-alpha.0.1391.g48a3879ca with arguments '/rook/rook ceph osd provision'
2022-11-21 13:41:14.219458 I | rookcmd: flag values: --cluster-id=6ec0a6ea-99a5-4281-a70f-322e2ecdd666, --cluster-name=my-cluster, --data-device-filter=, --data-device-path-filter=, --data-devices=[{"id":"/mnt/set1-data-1s9k9w","storeConfig":{"osdsPerDevice":1}}], --encrypted-device=true, --force-format=false, --help=false, --location=, --log-level=DEBUG, --metadata-device=, --node-name=set1-data-1s9k9w, --operator-image=, --osd-crush-device-class=, --osd-crush-initial-weight=, --osd-database-size=0, --osd-wal-size=576, --osds-per-device=1, --pvc-backed-osd=true, --service-account=
2022-11-21 13:41:14.219464 I | op-mon: parsing mon endpoints: a=10.107.169.17:6789
2022-11-21 13:41:14.226495 I | op-osd: CRUSH location=root=default host=set1-data-1s9k9w
2022-11-21 13:41:14.226519 I | cephcmd: crush location of osd: root=default host=set1-data-1s9k9w
2022-11-21 13:41:14.226526 D | cephosd: encryption configuration detecting, populating kek to an env variable
2022-11-21 13:41:14.226542 D | cephosd: cluster-wide encryption is enabled with kubernetes secrets and the kek is attached to the provision env spec
2022-11-21 13:41:14.226546 D | exec: Running command: dmsetup version
2022-11-21 13:41:14.229333 I | cephosd: Library version:   1.02.181-RHEL8 (2021-10-20)
Driver version:    4.43.0
2022-11-21 13:41:14.237855 D | cephclient: No ceph configuration override to merge as "rook-config-override" configmap is empty
2022-11-21 13:41:14.237901 I | cephclient: writing config file /var/lib/rook/enc/enc.config
2022-11-21 13:41:14.238068 I | cephclient: generated admin config in /var/lib/rook/enc
2022-11-21 13:41:14.238220 D | cephclient: config file @ /etc/ceph/ceph.conf:
[global]
fsid                = 2b0dc3de-c683-4ccd-b2c5-4a26c9fb964d
mon initial members = a
mon host            = [v2:10.107.169.17:3300,v1:10.107.169.17:6789]

[client.admin]
keyring = /var/lib/rook/enc/client.admin.keyring
2022-11-21 13:41:14.238256 I | cephosd: discovering hardware
2022-11-21 13:41:14.238264 D | exec: Running command: lsblk /mnt/set1-data-1s9k9w --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2022-11-21 13:41:14.245250 D | sys: lsblk output: "SIZE=\"10737418240\" ROTA=\"0\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/rbd0\" KNAME=\"/dev/rbd0\" MOUNTPOINT=\"\" FSTYPE=\"\""
2022-11-21 13:41:14.245382 D | exec: Running command: sgdisk --print /mnt/set1-data-1s9k9w
2022-11-21 13:41:14.248506 D | exec: Running command: udevadm info --query=property /dev/rbd0
2022-11-21 13:41:14.253954 D | sys: udevadm info output: "DEVNAME=/dev/rbd0\nDEVPATH=/devices/rbd/0/block/rbd0\nDEVTYPE=disk\nMAJOR=251\nMINOR=0\nSUBSYSTEM=block\nTAGS=:systemd:\nUSEC_INITIALIZED=2068042192"
2022-11-21 13:41:14.253977 I | cephosd: creating and starting the osds
2022-11-21 13:41:14.253991 D | cephosd: desiredDevices are [{Name:/mnt/set1-data-1s9k9w OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}]
2022-11-21 13:41:14.253994 D | cephosd: context.Devices are:
2022-11-21 13:41:14.254029 D | cephosd: &{Name:/mnt/set1-data-1s9k9w Parent: HasChildren:false DevLinks: Size:10737418240 UUID:51bebcba-2dad-45d6-a6ec-7a88f35a822c Serial: Type:data Rotational:false Readonly:false Partitions:[] Filesystem: Mountpoint: Vendor: Model: WWN: WWNVendorExtension: Empty:false CephVolumeData: RealPath:/dev/rbd0 KernelName:rbd0 Encrypted:false}
2022-11-21 13:41:14.254043 I | cephosd: old lsblk can't detect bluestore signature, so try to detect here
2022-11-21 13:41:14.254070 D | exec: Running command: cryptsetup luksDump /mnt/set1-data-1s9k9w
2022-11-21 13:41:14.265422 D | exec: Running command: lsblk --noheadings --path --list --output NAME /mnt/set1-data-1s9k9w
2022-11-21 13:41:14.267579 D | exec: Running command: lsblk --noheadings --output TYPE /dev/mapper/set1-data-1s9k9w-block-dmcrypt
2022-11-21 13:41:14.270470 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /dev/mapper/set1-data-1s9k9w-block-dmcrypt --format json
2022-11-21 13:41:14.612073 D | cephosd: {
    "959b244d-043e-4603-9e89-20067b85888d": {
        "ceph_fsid": "2b0dc3de-c683-4ccd-b2c5-4a26c9fb964d",
        "device": "/dev/mapper/set1-data-1s9k9w-block-dmcrypt",
        "osd_id": 0,
        "osd_uuid": "959b244d-043e-4603-9e89-20067b85888d",
        "type": "bluestore"
    }
}
2022-11-21 13:41:14.612701 D | exec: Running command: cryptsetup luksDump /mnt/set1-data-1s9k9w
2022-11-21 13:41:14.632525 D | exec: Running command: cryptsetup --verbose luksClose /dev/mapper/set1-data-1s9k9w-block-dmcrypt
2022-11-21 13:41:14.652763 I | cephosd: dm version:
Command successful.
2022-11-21 13:41:14.652784 I | cephosd: 1 ceph-volume raw osd devices configured on this node
2022-11-21 13:41:14.652790 I | cephosd: skipping device "/mnt/set1-data-1s9k9w": already in use by a raw OSD, no need to reconfigure.
2022-11-21 13:41:14.659073 I | cephosd: configuring osd devices: {"Entries":{}}
2022-11-21 13:41:14.659089 I | cephosd: no new devices to configure. returning devices already configured with ceph-volume.
2022-11-21 13:41:14.659095 D | exec: Running command: pvdisplay -C -o lvpath --noheadings /mnt/set1-data-1s9k9w
2022-11-21 13:41:14.721583 W | cephosd: failed to retrieve logical volume path for "/mnt/set1-data-1s9k9w". exit status 5
2022-11-21 13:41:14.721605 D | exec: Running command: lsblk /mnt/set1-data-1s9k9w --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2022-11-21 13:41:14.724336 D | sys: lsblk output: "SIZE=\"10737418240\" ROTA=\"0\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/rbd0\" KNAME=\"/dev/rbd0\" MOUNTPOINT=\"\" FSTYPE=\"\""
2022-11-21 13:41:14.724364 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list  --format json
2022-11-21 13:41:15.104965 D | cephosd: {}
2022-11-21 13:41:15.105013 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
2022-11-21 13:41:15.105022 D | exec: Running command: cryptsetup luksDump /mnt/set1-data-1s9k9w
2022-11-21 13:41:15.111933 D | exec: Running command: lsblk --noheadings --path --list --output NAME /mnt/set1-data-1s9k9w
2022-11-21 13:41:15.113515 D | cephosd: encrypted block device "/mnt/set1-data-1s9k9w" is not open, opening it now
2022-11-21 13:41:15.113535 D | exec: Running command: cryptsetup luksOpen --verbose --allow-discards /mnt/set1-data-1s9k9w set1-data-1s9k9w-block-dmcrypt
2022-11-21 13:41:16.628777 D | exec: Key slot 0 unlocked.
2022-11-21 13:41:16.628795 D | exec: Command successful.
2022-11-21 13:41:16.630546 D | exec: Running command: lsblk --noheadings --output TYPE /dev/mapper/set1-data-1s9k9w-block-dmcrypt
2022-11-21 13:41:16.632304 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /dev/mapper/set1-data-1s9k9w-block-dmcrypt --format json
2022-11-21 13:41:16.904036 D | cephosd: {
    "959b244d-043e-4603-9e89-20067b85888d": {
        "ceph_fsid": "2b0dc3de-c683-4ccd-b2c5-4a26c9fb964d",
        "device": "/dev/mapper/set1-data-1s9k9w-block-dmcrypt",
        "osd_id": 0,
        "osd_uuid": "959b244d-043e-4603-9e89-20067b85888d",
        "type": "bluestore"
    }
}
2022-11-21 13:41:16.904087 D | exec: Running command: lsblk /dev/mapper/set1-data-1s9k9w-block-dmcrypt --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2022-11-21 13:41:16.906371 D | sys: lsblk output: "SIZE=\"10720641024\" ROTA=\"0\" RO=\"0\" TYPE=\"crypt\" PKNAME=\"\" NAME=\"/dev/mapper/set1-data-1s9k9w-block-dmcrypt\" KNAME=\"/dev/dm-2\" MOUNTPOINT=\"\" FSTYPE=\"ceph_bluestore\""
2022-11-21 13:41:16.906393 I | cephosd: setting device class "ssd" for device "/dev/mapper/set1-data-1s9k9w-block-dmcrypt"
2022-11-21 13:41:16.906401 D | exec: Running command: cryptsetup luksDump /mnt/set1-data-1s9k9w
2022-11-21 13:41:16.913197 D | exec: Running command: cryptsetup --verbose luksClose /dev/mapper/set1-data-1s9k9w-block-dmcrypt
2022-11-21 13:41:16.925925 I | cephosd: dm version:
Command successful.
2022-11-21 13:41:16.925950 I | cephosd: 1 ceph-volume raw osd devices configured on this node
2022-11-21 13:41:16.925981 I | cephosd: devices = [{ID:0 Cluster:ceph UUID:959b244d-043e-4603-9e89-20067b85888d DevicePartUUID: DeviceClass:ssd BlockPath:/dev/mapper/set1-data-1s9k9w-block-dmcrypt MetadataPath: WalPath: SkipLVRelease:true Location:root=default host=set1-data-1s9k9w LVBackedPV:false CVMode:raw Store:bluestore TopologyAffinity: Encrypted:true}]

@Rakshith-R Rakshith-R changed the title osd: re-open encrypted disk during osd-prepare-job if closed. osd: re-open encrypted disk during osd-prepare-job if closed Nov 21, 2022
Member

@BlaineEXE BlaineEXE left a comment

Let's include unit tests to make sure the stdin is working properly. Can we get a string that simulates a LUKS device?

// Let's re-open the block to re-hydrate the OSDInfo.
logger.Debugf("encrypted block device %q is not open, opening it now", block)
passphrase := os.Getenv(oposd.CephVolumeEncryptedKeyEnvVarName)
target := fmt.Sprintf("%s-%s", os.Getenv(oposd.PVCNameEnvVarName), oposd.DmcryptBlockType)
Member

I have concerns about this. I'm not sure how strictly guaranteed the behavior we are relying on is. I wonder if there are kubernetes versions or variants where the pvc name is not part of the dm name.

I think we should be able to use the encryptedBlock variable here, no? That should contain -dmcrypt according to the lines above.

Member Author

I have concerns about this. I'm not sure how strictly guaranteed the behavior we are relying on is. I wonder if there are kubernetes versions or variants where the pvc name is not part of the dm name.

This target name can be anything, since we'll be opening LUKS with this target explicitly and calling c-v against it.

I think we should be able to use the encryptedBlock variable here, no? That should contain -dmcrypt according to the lines above.

c-v prefers the /dev/mapper/<name> format; that's why I kept the two variables separate.
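In other words, the two values differ only in the /dev/mapper/ prefix; a minimal illustration (the variable and PVC names here are assumptions):

```go
package main

import "fmt"

func main() {
	pvcName := "set1-data-1s9k9w"

	// target: the dm name handed to `cryptsetup luksOpen <disk> <target>`.
	target := fmt.Sprintf("%s-%s", pvcName, "block-dmcrypt")

	// encryptedBlock: the device-mapper path that `ceph-volume raw list` is pointed at.
	encryptedBlock := fmt.Sprintf("/dev/mapper/%s", target)

	fmt.Println(target)         // set1-data-1s9k9w-block-dmcrypt
	fmt.Println(encryptedBlock) // /dev/mapper/set1-data-1s9k9w-block-dmcrypt
}
```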

Member

I see. target is what luksOpen uses for its name parameter and is created by opening the disk. Thanks :)

Comment on lines 163 to 173
func openLUKS(context *clusterd.Context, disk, target, passphrase string) error {
args := []string{"luksOpen", "--verbose", "--allow-discards", disk, target}
err := context.Executor.ExecuteCommandWithStdin(cryptsetupBinary, &passphrase, args...)
if err != nil {
logger.Errorf("failed to open LUKS device %q", disk)
return err
}

return nil
}
Member

I notice this doesn't have a timeout. For the prepare job, I know we have been more lax on resource requests/timeouts in the past. IMO, it would still be good to have a timeout just to make sure we can continue with provisioning if a disk is slow to respond or fails for some other reason. For LUKS, I think 2 1/2 minutes (150 seconds) should be good, given that a device should normally open within 30 seconds.

@travisn wdyt?
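As a side note, the kind of per-command timeout being discussed can be expressed in plain Go with exec.CommandContext. This is only a sketch of the mechanism, not Rook's Executor implementation; the timeout value and device names are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// runCryptsetup runs cryptsetup with the passphrase on stdin and kills the
// process if it has not finished within the given timeout.
func runCryptsetup(timeout time.Duration, passphrase string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, "cryptsetup", args...)
	cmd.Stdin = strings.NewReader(passphrase)
	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			return fmt.Errorf("cryptsetup timed out after %s: %w", timeout, err)
		}
		return err
	}
	return nil
}

func main() {
	err := runCryptsetup(90*time.Second, "my-passphrase",
		"luksOpen", "--verbose", "--allow-discards", "/mnt/set1-data-1s9k9w", "set1-data-1s9k9w-block-dmcrypt")
	fmt.Println(err)
}
```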

Member

Why go 5 times the 30s? Either way the timeout seems plenty long so I don't have a strong opinion other than it seems like 30s would be sufficient.

Member Author

Why go 5 times the 30s? Either way the timeout seems plenty long so I don't have a strong opinion other than it seems like 30s would be sufficient.

Set it to 30, I think that's sufficient.

Member

@BlaineEXE BlaineEXE Nov 29, 2022

I suggest a higher value so that we don't get flakiness in our tests because of the value. A lot of times, in resource-constrained environments, what is "supposed" to happen timing-wise takes much longer. I use 5x as a rule of thumb because when I've used 3x, there is often still flakiness. Maybe 120 seconds is good, but I would really suggest using more than 30 and even more than 90.

Member

I did some digging. LUKS uses a static 2 seconds to open a disk with a key, which is strange, but it apparently is static. There can be 8 key slots, which means an unlock attempt that tries every key could take 16 seconds in total, plus any overhead for CPU time. I believe the normal case is that the unlock should take less than 3 seconds, but let's not risk creating a flaky system for our CI or for corner-case users by using too aggressive a timeout. I still think having at least a 90 second timeout is reasonable.

The most common case will be that it finishes successfully in ~2 seconds. In the event of some corruption, it will take 1.5 minutes. If I/O to the disk is stalling, it will take 1.5 minutes for each disk to fail. For our CI, I/O stalling seems like it's not uncommon, and there are many users who try to run on R-Pis and other small systems.

On a different angle, if a server with 60 disks attached to a SAS controller has a controller failure such that I/Os stall indefinitely without a noticeable failure, the prepare job won't fail until after 1.5 hours. That is quite a while, but IMO if a user has that many disks on a node, they likely expect installation and upgrades will take a fair amount of time, and 1.5 hours isn't a lot compared to a rolling upgrade that could take a week for large clusters.

Member Author

Changed it to 90.

@travisn travisn requested a review from leseb November 28, 2022 21:14
@Rakshith-R Rakshith-R force-pushed the fix-enc-osd-prepare-job branch 4 times, most recently from 92281df to b22605c Compare November 29, 2022 10:39
args := []string{"luksOpen", "--verbose", "--allow-discards", disk, target}
err := context.Executor.ExecuteCommandWithStdin(luksOpenCmdTimeOut, cryptsetupBinary, &passphrase, args...)
if err != nil {
logger.Errorf("failed to open LUKS device %q", disk)
Member

Why both log and return err? The error is going to be surfaced eventually so just returning a wrapped error seems sufficient.
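Concretely, the suggestion is to drop the Errorf call and return a wrapped error, which is the shape the code takes later in this review. A minimal sketch (using github.com/pkg/errors, as the surrounding code does; the stdin and timeout handling are omitted here):

```go
package main

import (
	"fmt"
	"os/exec"

	"github.com/pkg/errors"
)

// openLUKS wraps the failure with context instead of logging and returning it,
// so the error is surfaced (and logged) exactly once by the caller.
func openLUKS(disk, target string) error {
	cmd := exec.Command("cryptsetup", "luksOpen", "--verbose", "--allow-discards", disk, target)
	if err := cmd.Run(); err != nil {
		return errors.Wrapf(err, "failed to open LUKS device %q", disk)
	}
	return nil
}

func main() {
	if err := openLUKS("/mnt/set1-data-1s9k9w", "set1-data-1s9k9w-block-dmcrypt"); err != nil {
		fmt.Println(err)
	}
}
```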

pkg/daemon/ceph/osd/volume.go
// Let's re-open the block to re-hydrate the OSDInfo.
logger.Debugf("encrypted block device %q is not open, opening it now", block)
passphrase := os.Getenv(oposd.CephVolumeEncryptedKeyEnvVarName)
target := fmt.Sprintf("%s-%s", os.Getenv(oposd.PVCNameEnvVarName), oposd.DmcryptBlockType)
Member

Don't we have a helper for this somewhere in the OSD code?

Member Author

Found, exported, and using it now.

@Rakshith-R Rakshith-R force-pushed the fix-enc-osd-prepare-job branch 4 times, most recently from 3270c13 to 9909db9 Compare November 29, 2022 11:26
@Rakshith-R
Member Author

Let's include unit tests to make sure the stdin is working properly.

Added some tests to simulate stdin and timeout (see the sketch after this comment).

Can we get a string that simulates a LUKS device?

I didn't get what you mean.
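For reference, the added tests are along these lines. This sketch uses a hand-rolled fake executor rather than Rook's test helpers, so the interface and names below are illustrative; the point is only to assert that the passphrase reaches stdin and that the timeout is passed through.

```go
package osd

import (
	"testing"
	"time"

	"github.com/pkg/errors"
)

// executor stands in for the Rook executor interface used by openLUKS in this sketch.
type executor interface {
	ExecuteCommandWithStdin(timeout time.Duration, command string, stdin *string, args ...string) error
}

const luksOpenCmdTimeOut = 90 * time.Second

// openLUKS is a simplified copy of the function under review.
func openLUKS(e executor, disk, target, passphrase string) error {
	args := []string{"luksOpen", "--verbose", "--allow-discards", disk, target}
	if err := e.ExecuteCommandWithStdin(luksOpenCmdTimeOut, "cryptsetup", &passphrase, args...); err != nil {
		return errors.Wrapf(err, "failed to open LUKS device %q", disk)
	}
	return nil
}

// fakeExecutor records what it was called with so the test can assert on it.
type fakeExecutor struct {
	gotStdin   string
	gotTimeout time.Duration
}

func (f *fakeExecutor) ExecuteCommandWithStdin(timeout time.Duration, command string, stdin *string, args ...string) error {
	f.gotTimeout = timeout
	if stdin != nil {
		f.gotStdin = *stdin
	}
	return nil
}

func TestOpenLUKSPassesPassphraseAndTimeout(t *testing.T) {
	e := &fakeExecutor{}
	if err := openLUKS(e, "/mnt/disk", "disk-block-dmcrypt", "super-secret"); err != nil {
		t.Fatal(err)
	}
	if e.gotStdin != "super-secret" {
		t.Fatalf("expected passphrase on stdin, got %q", e.gotStdin)
	}
	if e.gotTimeout != luksOpenCmdTimeOut {
		t.Fatalf("expected %s timeout, got %s", luksOpenCmdTimeOut, e.gotTimeout)
	}
}
```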

Member

@leseb leseb left a comment

mostly nits

dmsetupBinary = "dmsetup"
cryptsetupBinary = "cryptsetup"
dmsetupBinary = "dmsetup"
luksOpenCmdTimeOut = 30 * time.Second
Member

This looks quite long for a timeout; I would think that 10 sec is more than enough already.

@@ -160,6 +162,16 @@ func dumpLUKS(context *clusterd.Context, disk string) (string, error) {
return cryptsetupOut, nil
}

func openLUKS(context *clusterd.Context, disk, target, passphrase string) error {
Member

Suggested change
func openLUKS(context *clusterd.Context, disk, target, passphrase string) error {
func openEncryptedDevice(context *clusterd.Context, disk, target, passphrase string) error {

args := []string{"luksOpen", "--verbose", "--allow-discards", disk, target}
err := context.Executor.ExecuteCommandWithStdin(luksOpenCmdTimeOut, cryptsetupBinary, &passphrase, args...)
if err != nil {
return errors.Wrapf(err, "failed to open LUKS device %q", disk)
Member

Suggested change
return errors.Wrapf(err, "failed to open LUKS device %q", disk)
return errors.Wrapf(err, "failed to open encrypted device %q", disk)

return nil, errors.Errorf("failed to find the encrypted block device for %q, not opened?", block)
}
// The encrypted block is not opened.
// The OSD deployment has been removed manually.
Member

Why remove the mention of the node reboot? If the deployment is removed manually the encrypted block is left open.

Member Author

Why remove the mention of the node reboot? If the deployment is removed manually the encrypted block is left open.

This is not the only case; the encrypted block is also closed when a CSI-provisioned PVC is unmounted due to OSD pod deletion.

Member

This is not clear, please clarify the scenario with CSI. If we have more than the reboot then they should all be listed in the code as comments.

Member Author

This is not clear, please clarify the scenario with CSI. If we have more than the reboot then they should all be listed in the code as comments.

I don't think we can assume that an encrypted device opened by a pod remains open after the pod is deleted.

IMO, the case of Rook using disks directly on the node / in-tree driver provisioned PVCs and leaving the encrypted device open after the pod restarts is the corner case. I think we should go ahead with the assumption that the encrypted device is no longer open once the pod, be it osd-prepare-job or osd, is deleted.

Does this make sense?
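For context, whether the mapping is still open is exactly what the prepare job probes with the `lsblk --noheadings --path --list --output NAME <disk>` call visible in the logs above. Roughly, as a simplified sketch (not the actual Rook helper):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// encryptedBlockFor returns the dm-crypt child of disk, or "" when the
// encrypted block is not currently open (e.g. after the CSI PVC was remounted).
func encryptedBlockFor(disk string) (string, error) {
	out, err := exec.Command("lsblk", "--noheadings", "--path", "--list", "--output", "NAME", disk).Output()
	if err != nil {
		return "", err
	}
	for _, dev := range strings.Fields(string(out)) {
		if strings.HasPrefix(dev, "/dev/mapper/") {
			return dev, nil
		}
	}
	return "", nil
}

func main() {
	block, err := encryptedBlockFor("/mnt/set1-data-1s9k9w")
	if err != nil {
		fmt.Println("lsblk failed:", err)
		return
	}
	if block == "" {
		fmt.Println("encrypted block is not open; it needs to be re-opened")
		return
	}
	fmt.Println("encrypted block already open at", block)
}
```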

Member

This is not clear, please clarify the scenario with CSI. If we have more than the reboot then they should all be listed in the code as comments.

I don't think we can assume that an encrypted device opened by a pod remains open after the pod is deleted.

Why? What would close it then?

IMO, the case of Rook using disks directly on the node / in-tree driver provisioned PVCs and leaving the encrypted device open after the pod restarts is the corner case. I think we should go ahead with the assumption that the encrypted device is no longer open once the pod, be it osd-prepare-job or osd, is deleted.

Even in the PVC case, where Rook consumes a PVC for the OSD disk, if that PVC is detached the encrypted block is still open and available but broken. We've had cases like this where the underlying block device disappears.

Does this make sense?

Member

Sounds good 👍🏻 I don't mean to argue too much either; if you saw it, then I trust you. Also, as a reference, see this code: https://github.com/leseb/rook/blob/61279e98201d5d80090f57433a21e297b2f5efb5/pkg/operator/ceph/cluster/osd/spec.go#L172-L212. We still have cases where the PV block is detached and the encrypted block is still around.

Member Author

okay 👍
one example is as follows:

This is how it looks on a node with a functioning OSD:

[rakshith@fedora ~]$ oc debug node/ip-10-0-165-109.us-east-2.compute.internal
Starting pod/ip-10-0-165-109us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.165.109
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop1         7:1    0   100G  0 loop  
rbd0        252:0    0    40G  0 disk  /var/lib/kubelet/pods/d3adec98-e695-4ba3-a110-ff654c5a5f5c/volumes/kubernetes.io~csi/pvc-a3508c23-c949-43bd-bbfe-e00b52
nvme0n1     259:0    0   120G  0 disk  
|-nvme0n1p1 259:1    0     1M  0 part  
|-nvme0n1p2 259:2    0   127M  0 part  
|-nvme0n1p3 259:3    0   384M  0 part  /boot
`-nvme0n1p4 259:4    0 119.5G  0 part  /sysroot
nvme1n1     259:5    0    50G  0 disk  /var/lib/kubelet/pods/2e1983d0-25cf-436c-b289-ce7acc6625c4/volumes/kubernetes.io~csi/pvc-0a2fa1bb-6af1-4ca1-a624-12d9f0
nvme2n1     259:6    0   100G  0 disk  
`-ocs-deviceset-0-data-0cljr5-block-dmcrypt
            253:0    0   100G  0 crypt 

This is how it looks after OSD removal is done:

[rakshith@fedora ~]$ oc debug node/ip-10-0-152-142.us-east-2.compute.internal
Starting pod/ip-10-0-152-142us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.152.142
If you don't see a command prompt, try pressing enter.
sh-4.4# 
sh-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
rbd0        252:0    0    40G  0 disk /host/var/lib/kubelet/pods/31cdabdd-9f4a-4490-b8a2-cd9a9e1f1458/volumes/kubernetes.io~csi/pvc-76c14ea6-8664-4116-b86e-80a
nvme0n1     259:0    0   120G  0 disk 
|-nvme0n1p1 259:1    0     1M  0 part 
|-nvme0n1p2 259:2    0   127M  0 part 
|-nvme0n1p3 259:3    0   384M  0 part /host/boot
`-nvme0n1p4 259:4    0 119.5G  0 part /host/sysroot
nvme1n1     259:5    0    50G  0 disk /host/var/lib/kubelet/pods/46933254-64a2-43a9-a5b5-d4904ccc2025/volumes/kubernetes.io~csi/pvc-95b398fd-3d76-456d-903f-8bc

When the osd-prepare-job pod starts:

sh-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0         7:0    0   100G  0 loop 
rbd0        252:0    0    40G  0 disk /var/lib/kubelet/pods/31cdabdd-9f4a-4490-b8a2-cd9a9e1f1458/volumes/kubernetes.io~csi/pvc-76c14ea6-8664-4116-b86e-80abcf33
nvme0n1     259:0    0   120G  0 disk 
|-nvme0n1p1 259:1    0     1M  0 part 
|-nvme0n1p2 259:2    0   127M  0 part 
|-nvme0n1p3 259:3    0   384M  0 part /boot
`-nvme0n1p4 259:4    0 119.5G  0 part /sysroot
nvme1n1     259:5    0    50G  0 disk /var/lib/kubelet/pods/46933254-64a2-43a9-a5b5-d4904ccc2025/volumes/kubernetes.io~csi/pvc-95b398fd-3d76-456d-903f-8bca598f
nvme3n2     259:7    0   100G  0 disk 
sh-4.4# exit

Member

Ok, but ocs-deviceset-0-data-0cljr5-block-dmcrypt is copied to /var/lib/ceph/osd/ceph-0/block; this is the file the OSD uses to run, and I believe this file is still there.

Member Author

Ok, but ocs-deviceset-0-data-0cljr5-block-dmcrypt is copied to /var/lib/ceph/osd/ceph-0/block; this is the file the OSD uses to run, and I believe this file is still there.

That is a file inside the container, right?
Does it go away with the deletion of the OSD pod?

Member

No, this is a HostPath.

target := oposd.EncryptionDMName(pvcName, oposd.DmcryptBlockType)
err = openLUKS(context, block, target, passphrase)
if err != nil {
return nil, errors.Errorf("failed to open encrypted block device %q on %q", block, target)
Member

Use errors.Wrapf() and report err.

@Rakshith-R Rakshith-R force-pushed the fix-enc-osd-prepare-job branch 2 times, most recently from 09af2ac to d8d6c66 Compare November 29, 2022 13:06
Member

@leseb leseb left a comment

LGTM! Thanks!

logger.Debugf("encrypted block device %q is not open, opening it now", block)
passphrase := os.Getenv(oposd.CephVolumeEncryptedKeyEnvVarName)
if passphrase == "" {
return nil, errors.Errorf("encryption passphrase is empty in env var %q", oposd.CephVolumeEncryptedKeyEnvVarName)
Member

To consider in a future PR, we should write the secret to a file instead of using an env var, as mentioned in #10994.
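For reference, the file-based approach mentioned here would look roughly like the sketch below: write the key to a 0600 temporary file and hand it to cryptsetup via --key-file instead of stdin or an env var. This is only an illustration of the idea, not part of this PR; the paths and names are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func openWithKeyFile(disk, target, passphrase string) error {
	// Write the key to a private temporary file instead of exposing it via env/stdin.
	f, err := os.CreateTemp("", "luks-key-")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name()) // best effort: remove the key material when done
	if err := os.Chmod(f.Name(), 0o600); err != nil {
		return err
	}
	if _, err := f.WriteString(passphrase); err != nil {
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	// cryptsetup reads the key from the file rather than from stdin.
	out, err := exec.Command("cryptsetup", "luksOpen", "--key-file", f.Name(), disk, target).CombinedOutput()
	if err != nil {
		return fmt.Errorf("luksOpen failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	if err := openWithKeyFile("/mnt/set1-data-1s9k9w", "set1-data-1s9k9w-block-dmcrypt", "super-secret"); err != nil {
		fmt.Println(err)
	}
}
```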

Member

Maybe we should do that in this PR so that we don't risk forgetting to fix it.

Member Author

That actually requires changes in quite a lot of places.
It is better to do it in a separate future PR.

This commit implements this corner case during osd-prepare job.
```
The encrypted block is not opened, this is an extreme corner case
The OSD deployment has been removed manually AND the node rebooted
So we need to re-open the block to re-hydrate the OSDInfo.

Handling this case would mean, writing the encryption key on a
temporary file, then call luksOpen to open the encrypted block and
then call ceph-volume to list against the opened encrypted block.
We don't implement this, yet and return an error.
```
When underlying PVC for osd are CSI provisioned, the encrypted device
is closed when PVC is unmounted due to osd pod being deleted.
Therefore, this may occur more frequently and needs to be handled.
This commit implements the fix for the same.

Signed-off-by: Rakshith R <rar@redhat.com>
@BlaineEXE BlaineEXE merged commit 4883a6e into rook:master Dec 1, 2022
mergify bot added a commit that referenced this pull request Dec 1, 2022
osd: re-open encrypted disk during osd-prepare-job if closed (backport #11338)