New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd: handle removal of encrypted osd deployment #9434
Conversation
b5805ee
to
f2c0384
Compare
As far as code, the logic here is a little hard to follow, but I trust that it's the right thing to do here. I am having a little bit of trouble trying to understand the underlying case and mode of failure. Notably, I'm not quite following the below text from the commit message:
|
logger.Infof("failed to get devices already provisioned by ceph-volume. %v", err) | ||
} | ||
osds = append(osds, lvmOsds...) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need this to be else
? Versus just continuing on after the if
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just felt that the else
really breaks the statements, it feels easier to read and parse that large code block.
Also, a bit of a complaint: what is the good of ceph-volume if we have to do utterly so much work to work around it and have so many special cases? |
Let's say |
The problem is that the raw/pvc case in c-v has been left behind a little bit and the encryption piece is only used by Rook as far as I can tell. Also, it's hard for c-v to know all the details we do in Rook. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we seen this issue happen in production? Seems like it's only a test scenario where someone intentionally deleted an osd deployment and we need to go create it again.
A couple small suggestions, and approving since i'll be out this week...
This is handling a tricky scenario where the OSD deployment is manually removed and the OSD never reconvers. This is unlikely to happen, but still OSD should be able to run after that action. Essentially after a manual deletion, we need to run the prepare job again to re-hydrate the OSD information so that the OSD deployment can be deployed. On encryption, it is a little bit tricky since ceph-volume list again the main block won't return anything, so we need to target the encrypted block to list. There is another case this PR does not handle, which is the removal of the OSD deployment and then the node is restarted. This means that the encrypted container is not opened anymore. However, opening it requires more work like writing the key on the filesystem (if not coming from the Kubernete secret, eg,. KMS vault) and then run luksOpen. This is an extreme corner case probably not worth worrying about for now. Signed-off-by: Sébastien Han <seb@redhat.com>
Comments have been addressed. |
osd: handle removal of encrypted osd deployment (backport #9434)
This is handling a tricky scenario where the OSD deployment is manually
removed and the OSD never reconvers. This is unlikely to happen, but
still OSD should be able to run after that action. Essentially after a
manual deletion, we need to run the prepare job again to re-hydrate the
OSD information so that the OSD deployment can be deployed.
On encryption, it is a little bit tricky since ceph-volume list again
the main block won't return anything, so we need to target the encrypted
block to list.
There is another case this PR does not handle, which is the removal of
the OSD deployment and then the node is restarted. This means that the
encrypted container is not opened anymore. However, opening it requires
more work like writing the key on the filesystem (if not coming from the
Kubernete secret, eg,. KMS vault) and then run luksOpen. This is an
extreme corner case probably not worth worrying about for now.
Signed-off-by: Sébastien Han seb@redhat.com
Description of your changes:
Which issue is resolved by this Pull Request:
Resolves #
Checklist:
make codegen
) has been run to update object specifications, if necessary.