New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd: osdSafeToDestroy check should be after osd out #11138
Conversation
currently, osdSafeToDestroy check is done before marking osd out which prevents ceph from rebalancing. With this changes, we are moving check for osdSafeToDestroy after osd is marked out. Signed-off-by: subhamkrai <srai@redhat.com>
if err != nil { | ||
// If we want to force remove the OSD and there was an error let's break outside of | ||
// the loop and proceed with the OSD removal | ||
if forceOSDRemoval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR, but why do we have forceOSDRemoval
check after checking OsdSafeToDestroy
?
If forceOSDRemoval
is true, then can OsdSafeToDestroy
check be skipped completely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, that was the intention for adding forceOsdRemoval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just continuing the offline discussion I had with @sp98 here,
we should skip OsdSafeToDestroy
check when forceOSDRemoval
passed since the purpose of the forceOSDRemoval
flag is that that we don't care of data loss and want to remove osd forcefully. But, if we pass more than one osd id if the removal job that could lead to data loss(#9230 more context here). So, what should be the right way to go with @travisn @BlaineEXE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking personally, I don't think we care if removal is maximally efficient. It just needs to be correct. If there is a possibility that removing the OSD will be safer to remove non-forcefully first, then what exists now is best. If what exists now isn't wrong, maybe we should just leave the code as is.
One aspect of reality that I see many people miss when working with Rook is that Rook is not part of any data plane. Even some of the least efficient code that Rook implements pales compared to the milliseconds that it takes to send/receive network packets to/from the K8s API server or Ceph. In my estimation, focusing on optimizing Rook's code is a wasted effort 95% of the time. Instead, I think Rook will be a better project for users if we prioritize code that is easy to read (and maintain) and is well-tested.
There are lots of articles about "premature optimization," but I like this discussion quite a bit. It talks about premature versus unnecessary optimization and how optimization compares today vs in the 60s/70s when the idea of "premature optimization" was put into a book. https://softwareengineering.stackexchange.com/questions/80084/is-premature-optimization-really-the-root-of-all-evil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it's still valid to check if it's safe to destroy so we can log whether it was safe to destroy, even if we continue and remove it anyway. Agreed it's not a question of code efficiency. In this case logging the results is important to understand what happened. If they do have data loss they can look back at the log and realized they caused a problem by forcing osd removal.
if err != nil { | ||
// If we want to force remove the OSD and there was an error let's break outside of | ||
// the loop and proceed with the OSD removal | ||
if forceOSDRemoval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it's still valid to check if it's safe to destroy so we can log whether it was safe to destroy, even if we continue and remove it anyway. Agreed it's not a question of code efficiency. In this case logging the results is important to understand what happened. If they do have data loss they can look back at the log and realized they caused a problem by forcing osd removal.
testing result:
we can see that first, it is making osd out and then checking |
osd: osdSafeToDestroy check should be after osd out (backport #11138)
Description of your changes:
currently, osdSafeToDestroy check is done before
marking osd out which prevents ceph from rebalancing. With these changes, we are moving check for
osdSafeToDestroy after osd is marked out.
Signed-off-by: subhamkrai srai@redhat.com
Which issue is resolved by this Pull Request:
Resolves # https://bugzilla.redhat.com/show_bug.cgi?id=2128966
Checklist:
skip-ci
on the PR.