Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: osdSafeToDestroy check should be after osd out #11138

Merged
merged 1 commit into from Oct 18, 2022

Conversation

subhamkrai
Copy link
Contributor

@subhamkrai subhamkrai commented Oct 11, 2022

Description of your changes:
currently, osdSafeToDestroy check is done before
marking osd out which prevents ceph from rebalancing. With these changes, we are moving check for
osdSafeToDestroy after osd is marked out.

Signed-off-by: subhamkrai srai@redhat.com

Which issue is resolved by this Pull Request:
Resolves # https://bugzilla.redhat.com/show_bug.cgi?id=2128966

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide).
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

currently, osdSafeToDestroy check is done before
marking osd out which prevents ceph from rebalancing.
With this changes, we are moving check for
osdSafeToDestroy after osd is marked out.

Signed-off-by: subhamkrai <srai@redhat.com>
if err != nil {
// If we want to force remove the OSD and there was an error let's break outside of
// the loop and proceed with the OSD removal
if forceOSDRemoval {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, but why do we have forceOSDRemoval check after checking OsdSafeToDestroy ?
If forceOSDRemoval is true, then can OsdSafeToDestroy check be skipped completely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, that was the intention for adding forceOsdRemoval

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just continuing the offline discussion I had with @sp98 here,

we should skip OsdSafeToDestroy check when forceOSDRemoval passed since the purpose of the forceOSDRemoval flag is that that we don't care of data loss and want to remove osd forcefully. But, if we pass more than one osd id if the removal job that could lead to data loss(#9230 more context here). So, what should be the right way to go with @travisn @BlaineEXE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking personally, I don't think we care if removal is maximally efficient. It just needs to be correct. If there is a possibility that removing the OSD will be safer to remove non-forcefully first, then what exists now is best. If what exists now isn't wrong, maybe we should just leave the code as is.

One aspect of reality that I see many people miss when working with Rook is that Rook is not part of any data plane. Even some of the least efficient code that Rook implements pales compared to the milliseconds that it takes to send/receive network packets to/from the K8s API server or Ceph. In my estimation, focusing on optimizing Rook's code is a wasted effort 95% of the time. Instead, I think Rook will be a better project for users if we prioritize code that is easy to read (and maintain) and is well-tested.

There are lots of articles about "premature optimization," but I like this discussion quite a bit. It talks about premature versus unnecessary optimization and how optimization compares today vs in the 60s/70s when the idea of "premature optimization" was put into a book. https://softwareengineering.stackexchange.com/questions/80084/is-premature-optimization-really-the-root-of-all-evil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it's still valid to check if it's safe to destroy so we can log whether it was safe to destroy, even if we continue and remove it anyway. Agreed it's not a question of code efficiency. In this case logging the results is important to understand what happened. If they do have data loss they can look back at the log and realized they caused a problem by forcing osd removal.

if err != nil {
// If we want to force remove the OSD and there was an error let's break outside of
// the loop and proceed with the OSD removal
if forceOSDRemoval {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it's still valid to check if it's safe to destroy so we can log whether it was safe to destroy, even if we continue and remove it anyway. Agreed it's not a question of code efficiency. In this case logging the results is important to understand what happened. If they do have data loss they can look back at the log and realized they caused a problem by forcing osd removal.

@subhamkrai
Copy link
Contributor Author

testing result:

2022-10-18 11:31:38.884266 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2022-10-18 11:31:39.144571 I | cephosd: validating status of osd.0
2022-10-18 11:31:39.144619 I | cephosd: osd.0 is marked 'DOWN'
2022-10-18 11:31:39.144657 D | exec: Running command: ceph osd find 0 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2022-10-18 11:31:39.418838 I | cephosd: subham: this local testing in 2nd
2022-10-18 11:31:39.418887 I | cephosd: marking osd.0 out
2022-10-18 11:31:39.418925 D | exec: Running command: ceph osd out osd.0 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2022-10-18 11:31:40.206814 D | exec: Running command: ceph osd safe-to-destroy 0 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2022-10-18 11:31:40.472056 I | cephosd: osd.0 is NOT be ok to destroy but force removal is enabled so proceeding with removal
2022-10-18 11:31:40.479337 I | cephosd: removing the OSD deployment "rook-ceph-osd-0"

we can see that first, it is making osd out and then checking osd Safe to destroy

@travisn travisn merged commit 429d209 into rook:master Oct 18, 2022
@subhamkrai subhamkrai deleted the update-osd-removal-job branch October 18, 2022 13:24
mergify bot added a commit that referenced this pull request Oct 18, 2022
osd: osdSafeToDestroy check should be after osd out (backport #11138)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants