ceph: reconcile osd pdb if allowed disruption is 0 #8698
If an OSD is crashlooping, but the PGs are healthy because they have been backfilled, this will cause the pdb reconcile to continue every 30s until the OSD stops crashing, right? Or will this be skipped if the PGs are healthy?
When OSDs are crashlooping and PGs are not healthy, we won't see the default PDB (osdPDBAppName); only the blocking PDBs will be there. This condition won't be hit because of the error on line 412.
It's a good point, though. I'll test it a bit more and confirm.
So the OSD goes into CLBO and the PGs become degraded. When the PGs become active again after rebalancing, we end up with the following:
Observe that Allowed Disruptions is 0 because the OSD is still down. The PDB reconciles because this condition is hit.
Now the user purges the OSD for the CLBO OSD pod. The PDBs move back to the desired state again.
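To make the sequence above easier to follow, here is a minimal sketch of the requeue decision as described in this thread. It is not Rook's actual code: `PDBStatus`, `requeue_after`, the `default-osd-pdb` name, and the `REQUEUE_SECONDS` value are all hypothetical names chosen for illustration, and the real interval may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical requeue interval; the real value in the operator may differ.
REQUEUE_SECONDS = 30


@dataclass
class PDBStatus:
    """Hypothetical stand-in for the status of the default OSD PDB."""
    name: str
    disruptions_allowed: int


def requeue_after(default_pdb: Optional[PDBStatus]) -> Optional[int]:
    """Return a requeue delay in seconds, or None when no requeue is needed.

    Assumption: when PGs are unhealthy, only blocking PDBs exist, so the
    default PDB is absent and this check does not trigger a requeue.
    """
    if default_pdb is None:
        # PGs degraded: only blocking PDBs are present.
        return None
    if default_pdb.disruptions_allowed == 0:
        # Default PDB exists but an OSD is still down: reconcile again later.
        return REQUEUE_SECONDS
    # Default PDB allows a disruption: desired state reached.
    return None


# The three stages described in the comment above.
stages = {
    "PGs degraded, OSD in CLBO": None,
    "PGs active again, OSD still down": PDBStatus("default-osd-pdb", 0),
    "crashing OSD purged": PDBStatus("default-osd-pdb", 1),
}
for label, pdb in stages.items():
    print(label, "->", requeue_after(pdb))
```

Under these assumptions, only the middle stage keeps requeueing, which matches the behavior discussed here: the reconcile repeats until the crashing OSD is removed.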
Ok, so the only way to return to the state of clean OSD PVCs is to remove/replace the crashing OSD. And from your link, the PDBs will continue reconciling every 15s? That might be worth looking into for a separate PR so it isn't so frequent.
Yes, that's true. The only way is to remove the crashed OSD. This behavior was the same even before this PR. I'll take a look (in a separate PR) at whether we can handle this situation better.
@sp98 please open up an issue to track this. Thanks