docs: add doc to recover pod from lost node #7282
Conversation
FWIW, I tried to follow this format for commands.
Testing with OCP 4.7.2 and OCS 4.7 RC3 shows that if a pod is on an OCP node that is powered off, AND the pod mounts a cephrbd volume, the recovery is as follows while the node is powered off:
There is no need to go through the
@subhamkrai what's the status of this PR?
Waiting for @Madhu-1 to review @travisn's comments.
Yes, if the node is dead, RBD removes the watcher after a timeout of 5 minutes, I think. We need to blacklist if the kubelet is dead but the application is still running on the node. Sometimes blacklisting helps us recover faster: when the node is dead we don't need to wait 8-10 minutes.
To summarize, the intention of this PR was to have a simple doc topic that describes how to recover from the node lost scenario. Would documenting these two steps be accurate and sufficient? If a Kubernetes node is lost, all pods on that node that are consuming volumes will be in a stuck state until the admin takes action. When you are sure the node is lost:
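A minimal sketch of what those two steps could look like, assuming the admin first fences the lost node's Ceph client and then force deletes the stuck pod (the client address is the example from the doc under review; the pod name and namespace are placeholders):

```
# 1. Blacklist the stale Ceph client from the lost node so it can no longer write
ceph osd blacklist add 10.130.2.1:0/2076971174

# 2. Force delete the stuck pod so Kubernetes can reschedule it on a healthy node
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
```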
Yes, @travisn, sounds good. Once we get more enhancements in CSI and Kubernetes, we can update the contents of this doc.
```
blacklisting 10.130.2.1:0/2076971174 until 2021-02-21T18:12:53.115328+0000 (3600 sec)
```

After running the above command, the pod will be running within a few minutes. **But don't forget to remove the above blacklist watcher**
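For reference, the blacklist entry can be listed and removed again like this (a sketch assuming the same client address as above):

```
# list current blacklist entries
ceph osd blacklist ls
# remove the entry that was added
ceph osd blacklist rm 10.130.2.1:0/2076971174
```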
At what point should they remove the watcher? When the node is confirmed dead? What if they don't remove the watcher? Does it never get removed automatically?
If the node is dead, the watcher gets removed automatically. If the control plane is dead, the application might still be running; in that case the watcher never gets removed, and then we need to blacklist it.
So we shouldn't tell them to un-blacklist, right? Or else we should be clear when to un-blacklist
So only when the control plane is dead do we need to un-blacklist, because that is the only case where we are blocking a watcher, right?
IIUC from this comment, the `rbd status` output may not be reliable to show a watcher, or at least it doesn't guarantee the node is dead. So it seems like we need to recommend they blacklist a whole node until they can guarantee that node is dead. Otherwise, they risk corruption.
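For context, checking the watchers on an image looks roughly like this (the pool/image name, client ID, and cookie below are illustrative, and the exact output can vary between Ceph releases):

```
$ rbd status replicapool/csi-vol-<uuid>
Watchers:
        watcher=10.130.2.1:0/2076971174 client.14532 cookie=18446462598732840961
```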
Yes, a node might be disconnected from the Kubernetes cluster. We can blacklist the node IP in the Ceph cluster.
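A rough sketch of blacklisting by node IP rather than by a specific client instance; this assumes the Ceph release in use treats a bare IP as matching any client instance from that address, so verify the behavior on your version before relying on it:

```
ceph osd blacklist add 10.130.2.1
```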
@travisn @ShyamsundarR and I were discussing fencing, and we see two problems here. Recently the cephcsi pods in Rook were updated to move away from host networking, and I don't know whether we can hit this problem with multus networking. The below problem exists for both CephFS and RBD.
@ShyamsundarR did I miss anything?
@Madhu-1 Let's open a new issue to track that issue related to host networking. Do we need to recommend re-enabling host networking until it's resolved?
@travisn I need to do some more experimentation and will get back to you on this one.
A few small comments, but let's hold off merging until the comment about host networking is resolved in case it affects these instructions.
this commit adds the doc which has the manual steps to recover from the specific scenario like `on the node lost, the new pod can't mount the same volume`. Signed-off-by: subhamkrai <srai@redhat.com>
Let's hear from RBD expert @idryomov; based on that we can decide what we can do.
This seems prone to data corruption to me. I don't see anything guaranteeing that the node is really lost.
If there is a watcher listed in `rbd status` output then (barring bugs on the OSD side) the node is definitely not lost. This is an easy case: blocklisting the client is the right thing to do.
The problem is that lack of a watch can't be used as an indicator of whether the client is really dead. Quoting my comment on another issue:
Please note that listing watchers and relying on that for telling what needs to be blocklisted is unreliable. An image may still be mapped somewhere (and be written to!) with no watchers in rbd status output.
And there are no watches in the CephFS case at all.
For this procedure to be safe, the node that the pod is being moved from needs to be STONITHed prior to force deleting the pod and allowing Kubernetes to reschedule it. Alternatively, if taking out the entire node is undesirable, something needs to keep track of Ceph entity addresses (`10.130.2.1:0/2076971174` in the proposed change) and unconditionally blocklist based on that, without consulting `rbd status`.
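In other words, the safer flow is to record the entity address when the volume is mapped and blocklist that address directly, without first checking for watchers. A sketch using the address from the proposed change, with the optional expiry given in seconds:

```
ceph osd blacklist add 10.130.2.1:0/2076971174 3600
```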
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
@subhamkrai what's the status of this PR?
@Madhu-1 can you just confirm what changes are left to be done? Thanks
Sorry, but I have lost my origin for this branch, created a new one with the same changes #8742. Thanks
this commit adds the doc which has the manual steps to recover from the specific scenario like `on the lost node, the new pod can't mount the same volume`. Signed-off-by: subhamkrai srai@redhat.com
Description of your changes:
Which issue is resolved by this Pull Request:
Resolves #1507
Checklist:
Code generation (`make codegen`) has been run to update object specifications, if necessary.