
docs: add doc to recover pod from lost node #7282

Closed · wants to merge 1 commit

Conversation

@subhamkrai (Contributor) commented Feb 22, 2021

This commit adds a doc with the manual steps to recover from a specific scenario: on a lost node, the new pod can't mount the same volume.

Signed-off-by: subhamkrai <srai@redhat.com>

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #1507

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@subhamkrai (Contributor, Author) commented Feb 22, 2021

FWIW, I tried to follow this format for commands.

(Review threads on Documentation/ceph-recover-pod-on-LostNode.md and Documentation/ceph-common-issues.md — outdated, resolved)
@netzzer commented Apr 9, 2021

Testing with OCP 4.7.2 and OCS 4.7 RC3 shows that if a pod is on an OCP node that is powered off, AND the pod mounts a Ceph RBD volume, the recovery while the node is powered off is as follows:

  1. Force delete the Terminating pod (oc delete pod <pod_name> --force --grace-period=0)
  2. Wait ~8-10 minutes for the new pod to be scheduled on a new OCP node

There is no need to go through the blacklisting process in Ceph to get the pod Running again.
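
For reference, the force-delete step could look roughly like the sketch below; the namespace and pod name (my-app, my-app-0) are placeholders, and kubectl can be swapped for oc on OpenShift:

```
# Find the pod stuck in Terminating on the lost node (placeholder namespace "my-app")
kubectl get pods -n my-app -o wide

# Force delete it so its controller can recreate it elsewhere
kubectl delete pod my-app-0 -n my-app --force --grace-period=0

# Watch for the replacement pod to be scheduled on a healthy node
kubectl get pods -n my-app -w
```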

@leseb (Member) commented Apr 29, 2021

@subhamkrai what's the status of this PR?

@subhamkrai (Contributor, Author) replied:

> @subhamkrai what's the status of this PR?

Waiting for @Madhu-1 to review @travisn's comments.

@Madhu-1 (Member) commented May 10, 2021

> Testing with OCP 4.7.2 and OCS 4.7 RC3 shows that if a pod is on an OCP node that is powered off, AND the pod mounts a Ceph RBD volume, the recovery is: force delete the Terminating pod (oc delete pod <pod_name> --force --grace-period=0), then wait ~8-10 minutes for the new pod to be scheduled on a new OCP node. There is no need to go through the blacklisting process in Ceph to get the pod Running again.

Yes, if the node is dead, RBD removes the watcher after a timeout (5 minutes, I think). We need to blacklist if the kubelet is dead but the application is still running on the node.

Sometimes blacklisting helps us recover faster: when the node is dead, we don't need to wait for the 8-10 minutes.
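
As a rough sketch of that blacklisting path, run from the Rook toolbox or any host with Ceph admin credentials (the pool and image names below are placeholders; on Ceph Pacific and later the command family is called blocklist rather than blacklist):

```
# Show the watchers on the RBD image backing the stuck PV (placeholder pool/image)
rbd status replicapool/csi-vol-0000-placeholder

# Example output:
#   Watchers:
#     watcher=10.130.2.1:0/2076971174 client.<id> cookie=<cookie>

# Blacklist that watcher address so Ceph fences the old client
ceph osd blacklist add 10.130.2.1:0/2076971174
```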

@travisn (Member) commented May 10, 2021

> Testing with OCP 4.7.2 and OCS 4.7 RC3 shows that ... the recovery is: force delete the Terminating pod, then wait ~8-10 minutes for the new pod to be scheduled on a new OCP node. There is no need to go through the blacklisting process in Ceph to get the pod Running again.

> Yes, if the node is dead, RBD removes the watcher after a timeout (5 minutes, I think). ... Sometimes blacklisting helps us recover faster: when the node is dead, we don't need to wait for the 8-10 minutes.

To summarize, the intention of this PR was to have a simple doc topic that describes how to recover from the node lost scenario. Would documenting these two steps be accurate and sufficient?


If a Kubernetes node is lost, all pods on that node that are consuming volumes will be in a stuck state until the admin takes action. When you are sure the node is lost:

  1. Force delete your application pod. After some timeout (8-10 minutes), your pod will be rescheduled.
    • (add example command to force delete)
  2. To shorten the timeout, blacklist the volume so Rook/Ceph can safely fail over the pod sooner.
    • (add instructions for blacklisting)

@Madhu-1 (Member) commented May 11, 2021

> To summarize, the intention of this PR was to have a simple doc topic that describes how to recover from the node lost scenario. Would documenting these two steps be accurate and sufficient?
>
>   1. Force delete your application pod. After some timeout (8-10 minutes), your pod will be rescheduled.
>   2. To shorten the timeout, blacklist the volume so Rook/Ceph can safely fail over the pod sooner.

Yes, @travisn, sounds good. Once we get more enhancements in CSI and Kubernetes, we can update the contents of this doc.

Review thread on Documentation/ceph-csi-troubleshooting.md, on this excerpt:

```
blacklisting 10.130.2.1:0/2076971174 until 2021-02-21T18:12:53.115328+0000 (3600 sec)
```

After running the above command, the pod will be running within a few minutes. **But don't forget to remove the above blacklist watcher.**
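
For reference, removing the blacklist entry again looks roughly like this, using the address from the output above (on Ceph Pacific and later these are ceph osd blocklist ls / rm):

```
# List the current blacklist entries
ceph osd blacklist ls

# Remove the entry once the old client is confirmed gone
ceph osd blacklist rm 10.130.2.1:0/2076971174
```
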
Member:

At what point should they remove the watcher? When the node is confirmed dead? What if they don't remove the watcher? Does it never get removed automatically?

Member:

If the node is dead, the watcher gets automatically removed. If only the control plane is dead, the application might still be running; in that case the watcher never gets removed, and we need to blacklist it.

Member:

So we shouldn't tell them to un-blacklist, right? Or else we should be clear when to un-blacklist

Contributor (author):

So only when the control plane is dead do we need to un-blacklist, because only in that case are we the ones blocking the watcher, right?

Member:

IIUC from this comment, rbd status may not reliably show a watcher, or at least it doesn't guarantee the node is dead. So it seems like we need to recommend they blacklist the whole node until they can guarantee that node is dead. Otherwise, they risk corruption.

Member:

Yes, a node might be disconnected from the Kubernetes cluster. We can blacklist the node IP in the Ceph cluster.
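
A sketch of what blacklisting a whole node could look like; the node IP is a placeholder, and an address given without a port/nonce is intended to fence every client connection from that host (again, blocklist on Pacific and later):

```
# Blacklist all Ceph client connections coming from the lost node (placeholder IP)
ceph osd blacklist add 10.130.2.1

# Remove the entry after the node has been confirmed dead or reprovisioned
ceph osd blacklist rm 10.130.2.1
```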

@Madhu-1 (Member) commented May 12, 2021

@travisn, @ShyamsundarR and I were discussing fencing; we see two problems here.

Recently the Ceph-CSI pods in Rook have been updated to move away from host networking, and I don't know whether we can hit this problem with Multus networking as well.

The problems below exist for both CephFS and RBD.

  • As we are moving to pod networking, the pod IP of the rbdplugin and cephfsplugin pods will be used as the watcher. We can fence that IP, but what if the pod is restarted and gets a new IP? The pods running on the disconnected node would again get read and write access to the volumes and could corrupt the data.
  • If a plugin pod on one of the other nodes is restarted, there is a slight chance that it gets an IP that is already fenced on the Ceph side, which could lead to read/write access being denied to the existing applications on that node.

@ShyamsundarR, did I miss anything?

@travisn (Member) commented May 12, 2021

@Madhu-1 Let's open a new issue to track that issue related to host networking. Do we need to recommend re-enabling host networking until it's resolved?

@Madhu-1 (Member) commented May 12, 2021

> Do we need to recommend re-enabling host networking until it's resolved?

@travisn I need to do some more experimentation and will get back to you on this one.

@travisn (Member) left a review:

A few small comments, but let's hold off merging until the comment about host networking is resolved in case it affects these instructions.


this commit adds the doc which has the manual steps to recover from the specific scenario like `on the node lost, the new pod can't mount the same volume`.

Signed-off-by: subhamkrai <srai@redhat.com>
@Madhu-1 (Member) commented May 20, 2021

> A few small comments, but let's hold off merging until the comment about host networking is resolved in case it affects these instructions.

Let's hear from RBD expert @idryomov; based on that we can decide what to do.

@idryomov left a comment:

This seems prone to data corruption to me. I don't see anything guaranteeing that the node is really lost.

If there is a watcher listed in rbd status output then (barring bugs on the OSD side) the node is definitely not lost. This is an easy case: blocklisting the client is the right thing to do.

The problem is that lack of a watch can't be used as an indicator of whether the client is really dead. Quoting my comment on another issue:

> Please note that listing watchers and relying on that for telling what needs to be blocklisted is unreliable. An image may still be mapped somewhere (and be written to!) with no watchers in rbd status output.

And there are no watches in the CephFS case at all.

For this procedure to be safe, the node that the pod is being moved from needs to be STONITHed prior to force deleting the pod and allowing Kubernetes to reschedule it. Alternatively, if taking out the entire node is undesirable, something needs to keep track of Ceph entity addresses (10.130.2.1:0/2076971174 in the proposed change) and unconditionally blocklist based on that, without consulting rbd status.

@github-actions (bot) commented:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions bot added the stale label on Aug 19, 2021
@travisn removed the stale label on Aug 19, 2021
@leseb (Member) commented Sep 9, 2021

@subhamkrai what's the status of this PR?

@subhamkrai (Contributor, Author) replied:

@Madhu-1 can you just confirm what changes are left to be done? Thanks

@subhamkrai (Contributor, Author) commented:

Sorry, but I have lost my origin for this branch; I created a new one with the same changes in #8742. Thanks.

@subhamkrai closed this on Sep 17, 2021

Successfully merging this pull request may close these issues:

  • On NodeLost, the new pod can't mount the same volume.

6 participants