
docs: add doc to recover pod from lost node #7282

Closed · wants to merge 1 commit

Conversation

@subhamkrai (Contributor) commented Feb 22, 2021

This commit adds a doc with the manual steps to recover from a specific scenario: on a lost node, the new pod can't mount the same volume.

Signed-off-by: subhamkrai <srai@redhat.com>

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #1507

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@subhamkrai (Contributor, Author) commented Feb 22, 2021

FWIW, I tried to follow this format for commands.

(Review threads on Documentation/ceph-recover-pod-on-LostNode.md and Documentation/ceph-common-issues.md — outdated, resolved)
@netzzer commented Apr 9, 2021

Testing with OCP 4.7.2 and OCS 4.7 RC3 shows that if a pod is on an OCP node that is powered off, AND the pod mounts a Ceph RBD volume, the recovery while the node is powered off is as follows:

  1. Force delete the Terminating pod (oc delete pod <pod_name> --force --grace-period=0)
  2. Wait ~8-10 minutes for the new pod to be scheduled on a new OCP node

There is no need to go through the blacklisting process in Ceph to get the pod Running again.
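
For reference, the force-delete step could look roughly like the sketch below; the namespace and pod name (my-app, my-app-0) are placeholders, and kubectl can be swapped for oc on OpenShift:

```
# Find the pod stuck in Terminating on the lost node (placeholder namespace "my-app")
kubectl get pods -n my-app -o wide

# Force delete it so its controller can recreate it elsewhere
kubectl delete pod my-app-0 -n my-app --force --grace-period=0

# Watch for the replacement pod to be scheduled on a healthy node
kubectl get pods -n my-app -w
```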

@leseb (Member) commented Apr 29, 2021

@subhamkrai what's the status of this PR?

@subhamkrai (Contributor, Author) replied:

> @subhamkrai what's the status of this PR?

Waiting for @Madhu-1 to review @travisn's comments.

@Madhu-1 (Member) commented May 10, 2021

> Testing with OCP 4.7.2 and OCS 4.7 RC3 shows that if a pod is on an OCP node that is powered off, AND the pod mounts a Ceph RBD volume, the recovery is: force delete the Terminating pod (oc delete pod <pod_name> --force --grace-period=0), then wait ~8-10 minutes for the new pod to be scheduled on a new OCP node. There is no need to go through the blacklisting process in Ceph to get the pod Running again.

Yes, if the node is dead, RBD removes the watcher after a timeout (5 minutes, I think). We need to blacklist if the kubelet is dead but the application is still running on the node.

Sometimes blacklisting helps us recover faster: when the node is dead, we don't need to wait for the 8-10 minutes.
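
As a rough sketch of that blacklisting path, run from the Rook toolbox or any host with Ceph admin credentials (the pool and image names below are placeholders; on Ceph Pacific and later the command family is called blocklist rather than blacklist):

```
# Show the watchers on the RBD image backing the stuck PV (placeholder pool/image)
rbd status replicapool/csi-vol-0000-placeholder

# Example output:
#   Watchers:
#     watcher=10.130.2.1:0/2076971174 client.<id> cookie=<cookie>

# Blacklist that watcher address so Ceph fences the old client
ceph osd blacklist add 10.130.2.1:0/2076971174
```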

@travisn (Member) commented May 10, 2021

> Testing with OCP 4.7.2 and OCS 4.7 RC3 shows that ... the recovery is: force delete the Terminating pod, then wait ~8-10 minutes for the new pod to be scheduled on a new OCP node. There is no need to go through the blacklisting process in Ceph to get the pod Running again.

> Yes, if the node is dead, RBD removes the watcher after a timeout (5 minutes, I think). ... Sometimes blacklisting helps us recover faster: when the node is dead, we don't need to wait for the 8-10 minutes.

To summarize, the intention of this PR was to have a simple doc topic that describes how to recover from the node lost scenario. Would documenting these two steps be accurate and sufficient?


If a Kubernetes node is lost, all pods on that node that are consuming volumes will be in a stuck state until the admin takes action. When you are sure the node is lost:

  1. Force delete your application pod. After some timeout (8-10 minutes), your pod will be rescheduled.
    • (add example command to force delete)
  2. To shorten the timeout, blacklist the volume so Rook/Ceph can safely fail over the pod sooner.
    • (add instructions for blacklisting)

@Madhu-1 (Member) commented May 11, 2021

> To summarize, the intention of this PR was to have a simple doc topic that describes how to recover from the node lost scenario. Would documenting these two steps be accurate and sufficient?
>
>   1. Force delete your application pod. After some timeout (8-10 minutes), your pod will be rescheduled.
>   2. To shorten the timeout, blacklist the volume so Rook/Ceph can safely fail over the pod sooner.

Yes, @travisn, sounds good. Once we get more enhancements in CSI and Kubernetes, we can update the contents of this doc.

Review thread on Documentation/ceph-csi-troubleshooting.md, on this excerpt:

```
blacklisting 10.130.2.1:0/2076971174 until 2021-02-21T18:12:53.115328+0000 (3600 sec)
```

After running the above command, the pod will be running within a few minutes. **But don't forget to remove the above blacklist watcher.**
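
For reference, removing the blacklist entry again looks roughly like this, using the address from the output above (on Ceph Pacific and later these are ceph osd blocklist ls / rm):

```
# List the current blacklist entries
ceph osd blacklist ls

# Remove the entry once the old client is confirmed gone
ceph osd blacklist rm 10.130.2.1:0/2076971174
```
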
Member:

At what point should they remove the watcher? When the node is confirmed dead? What if they don't remove the watcher? Does it never get removed automatically?

Member:

If the node is dead, the watcher gets automatically removed. If only the control plane is dead, the application might still be running; in that case the watcher never gets removed, and we need to blacklist it.

Member:

So we shouldn't tell them to un-blacklist, right? Or else we should be clear when to un-blacklist

Contributor (author):

So only when the control plane is dead do we need to un-blacklist, because only in that case are we the ones blocking the watcher, right?

Member:

IIUC from this comment, rbd status may not reliably show a watcher, or at least it doesn't guarantee the node is dead. So it seems like we need to recommend they blacklist the whole node until they can guarantee that node is dead. Otherwise, they risk corruption.

Member:

Yes, a node might be disconnected from the Kubernetes cluster. We can blacklist the node IP in the Ceph cluster.
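
A sketch of what blacklisting a whole node could look like; the node IP is a placeholder, and an address given without a port/nonce is intended to fence every client connection from that host (again, blocklist on Pacific and later):

```
# Blacklist all Ceph client connections coming from the lost node (placeholder IP)
ceph osd blacklist add 10.130.2.1

# Remove the entry after the node has been confirmed dead or reprovisioned
ceph osd blacklist rm 10.130.2.1
```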

@Madhu-1 (Member) commented May 12, 2021

@travisn, @ShyamsundarR and I were discussing fencing; we see two problems here.

Recently the Ceph-CSI pods in Rook have been updated to move away from host networking, and I don't know whether we can hit this problem with Multus networking as well.

The problems below exist for both CephFS and RBD.

  • As we are moving to pod networking, the pod IP of the rbdplugin and cephfsplugin pods will be used as the watcher. We can fence that IP, but what if the pod is restarted and gets a new IP? The pods running on the disconnected node would again get read and write access to the volumes and could corrupt the data.
  • If a plugin pod on one of the other nodes is restarted, there is a slight chance that it gets an IP that is already fenced on the Ceph side, which could lead to read/write access being denied to the existing applications on that node.

@ShyamsundarR, did I miss anything?

@travisn (Member) commented May 12, 2021

@Madhu-1 Let's open a new issue to track that issue related to host networking. Do we need to recommend re-enabling host networking until it's resolved?

@Madhu-1 (Member) commented May 12, 2021

> Do we need to recommend re-enabling host networking until it's resolved?

@travisn I need to do some more experimentation and will get back to you on this one.

@travisn (Member) left a review:

A few small comments, but let's hold off merging until the comment about host networking is resolved in case it affects these instructions.


this commit adds the doc which has the manual steps to recover from the specific scenario like `on the node lost, the new pod can't mount the same volume`.

Signed-off-by: subhamkrai <srai@redhat.com>
@Madhu-1 (Member) commented May 20, 2021

> A few small comments, but let's hold off merging until the comment about host networking is resolved in case it affects these instructions.

Let's hear from RBD expert @idryomov; based on that we can decide what to do.

@idryomov left a comment:

This seems prone to data corruption to me. I don't see anything guaranteeing that the node is really lost.

If there is a watcher listed in rbd status output then (barring bugs on the OSD side) the node is definitely not lost. This is an easy case: blocklisting the client is the right thing to do.

The problem is that lack of a watch can't be used as an indicator of whether the client is really dead. Quoting my comment on another issue:

> Please note that listing watchers and relying on that for telling what needs to be blocklisted is unreliable. An image may still be mapped somewhere (and be written to!) with no watchers in rbd status output.

And there are no watches in the CephFS case at all.

For this procedure to be safe, the node that the pod is being moved from needs to be STONITHed prior to force deleting the pod and allowing Kubernetes to reschedule it. Alternatively, if taking out the entire node is undesirable, something needs to keep track of Ceph entity addresses (10.130.2.1:0/2076971174 in the proposed change) and unconditionally blocklist based on that, without consulting rbd status.

@github-actions (bot) commented:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions bot added the stale label on Aug 19, 2021
@travisn removed the stale label on Aug 19, 2021
@leseb (Member) commented Sep 9, 2021

@subhamkrai what's the status of this PR?

@subhamkrai (Contributor, Author) replied:

@Madhu-1 can you just confirm what changes are left to be done? Thanks

@subhamkrai (Contributor, Author) commented:

Sorry, but I have lost my origin for this branch; I created a new one with the same changes in #8742. Thanks.

@subhamkrai closed this on Sep 17, 2021

Successfully merging this pull request may close these issues:

  • On NodeLost, the new pod can't mount the same volume.

6 participants