docs: add doc to recover a pod from a lost node
This commit adds a doc with the manual
steps to recover from the specific scenario
where `on node loss, the new pod can't mount
the same volume`.

Closes: #1507
Signed-off-by: subhamkrai <srai@redhat.com>
subhamkrai committed Sep 22, 2021
1 parent e8d540c commit 2bf5381
Showing 1 changed file with 49 additions and 0 deletions.
49 changes: 49 additions & 0 deletions Documentation/ceph-csi-troubleshooting.md
@@ -441,3 +441,52 @@ $ rbd ls --id=csi-rbd-node -m=10.111.136.166:6789 --key=AQDpIQhg+v83EhAAgLboWIbl
```

Where `-m` is one of the mon endpoints and the `--key` is the key used by the CSI driver for accessing the Ceph cluster.

## Node Loss

When a node is lost, the application pods that were running on it will be stuck in the `Terminating` state, while the replacement pods rescheduled onto another node remain in the `ContainerCreating` state because they cannot mount the same volume.
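
As an illustration, the symptom looks something like the following. The second pod name below is hypothetical and stands in for the rescheduled replica:

```console
$ kubectl -n rook-ceph get pod
NAME                      READY   STATUS              RESTARTS   AGE
my-app-69cd495f9b-nl6hf   1/1     Terminating         0          2d
my-app-69cd495f9b-x2x8z   0/1     ContainerCreating   0          5m
```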

To allow the application pod to start on another node, force delete the pod.

### Force deleting the pod

To force delete the pod stuck in the `Terminating` state:

```console
$ kubectl -n rook-ceph delete pod my-app-69cd495f9b-nl6hf --grace-period 0 --force
```

After the force delete, wait for a timeout of about 8-10 minutes. If the pod is still not in the `Running` state, continue with the next section to blacklist the node.
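
While waiting, you can inspect the events of the rescheduled pod to see why it is stuck; a minimal sketch, with a hypothetical pod name:

```console
# Look at the Events section for volume attach/mount failures
# that reference the lost node (pod name is hypothetical).
$ kubectl -n rook-ceph describe pod my-app-69cd495f9b-x2x8z
```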

### Blacklisting a node

To shorten the timeout, you can mark the node as "blacklisted" in Ceph so Rook can safely fail over the pod sooner. First, identify the RBD image backing the PVC used by the application pod:

```console
$ PVC_NAME=<pvc-name> # enter the PVC name (run these commands in the PVC's namespace)
$ PV_NAME=$(kubectl get pvc $PVC_NAME -o jsonpath='{.spec.volumeName}') # the PV bound to the PVC
$ IMAGE=$(kubectl get pv $PV_NAME -o jsonpath='{.spec.csi.volumeHandle}' | cut -d '-' -f 6- | awk '{print "csi-vol-"$1}')
$ echo $IMAGE
```
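
The `volumeHandle` of a Ceph-CSI provisioned PV ends with the volume's UUID, and the RBD image in the pool is named `csi-vol-<uuid>`; the `cut`/`awk` pipeline above simply extracts that suffix. A sketch with a hypothetical handle and UUID:

```console
# Hypothetical values, shown only to illustrate the transformation.
$ kubectl get pv $PV_NAME -o jsonpath='{.spec.csi.volumeHandle}'
0001-0009-rook-ceph-0000000000000001-8d0ba728-0e17-11eb-a680-0242ac110004
$ echo $IMAGE
csi-vol-8d0ba728-0e17-11eb-a680-0242ac110004
```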

The solution is to remove the stale watcher on the image by blacklisting its client. Run the following commands from the [Rook toolbox](ceph-toolbox.md):

```console
$ rbd status <image> --pool=<pool name> # use the image name from the previous output
Watchers:
    watcher=10.130.2.1:0/2076971174 client.14206 cookie=18446462598732840961
```

```console
$ ceph osd blacklist add 10.130.2.1:0 # the watcher address from the output above
blacklisting 10.130.2.1:0
```
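
Once the blacklist entry is in place, the stale watch from the lost node is dropped and the rescheduled pod can mount the volume. You can optionally confirm this from the toolbox; the output shown is illustrative:

```console
# The stale watch from the lost node should be gone.
$ rbd status <image> --pool=<pool name>
Watchers: none

# List the current blacklist entries.
$ ceph osd blacklist ls
```
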
### Removing a node blacklist

Within a few minutes of blacklisting the watcher, the pod should be running again. Once it is, remove the node from the blacklist:

```console
$ ceph osd blacklist rm 10.130.2.1:0
un-blacklisting 10.130.2.1:0
```
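
Finally, verify that the rescheduled application pod has reached the `Running` state. The pod name below is hypothetical, matching the earlier example:

```console
$ kubectl -n rook-ceph get pod my-app-69cd495f9b-x2x8z
NAME                      READY   STATUS    RESTARTS   AGE
my-app-69cd495f9b-x2x8z   1/1     Running   0          12m
```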
