diff --git a/Documentation/ceph-csi-troubleshooting.md b/Documentation/ceph-csi-troubleshooting.md
index d27abee4feabf..31b87c9bc4670 100644
--- a/Documentation/ceph-csi-troubleshooting.md
+++ b/Documentation/ceph-csi-troubleshooting.md
@@ -441,3 +441,52 @@ $ rbd ls --id=csi-rbd-node -m=10.111.136.166:6789 --key=AQDpIQhg+v83EhAAgLboWIbl
 ```
 
 Where `-m` is one of the mon endpoints and the `--key` is the key used by the CSI driver for accessing the Ceph cluster.
+
+## Node Loss
+
+When a node is lost, application pods on that node get stuck in the `Terminating` state while a replacement pod is rescheduled and remains in the `ContainerCreating` state.
+
+To allow the application pod to start on another node, force delete the pod.
+
+### Force deleting the pod
+
+To force delete the pod stuck in the `Terminating` state:
+
+```console
+$ kubectl -n rook-ceph delete pod my-app-69cd495f9b-nl6hf --grace-period 0 --force
+```
+
+After the force delete, wait for a timeout of about 8-10 minutes. If the pod is still not in the running state, continue with the next section to blacklist the node.
+
+### Blacklisting a node
+
+To shorten the timeout, you can mark the node as "blacklisted" so Rook can safely fail over the pod sooner.
+
+```console
+$ PVC_NAME= # enter pvc name
+$ IMAGE=$(kubectl get pv "$(kubectl -n rook-ceph get pvc "$PVC_NAME" -o jsonpath='{.spec.volumeName}')" -o jsonpath='{.spec.csi.volumeHandle}' | cut -d '-' -f 6- | awk '{print "csi-vol-"$1}') # resolve the PVC to its PV (assumes the PVC is in the rook-ceph namespace, as in the pod example above), then derive the RBD image name
+$ echo $IMAGE
+```
+
+To release the volume, find the stale watcher on the image and block it by running the commands below from the [Rook toolbox](ceph-toolbox.md):
+
+```console
+$ rbd status --pool= # get image from above output
+```
+>```
+> Watchers:
+> watcher=10.130.2.1:0/2076971174 client.14206 cookie=18446462598732840961
+>```
+
+```console
+$ ceph osd blacklist add 10.130.2.1:0 # use the watcher address from the output above
+blacklisting 10.130.2.1:0
+```
+### Removing a node blacklist
+
+After running the above command, the pod should be running within a few minutes. To remove the blacklist entry for the node:
+
+```console
+$ ceph osd blacklist rm 10.130.2.1:0
+un-blacklisting 10.130.2.1:0
+```