diff --git a/Documentation/ceph-csi-troubleshooting.md b/Documentation/ceph-csi-troubleshooting.md
index d27abee4feabf..922ccd33a7063 100644
--- a/Documentation/ceph-csi-troubleshooting.md
+++ b/Documentation/ceph-csi-troubleshooting.md
@@ -441,3 +441,55 @@
$ rbd ls --id=csi-rbd-node -m=10.111.136.166:6789 --key=AQDpIQhg+v83EhAAgLboWIbl
```

Where `-m` is one of the mon endpoints and `--key` is the key used by the CSI driver for accessing the Ceph cluster.

## Node Loss

When a node is lost, the application pods on that node remain stuck in the `Terminating` state, while the replacement pods scheduled on another node stay in the `ContainerCreating` state.

To allow an application pod to start on another node, force delete the pod.

### Force deleting the pod

To force delete a pod stuck in the `Terminating` state:

```console
$ kubectl -n rook-ceph delete pod my-app-69cd495f9b-nl6hf --grace-period 0 --force
```

After the force delete, wait about 8-10 minutes. If the pod is still not in the `Running` state, continue with the next section to blacklist the node.

### Blacklisting a node

To shorten the timeout, you can mark the node as "blacklisted" from the [Rook toolbox](ceph-toolbox.md) so Rook can safely fail over the pod sooner.

If the Ceph version is at least Pacific (v16.2.x), run:

```console
$ ceph osd blocklist add <NODE_IP> # get the node IP you want to blocklist
blocklisting
```

If the Ceph version is Octopus (v15.2.x) or older, run:

```console
$ ceph osd blacklist add <NODE_IP> # get the node IP you want to blacklist
blacklisting
```

After running the above command, the pod should reach the `Running` state within a few minutes.

### Removing a node blacklist

Once the node is back online and healthy, remove it from the blacklist:
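Before removing an entry, it can help to confirm that the node's IP is actually listed. `ceph osd blocklist ls` (or `ceph osd blacklist ls` on Octopus and older) prints one entry per line, each beginning with the client address. As a rough sketch under that assumption, a small helper (the name `is_blocklisted` is hypothetical) that checks its stdin for a given IP:

```shell
# is_blocklisted: hypothetical helper that reads `ceph osd blocklist ls`
# output on stdin and succeeds if the given IP appears in an entry.
# Uses a fixed-string match so dots in the IP are not treated as regex wildcards.
is_blocklisted() {
  grep -qF "$1"
}

# Usage sketch (requires a live cluster, run from the toolbox):
#   ceph osd blocklist ls | is_blocklisted 10.0.0.5 && echo "still blocklisted"
```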
If the Ceph version is at least Pacific (v16.2.x), run:

```console
$ ceph osd blocklist rm <NODE_IP>
un-blocklisting
```

If the Ceph version is Octopus (v15.2.x) or older, run:

```console
$ ceph osd blacklist rm <NODE_IP>
un-blacklisting
```
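The rename from `blacklist` to `blocklist` in Pacific makes it awkward to script these steps across Ceph releases. As a minimal sketch, assuming you already know the cluster's major version number (the helper name `osd_block_cmd` is hypothetical, not part of Ceph or Rook):

```shell
# osd_block_cmd: hypothetical helper that picks the correct subcommand
# for this Ceph release. Pacific (v16) renamed `osd blacklist` to
# `osd blocklist`; older releases only understand the old name.
osd_block_cmd() {
  # $1: Ceph major version number, e.g. 16 for Pacific, 15 for Octopus
  if [ "$1" -ge 16 ]; then
    echo "ceph osd blocklist"
  else
    echo "ceph osd blacklist"
  fi
}

# Usage sketch (NODE_IP is a placeholder; run from the toolbox):
#   $(osd_block_cmd 16) add "$NODE_IP"
#   $(osd_block_cmd 16) rm "$NODE_IP"
```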