docs: add doc to recover a pod from a lost node
This commit adds a doc with the manual
steps to recover from the specific scenario
where `on node loss, the new pod can't mount
the same volume`.

Closes: #1507
Signed-off-by: subhamkrai <srai@redhat.com>
subhamkrai committed Sep 22, 2021
1 parent e8d540c commit 2bf5381
Showing 1 changed file with 49 additions and 0 deletions.
49 changes: 49 additions & 0 deletions Documentation/ceph-csi-troubleshooting.md
@@ -441,3 +441,52 @@ $ rbd ls --id=csi-rbd-node -m=10.111.136.166:6789 --key=AQDpIQhg+v83EhAAgLboWIbl
```

Where `-m` is one of the mon endpoints and the `--key` is the key used by the CSI driver for accessing the Ceph cluster.

## Node Loss

When a node is lost, the application pods that were running on it will be stuck in the `Terminating` state, while the replacement pods rescheduled onto another node remain in the `ContainerCreating` state because they cannot mount the same volume.
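
As an illustration, the symptom looks something like the following. The second pod name below is hypothetical and stands in for the rescheduled replica:

```console
$ kubectl -n rook-ceph get pod
NAME                      READY   STATUS              RESTARTS   AGE
my-app-69cd495f9b-nl6hf   1/1     Terminating         0          2d
my-app-69cd495f9b-x2x8z   0/1     ContainerCreating   0          5m
```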

To allow the application pod to start on another node, force delete the pod.

### Force deleting the pod

To force delete the pod stuck in the `Terminating` state:

```console
$ kubectl -n rook-ceph delete pod my-app-69cd495f9b-nl6hf --grace-period 0 --force
```

After the force delete, wait for a timeout of about 8-10 minutes. If the pod is still not in the `Running` state, continue with the next section to blacklist the node.
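
While waiting, you can inspect the events of the rescheduled pod to see why it is stuck; a minimal sketch, with a hypothetical pod name:

```console
# Look at the Events section for volume attach/mount failures
# that reference the lost node (pod name is hypothetical).
$ kubectl -n rook-ceph describe pod my-app-69cd495f9b-x2x8z
```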

### Blacklisting a node

To shorten the timeout, you can mark the node as "blacklisted" in Ceph so Rook can safely fail over the pod sooner. First, identify the RBD image backing the PVC used by the application pod:

```console
$ PVC_NAME=<pvc-name> # enter the PVC name (run these commands in the PVC's namespace)
$ PV_NAME=$(kubectl get pvc $PVC_NAME -o jsonpath='{.spec.volumeName}') # the PV bound to the PVC
$ IMAGE=$(kubectl get pv $PV_NAME -o jsonpath='{.spec.csi.volumeHandle}' | cut -d '-' -f 6- | awk '{print "csi-vol-"$1}')
$ echo $IMAGE
```
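
The `volumeHandle` of a Ceph-CSI provisioned PV ends with the volume's UUID, and the RBD image in the pool is named `csi-vol-<uuid>`; the `cut`/`awk` pipeline above simply extracts that suffix. A sketch with a hypothetical handle and UUID:

```console
# Hypothetical values, shown only to illustrate the transformation.
$ kubectl get pv $PV_NAME -o jsonpath='{.spec.csi.volumeHandle}'
0001-0009-rook-ceph-0000000000000001-8d0ba728-0e17-11eb-a680-0242ac110004
$ echo $IMAGE
csi-vol-8d0ba728-0e17-11eb-a680-0242ac110004
```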

The solution is to remove the stale watcher on the image by blacklisting its client. Run the following commands from the [Rook toolbox](ceph-toolbox.md):

```console
$ rbd status <image> --pool=<pool name> # use the image name from the previous output
Watchers:
    watcher=10.130.2.1:0/2076971174 client.14206 cookie=18446462598732840961
```

```console
$ ceph osd blacklist add 10.130.2.1:0 # the watcher address from the output above
blacklisting 10.130.2.1:0
```
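
Once the blacklist entry is in place, the stale watch from the lost node is dropped and the rescheduled pod can mount the volume. You can optionally confirm this from the toolbox; the output shown is illustrative:

```console
# The stale watch from the lost node should be gone.
$ rbd status <image> --pool=<pool name>
Watchers: none

# List the current blacklist entries.
$ ceph osd blacklist ls
```
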
### Removing a node blacklist

Within a few minutes of blacklisting the watcher, the pod should be running again. Once it is, remove the node from the blacklist:

```console
$ ceph osd blacklist rm 10.130.2.1:0
un-blacklisting 10.130.2.1:0
```
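
Finally, verify that the rescheduled application pod has reached the `Running` state. The pod name below is hypothetical, matching the earlier example:

```console
$ kubectl -n rook-ceph get pod my-app-69cd495f9b-x2x8z
NAME                      READY   STATUS    RESTARTS   AGE
my-app-69cd495f9b-x2x8z   1/1     Running   0          12m
```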
