docs: add doc to recover from pod from lost node
This commit adds a doc with the manual
steps to recover from a specific scenario,
such as `when a node is lost, the new pod
can't mount the same volume`.

Closes: #1507
Signed-off-by: subhamkrai <srai@redhat.com>
subhamkrai committed Oct 1, 2021
1 parent c084ee7 commit 29b1391
Showing 1 changed file with 52 additions and 0 deletions.
52 changes: 52 additions & 0 deletions Documentation/ceph-csi-troubleshooting.md
@@ -441,3 +441,55 @@ $ rbd ls --id=csi-rbd-node -m=10.111.136.166:6789 --key=AQDpIQhg+v83EhAAgLboWIbl
```

Where `-m` is one of the mon endpoints and the `--key` is the key used by the CSI driver for accessing the Ceph cluster.

## Node Loss

When a node is lost, you will see application pods on the node stuck in the `Terminating` state while another pod is rescheduled and is in the `ContainerCreating` state.
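
To see which pods are affected, the pods scheduled on the lost node can be listed with a field selector. The following is a minimal sketch; the helper name `pods_on_node` and the node name in the example are hypothetical and not part of Rook.

```shell
# Hypothetical helper: list the pods scheduled on a given node, in all namespaces.
# Argument: the Kubernetes node name.
pods_on_node() {
    kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Example (the node name is an assumption):
# pods_on_node worker-1
```

Pods from the lost node that show `Terminating` in the `STATUS` column are the candidates for the force delete.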

To allow the application pod to start on another node, force delete the pod.

### Force deleting the pod

To force delete the pod stuck in the `Terminating` state:

```console
$ kubectl -n rook-ceph delete pod my-app-69cd495f9b-nl6hf --grace-period 0 --force
```

After the force delete, wait for a timeout of about 8-10 minutes. If the pod is still not in the running state, continue with the next section to blacklist the node.
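
Instead of checking the pod by hand during the timeout, `kubectl wait` can block until the rescheduled pod is ready. This is a sketch under assumptions: the helper name `wait_for_app_pod` and the `app=my-app` label are hypothetical, and the 600s default simply mirrors the upper end of the timeout mentioned above.

```shell
# Hypothetical helper: wait until the pods matching a label selector are Ready.
# Arguments: namespace, label selector, timeout (defaults to 10 minutes).
wait_for_app_pod() {
    ns="$1"
    selector="$2"
    timeout="${3:-600s}"
    kubectl -n "$ns" wait --for=condition=Ready pod -l "$selector" --timeout="$timeout"
}

# Example (the namespace and label are assumptions):
# wait_for_app_pod rook-ceph app=my-app 600s
```

A non-zero exit status means the pod did not become ready within the timeout, which is the point at which blacklisting the node becomes relevant.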

### Blacklisting a node

To shorten the timeout, you can mark the node as "blacklisted" from the [Rook toolbox](ceph-toolbox.md) so Rook can safely fail over the pod sooner.

If the Ceph version is at least Pacific (v16.2.x), run the following command:

```console
$ ceph osd blocklist add <NODE_IP> # get the node IP you want to blocklist
blocklisting <NODE_IP>
```

If the Ceph version is Octopus (v15.2.x) or older, run the following command:

```console
$ ceph osd blacklist add <NODE_IP> # get the node IP you want to blacklist
blacklisting <NODE_IP>
```
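
The choice between `blocklist` and `blacklist` depends only on the Ceph major version, so it can be scripted. The sketch below is not part of Rook; the helper names are hypothetical, and it assumes `ceph version` prints the version number as its third field (for example `ceph version 16.2.6 (...) pacific (stable)`) and that `awk` and `cut` are available in the toolbox.

```shell
# Print the OSD subcommand name for a given Ceph major version:
# "blocklist" for Pacific (16) and newer, "blacklist" for Octopus (15) and older.
ceph_blocklist_verb() {
    if [ "$1" -ge 16 ]; then
        echo blocklist
    else
        echo blacklist
    fi
}

# Extract the major version from the `ceph version` output.
# Assumes the version number is the third whitespace-separated field.
ceph_major_version() {
    ceph version | awk '{ print $3 }' | cut -d. -f1
}

# Example, run from the toolbox (NODE_IP must be set by the operator):
# ceph osd "$(ceph_blocklist_verb "$(ceph_major_version)")" add "$NODE_IP"
```

The same verb can be reused later with `rm` in place of `add`.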

### Removing a node blacklist

After running the above command, the pod will be running within a few minutes. The node can then be removed from the blacklist.

If the Ceph version is at least Pacific (v16.2.x), run the following command:

```console
$ ceph osd blocklist rm <NODE_IP>
un-blocklisting <NODE_IP>
```

If the Ceph version is Octopus (v15.2.x) or older, run the following command:

```console
$ ceph osd blacklist rm <NODE_IP> # the node IP you want to remove from the blacklist
un-blacklisting <NODE_IP>
```
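
To confirm that an entry was added or removed, the current list can be inspected with `ceph osd blocklist ls` (or `ceph osd blacklist ls` on Octopus and older). The following sketch wraps that check; the helper name `is_blocklisted` is hypothetical, and it assumes each listed entry starts with the address in the form `<IP>:<port>/<nonce>`.

```shell
# Hypothetical helper: succeed if the given IP has an entry in the OSD list.
# Arguments: the subcommand name ("blocklist" or "blacklist") and the node IP.
is_blocklisted() {
    verb="$1"
    ip="$2"
    # Match only entries whose first field begins with "<ip>:" so that
    # 10.0.0.5 does not also match 10.0.0.50.
    ceph osd "$verb" ls 2>/dev/null |
        awk -v ip="$ip" 'index($1, ip ":") == 1 { found = 1 } END { exit !found }'
}

# Example, run from the toolbox (NODE_IP must be set by the operator):
# if is_blocklisted blocklist "$NODE_IP"; then echo "still listed"; fi
```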
