
docs: add doc to recover from pod from lost node #8742

Merged: 1 commit merged into rook:master on Oct 5, 2021

Conversation

subhamkrai (Contributor) commented:

This commit adds a doc with the manual steps to recover from a specific scenario: on node lost, the new pod can't mount the same volume.

Closes: #1507
Signed-off-by: subhamkrai <srai@redhat.com>

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #1507

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.


### Shorten the timeout

To shorten the timeout, you can mark the node as "blacklisted" so Rook can safely fail over the pod sooner.
Member:
In case the node is just offline and there is no watcher active, we need to just blacklist the whole node, rather than blacklist just a session id, right? Seems like we could simplify this section to just blacklist the node ip.
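As a sketch of what that simplification could look like (the node IP is a placeholder, and the command assumes a Ceph release that still uses the `blacklist` name; see the `blocklist` note further down), run from the [Rook toolbox](ceph-toolbox.md):

```console
# Blacklist the lost node's IP so Ceph evicts all of its RBD watchers
# and the replacement pod can acquire the volume sooner.
$ ceph osd blacklist add <lost-node-ip>:0
blacklisting <lost-node-ip>:0
```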

Member:

Response? I don't understand why we would want to blacklist only a session id instead of always blocklisting the whole node. The point is also to prevent a node from coming back online and creating a new session, right?

Contributor Author:
@Madhu-1 can you help here? Thanks

Member:
We need to blacklist the IP as we want to block all sessions of that node.

Member:
Ok, so if we want to blacklist all sessions, we only need the node ip, right? And no need to get the PV session IDs?

Member:
Yes, we just need the node IP to blacklist it.
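For example, a hedged sketch of grabbing that IP from the Kubernetes node object (the node name is a placeholder, and this assumes the node's InternalIP is the address the Ceph cluster sees):

```console
# Look up the InternalIP of the lost node; this is the address to blacklist.
$ NODE_IP=$(kubectl get node <lost-node-name> -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
$ echo $NODE_IP
```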

@leseb (Member) left a comment:
Start the sentence with a capital letter in your commit message.


Comment on lines 466 to 490
```console
$ PV_NAME= # enter pv name
$ IMAGE=$(kubectl get pv $PV_NAME -o jsonpath='{.spec.csi.volumeHandle}' | cut -d '-' -f 6- | awk '{print "csi-vol-"$1}')
$ echo $IMAGE
```

The solution is to remove the watcher, following the commands below from the [Rook toolbox](ceph-toolbox.md):

```console
$ rbd status <image> --pool=<pool name> # get image from above output
```
>```
> Watchers:
> watcher=10.130.2.1:0/2076971174 client.14206 cookie=18446462598732840961
>```

```console
$ ceph osd blacklist add 10.130.2.1:0 # to know which watcher to block see above output
blacklisting 10.130.2.1:0
```
Contributor Author:
@travisn before I make changes, just to confirm: I'll remove the above part and will just ask the user to get the node IP (the node which is lost) and blacklist that.

And if the Ceph version is > Octopus, we'll use `ceph osd blacklist`, else `ceph osd blocklist`.

Member:
If the Ceph version is Pacific or above, use `blocklist`; otherwise use `blacklist`.
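In console form, the version split would look roughly like this (the node IP is a placeholder):

```console
# Ceph Pacific (v16) and newer: the subcommand was renamed to "blocklist".
$ ceph osd blocklist add <lost-node-ip>:0

# Ceph Octopus (v15) and older: the original "blacklist" name.
$ ceph osd blacklist add <lost-node-ip>:0
```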

Contributor Author:
Right, thanks.

Member:
Correct, we're just blocking the node, rather than a session id.

mergify bot commented Oct 1, 2021

This pull request has merge conflicts that must be resolved before it can be merged. @subhamkrai please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork
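A typical way to resolve that, assuming the fork has an `upstream` remote pointing at rook/rook and the PR branch tracks `master`, is roughly:

```console
# Rebase the PR branch on the latest upstream master, then update the fork.
$ git fetch upstream
$ git rebase upstream/master
$ git push --force-with-lease origin <pr-branch>
```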

@subhamkrai (Contributor Author):

@travisn ^^^

This commit adds the doc which has the manual steps to recover from the specific scenario like `on the node lost, the new pod can't mount the same volume`.

Closes: rook#1507
Signed-off-by: subhamkrai <srai@redhat.com>
travisn merged commit 086649c into rook:master on Oct 5, 2021
mergify bot added a commit that referenced this pull request Oct 5, 2021
docs: add doc to recover from pod from lost node (backport #8742)
Successfully merging this pull request may close these issues:

On NodeLost, the new pod can't mount the same volume.