docs: add document for failover and failback #8411

Merged
merged 4 commits into from Oct 26, 2021

@Yuggupta27 Yuggupta27 commented Jul 28, 2021

add document to track the steps for failover
and failback in case of Async DR; for Planned
Migration and Disaster Recovery use case.

Signed-off-by: Yug Gupta ygupta@redhat.com

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #7034

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@@ -0,0 +1,460 @@
# Failover and Failback in RBD Async Disaster Recovery

Member

All the documentation files need a header such as this. The title is for the table of contents (and should be short), and the weight determines where the page shows up in the table of contents. What if we add it after the snapshots topic and before the volume cloning topic? In that case, we could set this to 3240, change snapshots to 3230, and leave volume cloning at 3250.

---
title: Async DR Failover
weight: 3240
indent: true
---

kubectl get pvc rbd-pvc -oyaml >pvc-backup.yaml
```

* Take a backup of the PV, corresponding to the PVC

Member

Why do you need to take a backup of the PV?

Contributor Author

Since we create a static binding between the PVCs and the PVs on the secondary cluster, new PVs are not created there. The restored PV is then bound to the restored PVC.

@@ -0,0 +1,460 @@
# Failover and Failback in RBD Async Disaster Recovery

[RBD mirroring](https://docs.ceph.com/en/latest/rbd/rbd-mirroring/)

Member

RBD Mirroring is an extremely complicated feature and books could be written on it. So the goal of our docs in Rook is to make it as concise and clear as possible for users to follow, and provide links to more details if they want to understand more deeply. How we present the docs will really make a difference for whether the feature will be used, so it is important to spend time organizing and clarifying the docs. Since so much time is invested in implementing the feature, we need to invest time in the docs by multiple members of the team, and testing the instructions to make sure everything works end-to-end.

Some more specific suggestions:

  • We really need to break this down into multiple docs; it is just too complicated for a single doc. One doc would be an overview, then perhaps separate docs for initial configuration of the clusters and configuration of pools, then a doc for the steps to failover/failback. We may want to make the overview page a top-level topic (similar to Ceph Tools) with multiple subtopics.
  • The docs need an overview section at the top that explains what the user will accomplish in that doc.
  • Longer docs could use a table of contents with links to each section in the doc, see this example

At the end of the day, a non-rbd-replication-expert (including me) should be able to follow the docs and get everything working. I do best with simple features, so let's see how to simplify as much as possible. :)

Contributor Author

Sure, breaking it down into multiple docs, each covering a specific aspect, seems good to me. We can then afford to be more descriptive in places, since each doc focuses on one part.

  1. Initial doc can shed some light on the Overview.
  2. Second doc can cover the Disaster recovery description and how we can set up DR with rook.
  3. Lastly, in the third doc, we can cover all the failover and failback steps for Planned migration and DR use cases.

Thoughts?

Member

Yes sounds good. For 2 and 3 we could also consider multiple docs for each one if needed, but maybe not. Let's see how it looks with just the 3 docs to start.

Contributor Author

Thanks for the suggestion. Now we have 3 separate docs for the same. One added benefit of the overview doc is that, once we have a doc for failover and failback steps for Metro DR, we can link it in the Overview doc as well :)

Member

@leseb leseb left a comment

wrong indent

spec:
# the number of rbd-mirror daemons to deploy
count: 1
peers:

Contributor

We want users to use cephblockpool to add peer secrets.

Member

+1, we changed that recently. I don't recommend using the CephRBDMirror resource for this; please use CephBlockPool.

Contributor Author

Thanks for pointing it out. Updated the doc with the required changes!
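
For readers following this thread, here is a minimal sketch of what declaring peers on the pool (rather than on the CephRBDMirror resource) might look like. It is an illustration under assumptions, not the final content of the merged doc: the pool name `mirroredpool` and the namespace `rook-ceph` are taken from the snippets quoted in this conversation, the `peers.secretNames` field reflects the CephBlockPool CRD as I understand it and should be checked against the Rook version in use, and the secret name `cluster-2-peer-secret` is a hypothetical placeholder.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: mirroredpool        # pool name as used in the kubectl command quoted in this thread
  namespace: rook-ceph
spec:
  replicated:
    size: 3                 # replica count suggested in review for the example
  mirroring:
    enabled: true
    mode: image
    peers:
      secretNames:
        # Hypothetical name: a secret holding the bootstrap peer token
        # imported from the other cluster.
        - cluster-2-peer-secret
```

The bootstrap secret name on the peer cluster can be read from the pool status, as in the `rbdMirrorBootstrapPeerSecretName` command quoted later in this conversation.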

Contributor

@sp98 sp98 left a comment

Cluster names should be consistent throughout the doc. Some places use cluster-a and others use cluster-1; the same goes for cluster-b and cluster-2. I mentioned a couple of such instances below, but there may be more.

enabled: true
mode: image
# schedule(s) of snapshot
snapshotSchedules:

Member

The VolumeReplicationClass already passes the snapshot schedule; do we need it here as well?

Member

@Yuggupta27 Can this be removed as @leseb suggested?

Contributor Author

Thanks for catching it! updated now.
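
To illustrate the point raised above, that the schedule travels with the VolumeReplicationClass rather than the pool, here is a hedged sketch of such a class. The parameter names (`mirroringMode`, `schedulingInterval`, `schedulingStartTime`) and the `apiVersion` are given as I understand them from the volume replication operator and should be verified against the installed CRD; the object name and the interval and start-time values are illustrative assumptions, not values from this PR.

```yaml
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass   # hypothetical name
spec:
  provisioner: rook-ceph.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot
    # Illustrative values: how often mirror snapshots are taken and when the schedule starts.
    schedulingInterval: "12m"
    schedulingStartTime: "16:18:43"
    # Secret used by the CSI driver for replication operations (assumed default Rook names).
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: rook-ceph
```

With the schedule set here, the `snapshotSchedules` block on the pool would be redundant, which is what the review comment is pointing at.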

kubectl get cephblockpool.ceph.rook.io/mirroredpool -n rook-ceph --context=cluster-2 -ojsonpath='{.status.info.rbdMirrorBootstrapPeerSecretName}'
```

> ```bash

Member

Oh did you resolve this comment because it's not an obvious response? I can see keeping it, just wanted to confirm since you didn't comment before resolving.

from `secondary` to `primary` in primary site.
* Scale up the applications again on the primary site.

## Appendix

Member

Ok let's keep it in the appendix

### Backup & Restore

Here, we take a backup of PVC and PV object on one site, so that they can be restored later to the peer cluster.

Member

Wait, why are we backing up and restoring the PVC and PV? This doesn't back up the content of the PV, just the PVC and PV specs, right? Why do these need to be backed up? This seems independent of the DR scenario.

Contributor Author

Although we are mirroring the backup image, the Kubernetes objects (PVCs and PVs) still need to be recreated by the admin (see this). Users can take this backup with a tool such as Velero; otherwise, they can take it manually with the steps mentioned in the document.

Member

Is this to simplify creating the PVCs and PVs in the second cluster? If so, this seems like more of a step to run before you ever need to failover or fail back, right? If you wait to restore them until your cluster fails, you will be able to create the PVC/PVs in the other cluster, but the data is unrecoverable at that point.

Contributor Author

Thanks for the suggestion! I have moved the Backup and Restore section from the appendix of the async DR doc into the steps for setting up RBD mirroring, so that the user knows to take the backup up front.
I also added notes in multiple places to highlight this.

The document tracks the steps which are required
to set-up rbd mirroring on clusters.

Signed-off-by: Yug Gupta <yuggupta27@gmail.com>

Member

@travisn travisn left a comment

Just a couple final nits, thanks!

namespace: rook-ceph
spec:
replicated:
size: 1

Member

Is this size 1 just for testing? For the example, how about 3?

Suggested change
- size: 1
+ size: 3

Contributor Author

Updated 👍

Once the failed cluster is recovered on the primary site and you want to failback
from secondary site, follow the below steps:

* Scale down the running applications(if any) on the primary site.

Member

Looks like these are mostly fixed except this instance :)

@travisn travisn requested review from leseb and sp98 and removed request for leseb and sp98 October 20, 2021 17:28
add a document to track the steps for failover
and failback in case of Async DR; for Planned
Migration and Disaster Recovery use case.

Signed-off-by: Yug Gupta <yuggupta27@gmail.com>

Member

@leseb leseb left a comment

One small nit and we should be good to go.

Add a new yaml for creating pools that
have mirroring enabled.

Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
Add a new yaml for creating volume replicationclass
and volume replication cr.

Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
@Yuggupta27
Contributor Author

> One small nit and we should be good to go.

Thanks for the suggestions! Updated 👍

@Yuggupta27 Yuggupta27 requested a review from leseb October 25, 2021 14:08
@travisn travisn merged commit a861bf0 into rook:master Oct 26, 2021
leseb added a commit that referenced this pull request Oct 27, 2021
docs: add document for failover and failback (backport #8411)
Successfully merging this pull request may close these issues.

Add Documentation for RBD Async Disaster Recover
6 participants