docs: add document for failover and failback #8411
Conversation
Force-pushed `8c62f04` to `9e11674`
@@ -0,0 +1,460 @@
# Failover and Failback in RBD Async Disaster Recovery
All the documentation files need a header such as this. The title is for the table of contents (and should be short), and the weight determines where it shows up in the table of contents. What if we add it after the snapshots topic and before the volume cloning topic? In that case, we could set this to 3240, change snapshots to 3230, and leave volume cloning at 3250.
```yaml
---
title: Async DR Failover
weight: 3240
indent: true
---
```
kubectl get pvc rbd-pvc -oyaml > pvc-backup.yaml
```

* Take a backup of the PV corresponding to the PVC
Why do you need to take a backup of the PV?
Since we create a static binding between the PVCs and the PVs on the secondary cluster, new PVs are not created there. The restored PV is then bound to the restored PVC.
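For context, the static binding described above is typically done by restoring a PV whose `claimRef` points at the restored PVC. A rough sketch of such a PV; the names, capacity, and volume handle here are illustrative assumptions, not values from the doc:

```yaml
# Hypothetical sketch of a restored PV statically bound to the restored PVC.
# All names and values below are illustrative.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0a1b2c3d          # must match spec.volumeName in the restored PVC
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: rook-ceph-block
  claimRef:                    # static binding to the restored PVC
    name: rbd-pvc
    namespace: default
  csi:
    driver: rook-ceph.rbd.csi.ceph.com
    volumeHandle: 0001-0009-rook-ceph-0000000000000001   # same RBD image handle as on the primary
```

Because the `claimRef` is pre-populated, the provisioner does not create a new volume; the restored PVC binds directly to this PV.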
@@ -0,0 +1,460 @@
# Failover and Failback in RBD Async Disaster Recovery

[RBD mirroring](https://docs.ceph.com/en/latest/rbd/rbd-mirroring/)
RBD Mirroring is an extremely complicated feature and books could be written on it. So the goal of our docs in Rook is to make it as concise and clear as possible for users to follow, and provide links to more details if they want to understand more deeply. How we present the docs will really make a difference for whether the feature will be used, so it is important to spend time organizing and clarifying the docs. Since so much time is invested in implementing the feature, we need to invest time in the docs by multiple members of the team, and testing the instructions to make sure everything works end-to-end.
Some more specific suggestions:
- We really need to break this down into multiple docs; it is just too complicated for a single doc. One doc would be an overview, then perhaps separate docs for initial configuration of the clusters and configuration of pools, then a doc for the steps to failover/failback. We may want to make the overview page a top-level topic (similar to Ceph Tools) with multiple subtopics.
- The docs need an overview section at the top that explains what the user will accomplish in that doc
- Longer docs could use a table of contents with links to each section in the doc, see this example
At the end of the day, a non-rbd-replication-expert (including me) should be able to follow the docs and get everything working. I do best with simple features, so let's see how to simplify as much as possible. :)
Sure, breaking it down into multiple docs, with each covering certain aspects, seems good to me. We can then take the liberty of being more descriptive in places, since each doc is focused on one part.
- Initial doc can shed some light on the Overview.
- Second doc can cover the Disaster recovery description and how we can set up DR with rook.
- Lastly, in the third doc, we can cover all the failover and failback steps for Planned migration and DR use cases.
Thoughts?
Yes sounds good. For 2 and 3 we could also consider multiple docs for each one if needed, but maybe not. Let's see how it looks with just the 3 docs to start.
Thanks for the suggestion. Now we have 3 separate docs for this. One added benefit of the overview doc is that once we have a doc for the failover and failback steps for Metro DR, we can link it from the overview doc as well :)
Force-pushed `9e11674` to `0b69e9a`
Force-pushed `0b69e9a` to `fb5c7eb`
wrong indent
Force-pushed `fb5c7eb` to `e1b1d48`
Documentation/rbd-mirroring.md
Outdated
spec:
  # the number of rbd-mirror daemons to deploy
  count: 1
  peers:
We want users to use `CephBlockPool` to add peer secrets.
+1, we have recently changed that; I don't recommend using the `CephRBDMirror` resource, please use `CephBlockPool`.
Thanks for pointing it out. Updated the doc with the required changes!
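For context, adding the peer secret on the `CephBlockPool` (rather than on `CephRBDMirror`) might look roughly like this; the pool and secret names are illustrative assumptions:

```yaml
# Hedged sketch: mirroring peer configured on the CephBlockPool.
# Pool and secret names are illustrative, not from the PR.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: mirroredpool
  namespace: rook-ceph
spec:
  replicated:
    size: 3
  mirroring:
    enabled: true
    mode: image
    peers:
      # bootstrap peer token secret imported from the other cluster
      secretNames:
        - pool-peer-token-mirroredpool
```

With this in place, the `CephRBDMirror` resource only needs to specify the daemon `count`, and the peer wiring lives with the pool it mirrors.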
Cluster names should be consistent throughout the doc. Some places use `cluster-a` and some use `cluster-1`; same with `cluster-b` and `cluster-2`. Mentioned a couple of such instances below, but there can be more.
Force-pushed `e1b1d48` to `cbd981d`
enabled: true
mode: image
# schedule(s) of snapshot
snapshotSchedules:
The `VolumeReplicationClass` passes the snap sched already, do we need that?
@Yuggupta27 Can this be removed as @leseb suggested?
Thanks for catching it! Updated now.
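For context, the schedule the reviewers refer to is passed through the `VolumeReplicationClass` parameters. A rough sketch under the csi-addons `v1alpha1` API; the class name and interval are illustrative assumptions:

```yaml
# Hedged sketch: snapshot scheduling carried by the VolumeReplicationClass,
# making snapshotSchedules on the pool redundant. Names/values are illustrative.
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass
spec:
  provisioner: rook-ceph.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot
    # snapshot schedule passed here instead of in the pool's snapshotSchedules
    schedulingInterval: "12m"
```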
Force-pushed `cbd981d` to `49cf506`
Force-pushed `0461221` to `0d7532f`
Force-pushed `0d7532f` to `70c917d`
Documentation/rbd-mirroring.md
Outdated
kubectl get cephblockpool.ceph.rook.io/mirroredpool -n rook-ceph --context=cluster-2 -ojsonpath='{.status.info.rbdMirrorBootstrapPeerSecretName}'
```

> ```bash
Oh did you resolve this comment because it's not an obvious response? I can see keeping it, just wanted to confirm since you didn't comment before resolving.
from `secondary` to `primary` on the primary site.
* Scale up the applications again on the primary site.

## Appendix
Ok let's keep it in the appendix
### Backup & Restore

Here, we take a backup of the PVC and PV objects on one site so that they can be restored later on the peer cluster.
Wait, why are we backing up and restoring the PVC and PV? It doesn't backup the content of the PV, it's just the PVC and PV specs, right? Why do these need to be backed up? This seems independent of the DR scenario.
Although we are mirroring the backing image, the Kubernetes objects (PVCs and PVs) still need to be recreated by the admin (see this). For this, users can take a backup via tools like Velero; if not, they can manually take a backup with the steps mentioned in the document.
Is this to simplify creating the PVCs and PVs in the second cluster? If so, this seems like more of a step to run before you ever need to fail over or fail back, right? If you wait to restore them until your cluster fails, you will be able to create the PVC/PVs in the other cluster, but the data is unrecoverable at that point.
Thanks for the suggestion! I have now moved the `Backup and Restore` section from the appendix of the async DR doc into the steps for setting up RBD mirroring, so that the user knows to take the backup initially as one of the steps. Also added notes at multiple places to highlight this.
The document tracks the steps which are required to set-up rbd mirroring on clusters. Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
Force-pushed `70c917d` to `22071af`
Just a couple final nits, thanks!
namespace: rook-ceph
spec:
  replicated:
    size: 1
Is this size 1 just for testing? For the example, how about 3?

- size: 1
+ size: 3
Updated 👍
Once the failed cluster is recovered on the primary site and you want to fail back
from the secondary site, follow the steps below:

* Scale down the running applications (if any) on the primary site.
Looks like these are mostly fixed except this instance :)
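For context, the `secondary` to `primary` switch discussed in the failback steps is made on the `VolumeReplication` CR. A hedged sketch under the csi-addons `v1alpha1` API; the resource names are illustrative assumptions:

```yaml
# Hedged sketch: promoting a volume back to primary during failback.
# Names are illustrative, not from the PR.
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: rbd-pvc-replication
  namespace: default
spec:
  volumeReplicationClass: rbd-volumereplicationclass
  # change this from "secondary" to "primary" on the primary site during failback
  replicationState: primary
  dataSource:
    kind: PersistentVolumeClaim
    name: rbd-pvc
```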
add a document to track the steps for failover and failback in case of Async DR; for Planned Migration and Disaster Recovery use case. Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
Force-pushed `22071af` to `78d5a0c`
One small nit and we should be good to go.
Add a new yaml for creating pools that have mirroring enabled. Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
Add a new yaml for creating volume replicationclass and volume replication cr. Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
Force-pushed `78d5a0c` to `0f62a7c`
Thanks for the suggestions! Updated 👍
docs: add document for failover and failback (backport #8411)
add document to track the steps for failover
and failback in case of Async DR; for Planned
Migration and Disaster Recovery use case.
Signed-off-by: Yug Gupta <ygupta@redhat.com>
Description of your changes:
Which issue is resolved by this Pull Request:
Resolves #7034
Checklist:
`make codegen` has been run to update object specifications, if necessary.