Commit fb5c7eb

docs: add documents for failover and failback

Add documents to track the steps for failover and failback in case of Async DR, for the Planned Migration and Disaster Recovery use cases.

Signed-off-by: Yug Gupta <ygupta@redhat.com>

1 parent 6dcc601 · commit fb5c7eb

Showing 4 changed files with 618 additions and 0 deletions.
---
title: Async Disaster Recovery Failover and Failback
weight: 3245
indent: true
---

# RBD Async Disaster Recovery Failover and Failback

## Table of Contents <!-- omit in toc -->

* [Planned Migration](#planned-migration)
  * [Failover](#failover)
  * [Failback](#failback)
* [Disaster Recovery](#disaster-recovery)
  * [Failover](#failover-abrupt-shutdown)
  * [Failback](#failback-post-disaster-recovery)
* [Appendix](#appendix)
  * [Creating a VolumeReplicationClass CR](#create-a-volume-replication-class-cr)
  * [Creating a VolumeReplication CR](#create-a-volumereplication-cr)
  * [Checking VolumeReplication CR status](#checking-replication-status)
  * [Backup and Restore](#backup--restore)

## Planned Migration

> Use cases: Datacenter maintenance, technology refresh, disaster avoidance, etc.

### Failover

The failover operation is the process of switching production to a
backup facility (normally your recovery site). During a failover,
access to the image on the primary site should be stopped.
The image should then be made *primary* on the secondary cluster so that
access can be resumed there.

> :memo: As mentioned in the pre-requisites, a periodic or one-time backup of
> the application should be available for restore on the secondary site (cluster-b).

Follow the steps below for a planned migration of the workload from the primary
cluster to the secondary cluster:

* Scale down all the application pods which are using the
  mirrored PVC on the primary cluster.
* [Take a backup](async-disaster-recovery.md#backup--restore) of the PVC and PV objects from the primary cluster.
  This can be done using a backup tool such as
  [velero](https://velero.io/docs/main/).
* [Update the VolumeReplication CR](async-disaster-recovery.md#create-a-volumereplication-cr) to set `replicationState` to `secondary` on the primary site.
  When the operator sees this change, it passes the information down to the
  driver via a GRPC request to mark the dataSource as `secondary`.
* If you are manually recreating the PVC and PV on the secondary cluster,
  remove the `claimRef` section in the PV objects. (See [this](async-disaster-recovery.md#restore-the-backup-on-cluster-2) for details.)
* Recreate the storageclass, PVC, and PV objects on the secondary site.
* As you are creating a static binding between the PVC and PV, a new PV won't
  be created here; the PVC will get bound to the existing PV.
* [Create the VolumeReplicationClass](async-disaster-recovery.md#create-a-volume-replication-class-cr) on the secondary site.
* [Create VolumeReplication CRs](async-disaster-recovery.md#create-a-volumereplication-cr) for all the PVCs for which mirroring
  is enabled.
  * `replicationState` should be `primary` for all the PVCs on
    the secondary site.
* [Check the VolumeReplication CR status](async-disaster-recovery.md#checking-replication-status) to verify that the image is marked `primary` on the secondary site.
* Once the image is marked `primary`, the PVC is ready
  to be used. Now, we can scale up the applications to use the PVC.
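The scale-down and demote steps on the primary site can be sketched as follows (the Deployment name `myapp` is illustrative; the VolumeReplication CR name matches the Appendix example):

```shell
# Stop the workload that uses the mirrored PVC on the primary cluster.
kubectl --context=cluster-1 scale deployment myapp --replicas=0

# Demote the image: set replicationState to "secondary" on the primary site.
kubectl --context=cluster-1 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"secondary"}}'
```

Both commands require access to a live cluster via the `cluster-1` context.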

> :memo: **WARNING**: In the Async Disaster Recovery use case, we don't get
> the complete data.
> We only get crash-consistent data, as of the last snapshot interval.

### Failback

A failback operation is the process of returning production to its
original location after a disaster or a scheduled maintenance period.
For a migration during steady-state operation, a failback uses the
same process as failover, just with the clusters switched.

> :memo: **Remember**: We can skip the backup and restore operations
> during failback if the required yamls are already present on
> the primary cluster. Any new PVCs will still need to be restored on the
> primary site.

## Disaster Recovery

> Use cases: Natural disasters, power failures, system failures and crashes, etc.

### Failover (abrupt shutdown)

In case of disaster recovery, create the VolumeReplication CR at the secondary site.
Since the connection to the primary site is lost, the operator automatically
sends a GRPC request down to the driver to forcefully mark the dataSource as `primary`
on the secondary site.

* If you are manually creating the PVC and PV on the secondary cluster, remove
  the `claimRef` section in the PV objects.
* Create the storageclass, PVC, and PV objects on the secondary site.
* As you are creating a static binding between the PVC and PV, a new PV won't be
  created here; the PVC will get bound to the existing PV.
* [Create the VolumeReplicationClass](async-disaster-recovery.md#create-a-volume-replication-class-cr) and [VolumeReplication CR](async-disaster-recovery.md#create-a-volumereplication-cr) on the secondary site.
* [Check the VolumeReplication CR status](async-disaster-recovery.md#checking-replication-status) to verify that the image is marked `primary` on the secondary site.
* Once the image is marked `primary`, the PVC is ready to be used. Now,
  we can scale up the applications to use the PVC.
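The promotion can be confirmed by reading the CR's reported state with a `jsonpath` query (CR name as in the Appendix example):

```shell
# Prints "Primary" once the image has been force-promoted on the secondary site.
kubectl --context=cluster-2 get volumereplication pvc-volumereplication \
  -o jsonpath='{.status.state}'
```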

### Failback (post-disaster recovery)

Once the failed cluster is recovered on the primary site and you want to fail back
from the secondary site, follow the steps below:

* Scale down the running applications (if any) on the primary site.
  Ensure that all persistent volumes in use by the workload are no
  longer in use on the primary cluster.
* [Update the VolumeReplication CR](async-disaster-recovery.md#create-a-volumereplication-cr) `replicationState`
  from `primary` to `secondary` on the primary site.
* Scale down the applications on the secondary site.
* [Update the VolumeReplication CR](async-disaster-recovery.md#create-a-volumereplication-cr) `replicationState`
  from `primary` to `secondary` on the secondary site.
* On the primary site, [verify that the VolumeReplication status](async-disaster-recovery.md#checking-replication-status) reports the
  volume as ready to use.
* Once the volume is marked ready to use, change the `replicationState`
  from `secondary` to `primary` on the primary site.
* Scale up the applications again on the primary site.
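Assuming the VolumeReplication CR name from the Appendix, the state transitions above can be sketched as:

```shell
# Demote on both sites first (primary site, then secondary site).
kubectl --context=cluster-1 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"secondary"}}'
kubectl --context=cluster-2 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"secondary"}}'

# After cluster-1 reports the volume ready to use, promote it there.
kubectl --context=cluster-1 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"primary"}}'
```

These commands require both clusters to be reachable, which is the expected situation during failback.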

## Appendix

The guide below assumes that we have a PVC (`rbd-pvc`) in `Bound` state, created using
a *StorageClass* with the `Retain` reclaimPolicy.

```bash
kubectl get pvc --context=cluster-1
```

> ```bash
> NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
> rbd-pvc   Bound    pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec   1Gi        RWO            csi-rbd-sc     44s
> ```

### Create a Volume Replication Class CR

In this case, we create a VolumeReplicationClass on cluster-1.

```yaml
cat <<EOF | kubectl --context=cluster-1 apply -f -
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass
spec:
  provisioner: rook-ceph.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot
    schedulingInterval: "12m"
    schedulingStartTime: "16:18:43"
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: rook-ceph
EOF
```
>:bulb: **Note:** The `schedulingInterval` can be specified in formats of | ||
> minutes, hours or days using suffix `m`,`h` and `d` respectively. | ||
> The optional schedulingStartTime can be specified using the ISO 8601 | ||
> time format. | ||
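For example, a daily schedule would use the `d` suffix (only the `parameters` fragment of the VolumeReplicationClass above is shown; the values are illustrative):

```yaml
  parameters:
    mirroringMode: snapshot
    schedulingInterval: "1d"        # take a mirror snapshot once a day
    schedulingStartTime: "00:30:00" # optional ISO 8601 start time
```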

### Create a VolumeReplication CR

Once the VolumeReplicationClass is created, create a VolumeReplication CR for
the PVC which we intend to replicate to the secondary cluster.

```yaml
cat <<EOF | kubectl --context=cluster-1 apply -f -
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: pvc-volumereplication
spec:
  volumeReplicationClass: rbd-volumereplicationclass
  replicationState: primary
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: rbd-pvc # Name of the PVC on which mirroring is to be enabled.
EOF
```

> :memo: *VolumeReplication* is a namespace-scoped object. Thus,
> it should be created in the same namespace as that of the PVC.

### Checking Replication Status

`replicationState` is the replication state of the volume being referenced.
Possible values are `primary`, `secondary`, and `resync`:

* `primary` denotes that the volume is the primary.
* `secondary` denotes that the volume is a secondary.
* `resync` denotes that the volume needs to be resynced.
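Each of these states is requested the same way, by updating the CR's `spec`. For instance, a sketch of requesting `resync` for a demoted volume (CR name from this Appendix; the exact recovery workflow depends on the replication operator version):

```shell
# Ask the operator to resynchronize a demoted (secondary) volume.
kubectl --context=cluster-1 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"resync"}}'
```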

To check the VolumeReplication CR status:

```bash
kubectl get volumereplication pvc-volumereplication --context=cluster-1 -oyaml
```

> ```yaml
> ...
> spec:
>   dataSource:
>     apiGroup: ""
>     kind: PersistentVolumeClaim
>     name: rbd-pvc
>   replicationState: primary
>   volumeReplicationClass: rbd-volumereplicationclass
> status:
>   conditions:
>   - lastTransitionTime: "2021-05-04T07:39:00Z"
>     message: ""
>     observedGeneration: 1
>     reason: Promoted
>     status: "True"
>     type: Completed
>   - lastTransitionTime: "2021-05-04T07:39:00Z"
>     message: ""
>     observedGeneration: 1
>     reason: Healthy
>     status: "False"
>     type: Degraded
>   - lastTransitionTime: "2021-05-04T07:39:00Z"
>     message: ""
>     observedGeneration: 1
>     reason: NotResyncing
>     status: "False"
>     type: Resyncing
>   lastCompletionTime: "2021-05-04T07:39:00Z"
>   lastStartTime: "2021-05-04T07:38:59Z"
>   message: volume is marked primary
>   observedGeneration: 1
>   state: Primary
> ```

### Backup & Restore

Here, we take a backup of the PVC and PV objects on one site, so that they can be restored later on the peer cluster.

#### **Take backup on cluster-1**

* Take a backup of the PVC `rbd-pvc`:

  ```bash
  kubectl --context=cluster-1 get pvc rbd-pvc -oyaml > pvc-backup.yaml
  ```

* Take a backup of the PV corresponding to the PVC:

  ```bash
  kubectl --context=cluster-1 get pv/pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec -oyaml > pv-backup.yaml
  ```

> :bulb: We can also take a backup using external tools like **Velero**.
> See the [velero documentation](https://velero.io/docs/main/) for more information.

#### **Restore the backup on cluster-2**

* Create the storageclass on the secondary cluster:

  ```bash
  kubectl create -f examples/rbd/storageclass.yaml --context=cluster-2
  ```

  > ```bash
  > storageclass.storage.k8s.io/csi-rbd-sc created
  > ```
* Create the VolumeReplicationClass on the secondary cluster:

  ```bash
  cat <<EOF | kubectl --context=cluster-2 apply -f -
  apiVersion: replication.storage.openshift.io/v1alpha1
  kind: VolumeReplicationClass
  metadata:
    name: rbd-volumereplicationclass
  spec:
    provisioner: rook-ceph.rbd.csi.ceph.com
    parameters:
      mirroringMode: snapshot
      replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
      replication.storage.openshift.io/replication-secret-namespace: rook-ceph
  EOF
  ```

  > ```bash
  > volumereplicationclass.replication.storage.openshift.io/rbd-volumereplicationclass created
  > ```
* If Persistent Volumes and Claims are created manually on the secondary cluster,
  remove the `claimRef` section from the backed-up PV objects in the yaml files, so that the
  PV can get bound to the new claim on the secondary cluster.

  ```yaml
  ...
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 1Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: rbd-pvc
      namespace: default
      resourceVersion: "64252"
      uid: 65dc0aac-5e15-4474-90f4-7a3532c621ec
    csi:
    ...
  ```
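Stripping `claimRef` from the backup file can also be scripted, for example with `yq` v4 (an external tool, not part of this guide's prerequisites):

```shell
# Delete spec.claimRef in place from the backed-up PV manifest.
yq eval -i 'del(.spec.claimRef)' pv-backup.yaml
```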
* Apply the Persistent Volume backup from the primary cluster:

  ```bash
  kubectl create -f pv-backup.yaml --context=cluster-2
  ```

  > ```bash
  > persistentvolume/pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec created
  > ```

* Apply the Persistent Volume Claim from the restored backup:

  ```bash
  kubectl create -f pvc-backup.yaml --context=cluster-2
  ```

  > ```bash
  > persistentvolumeclaim/rbd-pvc created
  > ```

```bash
kubectl get pvc --context=cluster-2
```

> ```bash
> NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
> rbd-pvc   Bound    pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec   1Gi        RWO            csi-rbd-sc     44s
> ```
---
title: Disaster Recovery Overview
weight: 3240
---

# Planned Migration and Disaster Recovery with Rook

Rook v1.6.0 comes with new volume replication support and Ceph-CSI v3.3.0, which allow users to perform disaster recovery and planned migration of clusters.

The following documents will help to configure the clusters, as well as walk through the failover and failback procedure for the Disaster Recovery and Planned Migration use cases:

* [Configuring clusters with DR](rbd-mirroring.md): Setting up RBD mirroring between a primary and a secondary cluster
* [Async DR Failover and Failback Steps](async-disaster-recovery.md): Step-by-step failover and failback for the Planned Migration and Disaster Recovery use cases