Commit fb5c7eb

docs: add documents for failover and failback

Add documents to track the steps for failover and failback in case of Async DR, for the Planned Migration and Disaster Recovery use cases.

Signed-off-by: Yug Gupta <ygupta@redhat.com>

1 parent 6dcc601 · commit fb5c7eb

Showing 4 changed files with 618 additions and 0 deletions.
---
title: Async Disaster Recovery Failover and Failback
weight: 3245
indent: true
---

# RBD Async Disaster Recovery Failover and Failback

## Table of Contents <!-- omit in toc -->

* [Planned Migration](#planned-migration)
  * [Failover](#failover)
  * [Failback](#failback)
* [Disaster Recovery](#disaster-recovery)
  * [Failover](#failover-abrupt-shutdown)
  * [Failback](#failback-post-disaster-recovery)
* [Appendix](#appendix)
  * [Creating a VolumeReplicationClass CR](#create-a-volume-replication-class-cr)
  * [Creating a VolumeReplication CR](#create-a-volumereplication-cr)
  * [Checking VolumeReplication CR status](#checking-replication-status)
  * [Backup and Restore](#backup--restore)

## Planned Migration

> Use cases: Datacenter maintenance, technology refresh, disaster avoidance, etc.

### Failover

The failover operation is the process of switching production to a
backup facility (normally your recovery site). During a failover,
access to the image on the primary site should be stopped.
The image should then be made *primary* on the secondary cluster so that
access can be resumed there.

> :memo: As mentioned in the pre-requisites, a periodic or one-time backup of
> the application should be available for restore on the secondary site (cluster-b).

Follow the steps below for a planned migration of the workload from the primary
cluster to the secondary cluster:

* Scale down all the application pods which are using the
  mirrored PVC on the primary cluster.
* [Take a backup](async-disaster-recovery.md#backup--restore) of the PVC and PV objects from the primary cluster.
  This can be done using a backup tool such as
  [velero](https://velero.io/docs/main/).
* [Update the VolumeReplication CR](async-disaster-recovery.md#create-a-volumereplication-cr) to set `replicationState` to `secondary` on the primary site.
  When the operator sees this change, it passes the information down to the
  driver via a GRPC request to mark the dataSource as `secondary`.
* If you are manually recreating the PVC and PV on the secondary cluster,
  remove the `claimRef` section in the PV objects. (See [this](async-disaster-recovery.md#restore-the-backup-on-cluster-2) for details.)
* Recreate the storageclass, PVC, and PV objects on the secondary site.
* As you are creating a static binding between the PVC and PV, a new PV won't
  be created here; the PVC will get bound to the existing PV.
* [Create the VolumeReplicationClass](async-disaster-recovery.md#create-a-volume-replication-class-cr) on the secondary site.
* [Create VolumeReplication CRs](async-disaster-recovery.md#create-a-volumereplication-cr) for all the PVCs for which mirroring
  is enabled.
  * `replicationState` should be `primary` for all the PVCs on
    the secondary site.
* [Check the VolumeReplication CR status](async-disaster-recovery.md#checking-replication-status) to verify that the image is marked `primary` on the secondary site.
* Once the image is marked `primary`, the PVC is ready
  to be used. Now, we can scale up the applications to use the PVC.
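The scale-down and demote steps on the primary site can be sketched as follows (the Deployment name `myapp` is illustrative; the VolumeReplication CR name matches the Appendix example):

```shell
# Stop the workload that uses the mirrored PVC on the primary cluster.
kubectl --context=cluster-1 scale deployment myapp --replicas=0

# Demote the image: set replicationState to "secondary" on the primary site.
kubectl --context=cluster-1 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"secondary"}}'
```

Both commands require access to a live cluster via the `cluster-1` context.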

> :memo: **WARNING**: In the Async Disaster Recovery use case, we don't get
> the complete data.
> We only get crash-consistent data, as of the last snapshot interval.

### Failback

A failback operation is the process of returning production to its
original location after a disaster or a scheduled maintenance period.
For a migration during steady-state operation, a failback uses the
same process as failover, just with the clusters switched.

> :memo: **Remember**: We can skip the backup and restore operations
> during failback if the required yamls are already present on
> the primary cluster. Any new PVCs will still need to be restored on the
> primary site.

## Disaster Recovery

> Use cases: Natural disasters, power failures, system failures and crashes, etc.

### Failover (abrupt shutdown)

In case of disaster recovery, create the VolumeReplication CR at the secondary site.
Since the connection to the primary site is lost, the operator automatically
sends a GRPC request down to the driver to forcefully mark the dataSource as `primary`
on the secondary site.

* If you are manually creating the PVC and PV on the secondary cluster, remove
  the `claimRef` section in the PV objects.
* Create the storageclass, PVC, and PV objects on the secondary site.
* As you are creating a static binding between the PVC and PV, a new PV won't be
  created here; the PVC will get bound to the existing PV.
* [Create the VolumeReplicationClass](async-disaster-recovery.md#create-a-volume-replication-class-cr) and [VolumeReplication CR](async-disaster-recovery.md#create-a-volumereplication-cr) on the secondary site.
* [Check the VolumeReplication CR status](async-disaster-recovery.md#checking-replication-status) to verify that the image is marked `primary` on the secondary site.
* Once the image is marked `primary`, the PVC is ready to be used. Now,
  we can scale up the applications to use the PVC.
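The promotion can be confirmed by reading the CR's reported state with a `jsonpath` query (CR name as in the Appendix example):

```shell
# Prints "Primary" once the image has been force-promoted on the secondary site.
kubectl --context=cluster-2 get volumereplication pvc-volumereplication \
  -o jsonpath='{.status.state}'
```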

### Failback (post-disaster recovery)

Once the failed cluster is recovered on the primary site and you want to fail back
from the secondary site, follow the steps below:

* Scale down the running applications (if any) on the primary site.
  Ensure that all persistent volumes in use by the workload are no
  longer in use on the primary cluster.
* [Update the VolumeReplication CR](async-disaster-recovery.md#create-a-volumereplication-cr) `replicationState`
  from `primary` to `secondary` on the primary site.
* Scale down the applications on the secondary site.
* [Update the VolumeReplication CR](async-disaster-recovery.md#create-a-volumereplication-cr) `replicationState`
  from `primary` to `secondary` on the secondary site.
* On the primary site, [verify that the VolumeReplication status](async-disaster-recovery.md#checking-replication-status) reports the
  volume as ready to use.
* Once the volume is marked ready to use, change the `replicationState`
  from `secondary` to `primary` on the primary site.
* Scale up the applications again on the primary site.
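Assuming the VolumeReplication CR name from the Appendix, the state transitions above can be sketched as:

```shell
# Demote on both sites first (primary site, then secondary site).
kubectl --context=cluster-1 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"secondary"}}'
kubectl --context=cluster-2 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"secondary"}}'

# After cluster-1 reports the volume ready to use, promote it there.
kubectl --context=cluster-1 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"primary"}}'
```

These commands require both clusters to be reachable, which is the expected situation during failback.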

## Appendix

The guide below assumes that we have a PVC (`rbd-pvc`) in `Bound` state, created using
a *StorageClass* with the `Retain` reclaimPolicy.

```bash
kubectl get pvc --context=cluster-1
```

> ```bash
> NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
> rbd-pvc   Bound    pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec   1Gi        RWO            csi-rbd-sc     44s
> ```

### Create a Volume Replication Class CR

In this case, we create a VolumeReplicationClass on cluster-1.

```yaml
cat <<EOF | kubectl --context=cluster-1 apply -f -
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass
spec:
  provisioner: rook-ceph.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot
    schedulingInterval: "12m"
    schedulingStartTime: "16:18:43"
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: rook-ceph
EOF
```
>:bulb: **Note:** The `schedulingInterval` can be specified in formats of | ||
> minutes, hours or days using suffix `m`,`h` and `d` respectively. | ||
> The optional schedulingStartTime can be specified using the ISO 8601 | ||
> time format. | ||
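For example, a daily schedule would use the `d` suffix (only the `parameters` fragment of the VolumeReplicationClass above is shown; the values are illustrative):

```yaml
  parameters:
    mirroringMode: snapshot
    schedulingInterval: "1d"        # take a mirror snapshot once a day
    schedulingStartTime: "00:30:00" # optional ISO 8601 start time
```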

### Create a VolumeReplication CR

Once the VolumeReplicationClass is created, create a VolumeReplication CR for
the PVC which we intend to replicate to the secondary cluster.

```yaml
cat <<EOF | kubectl --context=cluster-1 apply -f -
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: pvc-volumereplication
spec:
  volumeReplicationClass: rbd-volumereplicationclass
  replicationState: primary
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: rbd-pvc # Name of the PVC on which mirroring is to be enabled.
EOF
```

> :memo: *VolumeReplication* is a namespace-scoped object. Thus,
> it should be created in the same namespace as that of the PVC.

### Checking Replication Status

`replicationState` is the replication state of the volume being referenced.
Possible values are `primary`, `secondary`, and `resync`:

* `primary` denotes that the volume is the primary.
* `secondary` denotes that the volume is a secondary.
* `resync` denotes that the volume needs to be resynced.
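Each of these states is requested the same way, by updating the CR's `spec`. For instance, a sketch of requesting `resync` for a demoted volume (CR name from this Appendix; the exact recovery workflow depends on the replication operator version):

```shell
# Ask the operator to resynchronize a demoted (secondary) volume.
kubectl --context=cluster-1 patch volumereplication pvc-volumereplication \
  --type merge -p '{"spec":{"replicationState":"resync"}}'
```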

To check the VolumeReplication CR status:

```bash
kubectl get volumereplication pvc-volumereplication --context=cluster-1 -oyaml
```

> ```yaml
> ...
> spec:
>   dataSource:
>     apiGroup: ""
>     kind: PersistentVolumeClaim
>     name: rbd-pvc
>   replicationState: primary
>   volumeReplicationClass: rbd-volumereplicationclass
> status:
>   conditions:
>   - lastTransitionTime: "2021-05-04T07:39:00Z"
>     message: ""
>     observedGeneration: 1
>     reason: Promoted
>     status: "True"
>     type: Completed
>   - lastTransitionTime: "2021-05-04T07:39:00Z"
>     message: ""
>     observedGeneration: 1
>     reason: Healthy
>     status: "False"
>     type: Degraded
>   - lastTransitionTime: "2021-05-04T07:39:00Z"
>     message: ""
>     observedGeneration: 1
>     reason: NotResyncing
>     status: "False"
>     type: Resyncing
>   lastCompletionTime: "2021-05-04T07:39:00Z"
>   lastStartTime: "2021-05-04T07:38:59Z"
>   message: volume is marked primary
>   observedGeneration: 1
>   state: Primary
> ```

### Backup & Restore

Here, we take a backup of the PVC and PV objects on one site, so that they can be restored later on the peer cluster.

#### **Take backup on cluster-1**

* Take a backup of the PVC `rbd-pvc`:

  ```bash
  kubectl --context=cluster-1 get pvc rbd-pvc -oyaml > pvc-backup.yaml
  ```

* Take a backup of the PV corresponding to the PVC:

  ```bash
  kubectl --context=cluster-1 get pv/pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec -oyaml > pv-backup.yaml
  ```

> :bulb: We can also take a backup using external tools like **Velero**.
> See the [velero documentation](https://velero.io/docs/main/) for more information.

#### **Restore the backup on cluster-2**

* Create the storageclass on the secondary cluster:

  ```bash
  kubectl create -f examples/rbd/storageclass.yaml --context=cluster-2
  ```

  > ```bash
  > storageclass.storage.k8s.io/csi-rbd-sc created
  > ```
* Create the VolumeReplicationClass on the secondary cluster:

  ```bash
  cat <<EOF | kubectl --context=cluster-2 apply -f -
  apiVersion: replication.storage.openshift.io/v1alpha1
  kind: VolumeReplicationClass
  metadata:
    name: rbd-volumereplicationclass
  spec:
    provisioner: rook-ceph.rbd.csi.ceph.com
    parameters:
      mirroringMode: snapshot
      replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
      replication.storage.openshift.io/replication-secret-namespace: rook-ceph
  EOF
  ```

  > ```bash
  > volumereplicationclass.replication.storage.openshift.io/rbd-volumereplicationclass created
  > ```
* If Persistent Volumes and Claims are created manually on the secondary cluster,
  remove the `claimRef` section from the backed-up PV objects in the yaml files, so that the
  PV can get bound to the new claim on the secondary cluster.

  ```yaml
  ...
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 1Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: rbd-pvc
      namespace: default
      resourceVersion: "64252"
      uid: 65dc0aac-5e15-4474-90f4-7a3532c621ec
    csi:
    ...
  ```
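Stripping `claimRef` from the backup file can also be scripted, for example with `yq` v4 (an external tool, not part of this guide's prerequisites):

```shell
# Delete spec.claimRef in place from the backed-up PV manifest.
yq eval -i 'del(.spec.claimRef)' pv-backup.yaml
```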
* Apply the Persistent Volume backup from the primary cluster:

  ```bash
  kubectl create -f pv-backup.yaml --context=cluster-2
  ```

  > ```bash
  > persistentvolume/pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec created
  > ```

* Apply the Persistent Volume Claim from the restored backup:

  ```bash
  kubectl create -f pvc-backup.yaml --context=cluster-2
  ```

  > ```bash
  > persistentvolumeclaim/rbd-pvc created
  > ```

```bash
kubectl get pvc --context=cluster-2
```

> ```bash
> NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
> rbd-pvc   Bound    pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec   1Gi        RWO            csi-rbd-sc     44s
> ```
---
title: Disaster Recovery Overview
weight: 3240
---

# Planned Migration and Disaster Recovery with Rook

Rook v1.6.0 comes with new volume replication support and Ceph-CSI v3.3.0, which allow users to perform disaster recovery and planned migration of clusters.

The following documents will help to configure the clusters, as well as walk through the failover and failback procedure for the Disaster Recovery and Planned Migration use cases:

* [Configuring clusters with DR](rbd-mirroring.md): Setting up RBD mirroring between a primary and a secondary cluster
* [Async DR Failover and Failback Steps](async-disaster-recovery.md): Step-by-step failover and failback for the Planned Migration and Disaster Recovery use cases