diff --git a/Documentation/async-disaster-recovery.md b/Documentation/async-disaster-recovery.md
new file mode 100644
index 0000000000000..8b991dcf5f6a8
--- /dev/null
+++ b/Documentation/async-disaster-recovery.md
@@ -0,0 +1,460 @@
+# Failover and Failback in RBD Async Disaster Recovery
+
+[RBD mirroring](https://docs.ceph.com/en/latest/rbd/rbd-mirroring/)
+ asynchronously replicates RBD images between multiple Ceph clusters.
+ This capability is available in two modes:
+
+* Journal-based: Every write to the RBD image is first recorded
+  to the associated journal before modifying the actual image.
+  The remote cluster reads from this journal and replays the
+  updates to its local image.
+* Snapshot-based: This mode uses periodically scheduled or
+  manually created RBD image mirror-snapshots to replicate
+  crash-consistent RBD images between clusters.
+
+## Create RBD Pools
+
+In this section we create RBD pools with mirroring enabled for
+ the DR use case.
+
+> :memo: **Note:** It is also possible to edit existing pools and
+> enable them for replication.
+
+Execute the following steps on each peer cluster to create
+ mirror-enabled pools:
+
+* Create an RBD pool with mirroring enabled by adding the
+  `spec.mirroring` section to the CephBlockPool CR, for example:
+
+  ```yaml
+  apiVersion: ceph.rook.io/v1
+  kind: CephBlockPool
+  metadata:
+    name: replicapool
+    namespace: rook-ceph
+  spec:
+    replicated:
+      size: 1
+    mirroring:
+      enabled: true
+      mode: image
+      # schedule(s) of snapshots
+      snapshotSchedules:
+        - interval: 24h # daily snapshots
+          startTime: 14:00:00-05:00
+  ```
+
+* Create the `replicapool` pool on the cluster:
+
+  ```bash
+  kubectl apply -f pool.yaml -n rook-ceph
+  ```
+
+  > ```bash
+  > cephblockpool.ceph.rook.io/replicapool created
+  > ```
+
+* Repeat the steps on the peer cluster.
+
+> :warning: **WARNING:** The pool name must be the same across the cluster
+> peers for RBD replication to function.
+
+For more information on the CephBlockPool CR, please refer to the
+ [ceph-pool-crd documentation](ceph-pool-crd.md#mirroring).
+
+## Bootstrap Peers
+
+In order for the rbd-mirror daemon to discover its peer cluster, the
+ peer must be registered and a user account must be created. The following
+ steps enable bootstrapping peers to discover and authenticate to each other.
+
+For more details, refer to the official rbd-mirror documentation on
+ [how to create a bootstrap peer](https://docs.ceph.com/en/latest/rbd/rbd-mirroring/#bootstrap-peers).
+
+## Create RBDMirror CRD
+
+Replication is handled by the rbd-mirror daemon. The rbd-mirror daemon
+ is responsible for pulling image updates from the remote peer cluster
+ and applying them to the image within the local cluster.
+
+The rbd-mirror daemon(s) are created through the CephRBDMirror custom
+ resource definition (CRD).
+ For more information on how to set up the Ceph RBDMirror CRD, refer to the
+ [rook documentation](https://rook.io/docs/rook/master/ceph-rbd-mirror-crd.html).
+
+## Enable OMAP Generator and Volume Replication
+
+### OMAP Generator
+
+The OMAP generator is a sidecar container that, when deployed with
+ the CSI provisioner pod, generates the internal CSI OMAPs that map
+ each PV to its RBD image. This is required because static PVs are
+ transferred across peer clusters in the DR use case, and the OMAPs
+ are needed to preserve the PVC-to-storage mappings.
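+
+As a quick sanity check, these mappings can be inspected directly in the
+ mirrored pool. This is only an illustration: it assumes the `rook-ceph-tools`
+ toolbox deployment is running and that ceph-csi uses its default OMAP object
+ names (`csi.volumes.default` and the `csi.volume.*` prefix), which may differ
+ in your deployment.
+
+```bash
+# List the CSI OMAP objects that ceph-csi keeps in the mirrored pool.
+kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rados -p replicapool ls | grep csi.volume
+
+# Dump the PV-to-RBD-image mapping stored in the default CSI volumes object.
+kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rados -p replicapool listomapvals csi.volumes.default
+```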
+
+### Volume Replication
+
+The Volume Replication Operator is a Kubernetes operator that provides common
+ and reusable APIs for storage disaster recovery.
+ It is based on the [csi-addons/spec](https://github.com/csi-addons/spec)
+ specification and can be used by any storage provider.
+
+The Volume Replication Operator follows the controller pattern and provides
+ extended APIs for storage disaster recovery.
+ The extended APIs are provided via Custom Resource Definitions (CRDs).
+
+Rook v1.6.0 ships with volume replication support, which can be
+ enabled in the `rook-ceph-operator-config` ConfigMap.
+
+> :bulb: For more information, please refer to the
+> [volume-replication-operator](https://github.com/csi-addons/volume-replication-operator).
+
+### Deploy `csi-omap-generator` and `volume-replication` sidecars
+
+To achieve RBD mirroring, the `csi-omap-generator` and `volume-replication`
+ containers need to be deployed in the RBD provisioner pods.
+Execute the following steps on each peer cluster to enable the
+ OMAP generator and Volume Replication sidecars:
+
+* Edit the `rook-ceph-operator-config` ConfigMap:
+
+  ```bash
+  kubectl edit cm rook-ceph-operator-config -n rook-ceph
+  ```
+
+  Add the following configuration if it is not already present:
+
+  ```yaml
+  apiVersion: v1
+  data:
+    CSI_ENABLE_OMAP_GENERATOR: "true"
+    CSI_ENABLE_VOLUME_REPLICATION: "true"
+  ```
+
+* After successful execution of the above steps, the two new sidecars
+  should come up in the CSI provisioner pod.
+* Repeat the steps on the peer cluster.
+
+## VolumeReplicationClass and VolumeReplication
+
+### VolumeReplicationClass
+
+*VolumeReplicationClass* is a cluster-scoped resource that contains
+ driver-related configuration parameters. It holds the storage admin
+ information required by the volume replication operator.
+
+### VolumeReplication
+
+*VolumeReplication* is a namespaced resource that contains a reference
+ to the storage object to be replicated and to the VolumeReplicationClass
+ corresponding to the driver providing replication.
+
+> :bulb: For more information, please refer to the
+> [volume-replication-operator](https://github.com/csi-addons/volume-replication-operator).
+
+Let's say we have a *PVC* (rbd-pvc) in `Bound` state, created using a
+ *StorageClass* with the `Retain` reclaimPolicy:
+
+```bash
+kubectl get pvc --context=cluster-1
+```
+
+> ```bash
+> NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
+> rbd-pvc   Bound    pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec   1Gi        RWO            csi-rbd-sc     44s
+> ```
+
+* Create the VolumeReplicationClass on cluster-1:
+
+  ```yaml
+  cat <<EOF | kubectl --context=cluster-1 apply -f -
+  apiVersion: replication.storage.openshift.io/v1alpha1
+  kind: VolumeReplicationClass
+  metadata:
+    name: rbd-volumereplicationclass
+  spec:
+    provisioner: rook-ceph.rbd.csi.ceph.com
+    parameters:
+      mirroringMode: snapshot
+      schedulingInterval: "12m"
+      # adjust the secret name/namespace to match your deployment
+      replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
+      replication.storage.openshift.io/replication-secret-namespace: rook-ceph
+  EOF
+  ```
+
+> :bulb: **Note:** The `schedulingInterval` can be specified in
+> minutes, hours or days using the suffixes `m`, `h` and `d` respectively.
+> The optional `schedulingStartTime` can be specified using the ISO 8601
+> time format.
+
+* Once the VolumeReplicationClass is created, create a VolumeReplication for
+  the PVC which we intend to replicate to the secondary cluster:
+
+  ```yaml
+  cat <<EOF | kubectl --context=cluster-1 apply -f -
+  apiVersion: replication.storage.openshift.io/v1alpha1
+  kind: VolumeReplication
+  metadata:
+    name: pvc-volumereplication
+    namespace: default
+  spec:
+    volumeReplicationClass: rbd-volumereplicationclass
+    replicationState: primary
+    dataSource:
+      kind: PersistentVolumeClaim
+      name: rbd-pvc # name of the PVC on which mirroring is to be enabled
+  EOF
+  ```
+
+> :memo: *VolumeReplication* is a namespace-scoped object. Thus,
+> it should be created in the same namespace as the PVC.
+
+`replicationState` is the state of the volume being referenced.
+ Possible values are `primary`, `secondary`, and `resync`:
+
+* `primary` denotes that the volume is primary.
+* `secondary` denotes that the volume is secondary.
+* `resync` denotes that the volume needs to be resynced.
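+
+The same CR is later used to drive failover and failback: changing
+ `spec.replicationState` tells the driver to promote or demote the image.
+ As a sketch (re-using the CR name, namespace and kube context from the
+ example above), the state can be flipped with a merge patch:
+
+```bash
+# Demote the volume on cluster-1 by switching the VolumeReplication CR to "secondary".
+kubectl --context=cluster-1 -n default patch volumereplication pvc-volumereplication \
+  --type merge -p '{"spec":{"replicationState":"secondary"}}'
+```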
+
+To check the VolumeReplication CR status:
+
+```bash
+kubectl get volumereplication pvc-volumereplication --context=cluster-1 -oyaml
+```
+
+> ```yaml
+> ...
+> spec:
+>   dataSource:
+>     apiGroup: ""
+>     kind: PersistentVolumeClaim
+>     name: rbd-pvc
+>   replicationState: primary
+>   volumeReplicationClass: rbd-volumereplicationclass
+> status:
+>   conditions:
+>   - lastTransitionTime: "2021-05-04T07:39:00Z"
+>     message: ""
+>     observedGeneration: 1
+>     reason: Promoted
+>     status: "True"
+>     type: Completed
+>   - lastTransitionTime: "2021-05-04T07:39:00Z"
+>     message: ""
+>     observedGeneration: 1
+>     reason: Healthy
+>     status: "False"
+>     type: Degraded
+>   - lastTransitionTime: "2021-05-04T07:39:00Z"
+>     message: ""
+>     observedGeneration: 1
+>     reason: NotResyncing
+>     status: "False"
+>     type: Resyncing
+>   lastCompletionTime: "2021-05-04T07:39:00Z"
+>   lastStartTime: "2021-05-04T07:38:59Z"
+>   message: volume is marked primary
+>   observedGeneration: 1
+>   state: Primary
+> ```
+
+* Take a backup of the PVC and PV objects on the primary cluster (cluster-1):
+
+  * Take a backup of the PVC `rbd-pvc`:
+
+    ```bash
+    kubectl get pvc rbd-pvc -oyaml > pvc-backup.yaml
+    ```
+
+  * Take a backup of the PV corresponding to the PVC:
+
+    ```bash
+    kubectl get pv/pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec -oyaml > pv-backup.yaml
+    ```
+
+> :bulb: We can also take backups using external tools such as **Velero**.
+> Refer to the [Velero documentation](https://velero.io/docs/main/) for more information.
+
+* Restore on the secondary cluster (cluster-2):
+
+  * Create the StorageClass on the secondary cluster:
+
+    ```bash
+    kubectl create -f examples/rbd/storageclass.yaml --context=cluster-2
+    ```
+
+    > ```bash
+    > storageclass.storage.k8s.io/csi-rbd-sc created
+    > ```
+
+  * Create the VolumeReplicationClass on the secondary cluster:
+
+    ```bash
+    cat <<EOF | kubectl --context=cluster-2 apply -f -
+    apiVersion: replication.storage.openshift.io/v1alpha1
+    kind: VolumeReplicationClass
+    metadata:
+      name: rbd-volumereplicationclass
+    spec:
+      provisioner: rook-ceph.rbd.csi.ceph.com
+      parameters:
+        mirroringMode: snapshot
+        schedulingInterval: "12m"
+        # adjust the secret name/namespace to match your deployment
+        replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
+        replication.storage.openshift.io/replication-secret-namespace: rook-ceph
+    EOF
+    ```
+
+    > ```bash
+    > volumereplicationclass.replication.storage.openshift.io/rbd-volumereplicationclass created
+    > ```
+
+  * If Persistent Volumes and Claims are created manually on the secondary cluster,
+    remove the `claimRef` section from the backed-up PV objects in the YAML files so
+    that the PV can bind to the new claim on the secondary cluster.
+
+    ```yaml
+    ...
+    spec:
+      accessModes:
+      - ReadWriteOnce
+      capacity:
+        storage: 1Gi
+      claimRef:
+        apiVersion: v1
+        kind: PersistentVolumeClaim
+        name: rbd-pvc
+        namespace: default
+        resourceVersion: "64252"
+        uid: 65dc0aac-5e15-4474-90f4-7a3532c621ec
+      csi:
+    ...
+    ```
+
+* Apply the Persistent Volume backup from the primary cluster:
+
+  ```bash
+  kubectl create -f pv-backup.yaml --context=cluster-2
+  ```
+
+  > ```bash
+  > persistentvolume/pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec created
+  > ```
+
+* Apply the Persistent Volume Claim backup from the primary cluster:
+
+  ```bash
+  kubectl create -f pvc-backup.yaml --context=cluster-2
+  ```
+
+  > ```bash
+  > persistentvolumeclaim/rbd-pvc created
+  > ```
+
+Verify that the PVC is bound on the secondary cluster:
+
+```bash
+kubectl get pvc --context=cluster-2
+```
+
+> ```bash
+> NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
+> rbd-pvc   Bound    pvc-65dc0aac-5e15-4474-90f4-7a3532c621ec   1Gi        RWO            csi-rbd-sc     44s
+> ```
+
+## Planned Migration
+
+> Use cases: Datacenter maintenance, technology refresh, disaster avoidance, etc.
+
+### Failover
+
+The failover operation is the process of switching production to a
+ backup facility (normally your recovery site). In the case of failover,
+ access to the image on the primary site should be stopped.
+The image should then be made *primary* on the secondary cluster so that
+ access can be resumed there.
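+
+Before a planned failover it is worth confirming that mirroring is healthy and
+ that mirror-snapshots are being replicated. A minimal check (a sketch, assuming
+ the `rook-ceph-tools` toolbox and the `replicapool` pool used in the examples
+ above):
+
+```bash
+# Show the mirroring health of the pool and of each mirrored image in it.
+kubectl --context=cluster-1 -n rook-ceph exec deploy/rook-ceph-tools -- \
+  rbd mirror pool status replicapool --verbose
+```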
+
+> :memo: **Note:** As mentioned in the pre-requisites, periodic or one-time backups of
+> the application should be available for restore on the secondary site (cluster-2).
+
+Follow the steps below for a planned migration of the workload from the primary
+ cluster to the secondary cluster:
+
+* Scale down all the application pods which are using the
+  mirrored PVC on the primary cluster.
+* Take a backup of the PVC and PV objects from the primary cluster.
+  This can also be done using backup tools such as
+  [Velero](https://velero.io/docs/main/).
+* Update `replicationState` to `secondary` in the VolumeReplication CR at the primary site.
+  When the operator sees this change, it passes the information down to the
+  driver via a gRPC request to mark the dataSource as `secondary`.
+* If you are manually recreating the PVC and PV on the secondary cluster,
+  remove the `claimRef` section from the PV objects.
+* Recreate the StorageClass, PVC, and PV objects on the secondary site.
+* Because you are creating a static binding between the PVC and PV, a new PV will not
+  be created here; the PVC will bind to the existing PV.
+* Create the VolumeReplicationClass on the secondary site.
+* Create the VolumeReplications for all the PVCs for which mirroring
+  is enabled.
+  * `replicationState` should be `primary` for all the PVCs on
+    the secondary site.
+* Check whether the image is marked `primary` on the secondary site
+  by verifying the VolumeReplication CR status.
+* Once the image is marked `primary`, the PVC is ready
+  to be used and the applications can be scaled up to use it.
+
+> :warning: **WARNING**: In the async disaster recovery use case, we do not get
+> the complete data.
+> We only get crash-consistent data based on the snapshot interval time.
+
+### Failback
+
+A failback operation is the process of returning production to its
+ original location after a disaster or a scheduled maintenance period.
+ For a migration during steady-state operation, a failback uses the
+ same process as failover, just with the clusters switched.
+
+> :memo: **Remember**: We can skip the backup-restore operations
+> in case of failback if the required YAMLs are already present on
+> the primary cluster. Any new PVCs will still need to be restored on the
+> primary site.
+
+## Disaster Recovery
+
+> Use cases: Natural disasters, Power failures, System failures, and crashes, etc.
+
+### Failover (abrupt shutdown)
+
+In the case of disaster recovery, create the VolumeReplication CR at the secondary site.
+ Since the connection to the primary site is lost, the operator automatically
+ sends a gRPC request down to the driver to forcefully mark the dataSource as `primary`
+ on the secondary site.
+
+* If you are manually creating the PVC and PV on the secondary cluster, remove
+  the `claimRef` section from the PV objects.
+* Create the StorageClass, PVC, and PV objects on the secondary site.
+* Because you are creating a static binding between the PVC and PV, a new PV will not be
+  created here; the PVC will bind to the existing PV.
+* Create the VolumeReplicationClass and VolumeReplication CR on the secondary site.
+* Check whether the image is `primary` on the secondary site by verifying
+  the VolumeReplication CR status (for example with `kubectl wait`, as sketched below).
+* Once the image is marked `primary`, the PVC is ready to be used and
+  the applications can be scaled up to use it.
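+
+Rather than polling the CR by hand, the promotion on the secondary site can be
+ gated with `kubectl wait` on the `Completed` condition shown in the status
+ output above (a sketch; the CR name, namespace and kube context follow the
+ earlier examples):
+
+```bash
+# Block until the VolumeReplication CR reports the image as promoted (primary);
+# after that it is safe to scale the applications up on the secondary site.
+kubectl --context=cluster-2 -n default wait volumereplication/pvc-volumereplication \
+  --for=condition=Completed --timeout=120s
+```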
+
+### Failback (post-disaster recovery)
+
+Once the failed cluster on the primary site has recovered and you want to fail back
+ from the secondary site, follow the steps below (a consolidated command sketch
+ follows the list):
+
+* Scale down the running applications (if any) on the primary site.
+  Ensure that all persistent volumes in use by the workload are no
+  longer in use on the primary cluster.
+* Update the VolumeReplication CR `replicationState`
+  from `primary` to `secondary` on the primary site.
+* Scale down the applications on the secondary site.
+* Update the VolumeReplication CR `replicationState` from `primary` to
+  `secondary` on the secondary site.
+* On the primary site, verify that the VolumeReplication status reports the
+  volume as ready to use.
+* Once the volume is marked ready to use, change the `replicationState`
+  from `secondary` to `primary` on the primary site.
+* Scale up the applications again on the primary site.
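+
+Putting the failback sequence together as commands (a sketch only: the
+ Deployment name `my-app` is hypothetical, and the VolumeReplication name,
+ namespace and kube contexts are the ones used in the examples above, with
+ cluster-1 as the primary site and cluster-2 as the secondary site):
+
+```bash
+# 1. Scale down the workload on both sites so the volume is no longer in use.
+kubectl --context=cluster-1 -n default scale deploy/my-app --replicas=0
+kubectl --context=cluster-2 -n default scale deploy/my-app --replicas=0
+
+# 2. Demote the volume on both sites.
+kubectl --context=cluster-1 -n default patch volumereplication pvc-volumereplication \
+  --type merge -p '{"spec":{"replicationState":"secondary"}}'
+kubectl --context=cluster-2 -n default patch volumereplication pvc-volumereplication \
+  --type merge -p '{"spec":{"replicationState":"secondary"}}'
+
+# 3. On the primary site, confirm the status reports the volume as ready to use.
+kubectl --context=cluster-1 -n default get volumereplication pvc-volumereplication -oyaml
+
+# 4. Promote the volume on the primary site again and scale the workload back up.
+kubectl --context=cluster-1 -n default patch volumereplication pvc-volumereplication \
+  --type merge -p '{"spec":{"replicationState":"primary"}}'
+kubectl --context=cluster-1 -n default scale deploy/my-app --replicas=1
+```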