Merge pull request rook#9643 from BlaineEXE/multus-csi-design
docs: add CSI section to Multus design doc
BlaineEXE committed Feb 4, 2022
2 parents 5e03880 + 2bf1e1f commit ba7318c
Showing 2 changed files with 120 additions and 4 deletions.
2 changes: 1 addition & 1 deletion Documentation/ceph-cluster-crd.md
@@ -396,7 +396,7 @@ spec:
* In Openshift, to use a NetworkAttachmentDefinition (NAD) across namespaces, the NAD must be deployed in the `default` namespace. The NAD is then referenced with the namespace: `default/rook-public-nw`

#### Known issues with multus
When a CephFS/RBD volume is mounted in a Pod using cephcsi and then the CSI CephFS/RBD plugin is restarted or terminated (e.g. by restarting or deleting its DaemonSet), all operations on the volume become blocked, even after restarting the CSI pods. The only workaround is to restart the node where the cephcsi plugin pod was restarted.
When a CephFS/RBD volume is mounted in a Pod using Ceph CSI and then the CSI CephFS/RBD plugin is restarted or terminated (e.g. by restarting or deleting its DaemonSet), all operations on the volume become blocked, even after restarting the CSI pods. The only workaround is to restart the node where the Ceph CSI plugin pod was restarted.
This issue is tracked [here](https://github.com/rook/rook/issues/8085).

#### IPFamily
122 changes: 119 additions & 3 deletions design/ceph/multus-network.md
@@ -3,9 +3,9 @@
We have already explored and explained the benefit of multi-homed networking, so this document will not rehearse that but simply focus on the implementation for the Ceph backend.
If you are interested in learning more about multi-homed networking you can read the [design documentation on that matter](../core/multi-homed-cluster.md).

To make the story short, [Multus](https://github.com/intel/multus-cni) should allow us to get the same performance benefit as `HostNetworking` by increasing the security.
To make the story short, [Multus](https://github.com/intel/multus-cni) should allow us to get the same performance benefit as `HostNetworking` while also increasing security.
Using `HostNetworking` results in exposing **all** the network interfaces (the entire stack) of the host inside the container, whereas Multus allows you to pick the one you want.
Also, this removes the need of privileged containers (required for `HostNetworking`).
Also, this minimizes the need for privileged containers (required for `HostNetworking`).

## Proposed CRD changes

@@ -78,7 +78,123 @@ Nothing to do in particular since they don't use any service IPs.

### CSI pods

We can add annotations to these pods and they can reach out to the Ceph public network, then the driver will expose the block or the filesystem normally.
Steps must be taken to fix a CSI-with-multus issue documented
[here](https://github.com/rook/rook/issues/8085). To summarize the issue:
when a CephFS/RBD volume is mounted in a pod using Ceph CSI and then the CSI CephFS/RBD plugin is
restarted or terminated (e.g. by restarting or deleting its DaemonSet), all operations on the volume
become blocked, even after restarting the CSI pods. The only workaround is to restart the node where
the Ceph CSI plugin pod was restarted.

When deploying a CephCluster resource configured to use multus networks, a multus-connected network
interface will be added to the host network namespace for all nodes that will run CSI plugin pods.
This will allow Ceph CSI pods to run using host networking and still access Ceph's public multus
network.

The design for mitigating the issue comprises two components: a "holder" DaemonSet and a
"mover" DaemonSet.

#### Holder DaemonSet and Pods
The Rook-Ceph Operator's CSI controller creates a DaemonSet configured to use the
`network.selectors.public` network specified for the CephCluster. This DaemonSet runs on all the
nodes that will have CSI plugin pods. Its pods exist to "hold" a particular network interface that
the CSI pods can reliably connect to for communication with the Ceph cluster. The process running
in each pod will merely be an infinite sleep.

These Pods should only be stopped and restarted when a node is stopped so that volume operations do
not become blocked. The Rook-Ceph Operator's CSI controller should set the DaemonSet's update
strategy to `OnDelete` so that the pods do not get deleted if the DaemonSet is updated while also
ensuring that the pods will be updated on the next node reboot (or node drain).
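
As a rough illustration, here is a minimal sketch in Go (with client-go types) of how the CSI
controller might construct this DaemonSet. The names, labels, and image are hypothetical; the
`k8s.v1.cni.cncf.io/networks` annotation key and the `OnDelete` strategy type are standard
Multus/Kubernetes identifiers.

```go
package csi

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// holderDaemonSet sketches the holder DaemonSet. publicNetwork is the value
// of network.selectors.public from the CephCluster spec.
func holderDaemonSet(namespace, publicNetwork string) *appsv1.DaemonSet {
	labels := map[string]string{"app": "csi-multus-holder"} // hypothetical label
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "csi-multus-holder", Namespace: namespace},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			// OnDelete: pods are replaced only when deleted (e.g. on node
			// reboot or drain), so updating the DaemonSet cannot restart a
			// pod that is holding a multus interface.
			UpdateStrategy: appsv1.DaemonSetUpdateStrategy{
				Type: appsv1.OnDeleteDaemonSetStrategyType,
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: labels,
					// Multus attaches the CephCluster's public network.
					Annotations: map[string]string{
						"k8s.v1.cni.cncf.io/networks": publicNetwork,
					},
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "holder",
						Image: "rook/ceph:master", // placeholder image
						// Do nothing; just keep the pod's network namespace
						// (and the multus interface in it) alive.
						Command: []string{"/bin/sleep", "infinity"},
					}},
				},
			},
		},
	}
}
```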

#### Mover DaemonSet and Pods
The Rook-Ceph Operator's CSI controller also creates a second DaemonSet configured to use host
networking. This DaemonSet also runs on all nodes that will have CSI plugin pods (and holder pods).
Mover pods exist to "move" the multus network interface being held by the holder pod on the node
into the host's network namespace to provide users' volumes with uninterrupted access to the Ceph
cluster, even when the CSI driver is restarted (or updated).

The mover must (see the pod spec sketch after this list):
- be a privileged container
- have `SYS_ADMIN` and `NET_ADMIN` capabilities
- run in the host network namespace
- have access to the `/var/run/netns` directory
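
A sketch of the corresponding pod spec fragment in Go, assuming a hypothetical container name and
image; the `securityContext`, `hostNetwork`, and `hostPath` fields are standard Kubernetes API.

```go
package csi

import corev1 "k8s.io/api/core/v1"

// moverPodSpec sketches the security-sensitive parts of the mover pod.
func moverPodSpec() corev1.PodSpec {
	privileged := true
	hostPathType := corev1.HostPathDirectory
	return corev1.PodSpec{
		// Run directly in the host's network namespace.
		HostNetwork: true,
		Containers: []corev1.Container{{
			Name:  "mover",
			Image: "rook/ceph:master", // placeholder image
			SecurityContext: &corev1.SecurityContext{
				Privileged: &privileged,
				Capabilities: &corev1.Capabilities{
					// NET_ADMIN for netlink operations; SYS_ADMIN to enter
					// the holder pod's network namespace.
					Add: []corev1.Capability{"NET_ADMIN", "SYS_ADMIN"},
				},
			},
			VolumeMounts: []corev1.VolumeMount{{
				Name:      "netns",
				MountPath: "/var/run/netns",
			}},
		}},
		Volumes: []corev1.Volume{{
			Name: "netns",
			VolumeSource: corev1.VolumeSource{
				// Host directory where named network namespaces live.
				HostPath: &corev1.HostPathVolumeSource{
					Path: "/var/run/netns",
					Type: &hostPathType,
				},
			},
		}},
	}
}
```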

In order to not leave moved interfaces dangling in the host's network namespace, mover pods must
move interfaces back to their original namespace when CSI is being terminated. The most
straightforward way to accomplish this is to move interfaces back when the mover is being
terminated. If an interface is moved back while user applications are still using the volume, this
will cause I/O disruption. Therefore, the DaemonSet should also use the `OnDelete` update strategy
so that the
pods can be updated on node reboots (or node drains).

In order to better handle unexpected corner cases that leave moved interfaces in the host network
namespace (e.g., a mover is killed abruptly rather than gracefully terminated), instead treat "move"
operations as a disable-and-copy operation. To do this, disable the interface in the holder pod's
network namespace, and create a copy of the interface in the host namespace with the same MAC
address and IP config. From the user standpoint, the interface is still "moved" because the original
is disabled, so we keep the "mover" terminology. This merely helps Rook ensure that it is not
accidentally losing the original information.
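
A condensed sketch of this disable-and-copy operation using the `vishvananda/netlink` and
`vishvananda/netns` libraries. Error handling is trimmed, and the assumption that the multus
network is macvlan-based (so the copy can be re-created as a macvlan on a host parent interface)
belongs to this sketch, not to the design.

```go
package mover

import (
	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// disableAndCopy disables ifName inside the holder pod's network namespace
// (identified by its /var/run/netns path) and re-creates it in the host
// namespace with the same MAC and IP configuration.
func disableAndCopy(holderNsPath, ifName, hostParent string) error {
	holderNs, err := netns.GetFromPath(holderNsPath)
	if err != nil {
		return err
	}
	defer holderNs.Close()

	holder, err := netlink.NewHandleAt(holderNs)
	if err != nil {
		return err
	}
	defer holder.Delete()

	link, err := holder.LinkByName(ifName)
	if err != nil {
		return err
	}
	addrs, err := holder.AddrList(link, netlink.FAMILY_ALL)
	if err != nil {
		return err
	}
	// Disable (but keep) the original so the host-side copy is the only
	// active owner of the MAC/IP; the original remains the source of truth.
	if err := holder.LinkSetDown(link); err != nil {
		return err
	}

	// Re-create the interface in the host namespace with the same MAC,
	// assuming a macvlan-backed multus network on parent hostParent.
	parent, err := netlink.LinkByName(hostParent)
	if err != nil {
		return err
	}
	copyLink := &netlink.Macvlan{
		LinkAttrs: netlink.LinkAttrs{
			Name:         ifName,
			ParentIndex:  parent.Attrs().Index,
			HardwareAddr: link.Attrs().HardwareAddr,
		},
		Mode: netlink.MACVLAN_MODE_BRIDGE,
	}
	if err := netlink.LinkAdd(copyLink); err != nil {
		return err
	}
	for _, a := range addrs {
		// Reuse the original IP configuration on the host-side copy.
		if err := netlink.AddrAdd(copyLink, &netlink.Addr{IPNet: a.IPNet}); err != nil {
			return err
		}
	}
	return netlink.LinkSetUp(copyLink)
}
```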

A previous iteration of this design specified the mover application as a sidecar to Ceph CSI plugin
pods; however, this design would mean that the mover would need to be deleted and re-created
whenever the CSI plugin is updated, possibly resulting in I/O hangs in user pods during the update.
Keeping the mover independent allows CSI plugin updates to happen freely without complex
interactions between it and the mover.

#### Interactions between components
If a copied interface is left in the host network namespace after the holder pod is removed, multus
may later give the address to a different application, and the CSI driver may try to connect to the
different application with Ceph requests. We should try to avoid leaving interfaces on the host as
much as possible. Killing the mover pod abruptly will leave copied interfaces, but there is no way
to prevent this from happening.

If a holder pod is deleted, the interface hold will be lost. The mover must remove the
interface from the host's network namespace because multus may reassign the address to a different
application. We can document for users that they should not delete holder pods, but we cannot
prevent users from manually stopping holder pods. However, to prevent the Kubernetes scheduler from
terminating holder pods, the pods should be given the highest possible priority so that they are not
evicted except by explicit user action.
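
For illustration, one way to express this in Go; using the built-in `system-node-critical`
priority class is an assumption of this sketch, not a settled choice.

```go
package csi

import corev1 "k8s.io/api/core/v1"

// withHighestPriority gives holder pods the highest available priority so
// the scheduler does not preempt them; only explicit deletion (or a node
// reboot/drain) should remove them.
func withHighestPriority(spec corev1.PodSpec) corev1.PodSpec {
	spec.PriorityClassName = "system-node-critical"
	return spec
}
```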

There is a possible race condition where a mover pod is killed and where a holder is deleted before
a new mover starts up. In this case, the copied interface for the holder pod will be left in the
host network namespace, but the mover will not get a notification that the holder was removed. In
order to clean up from this case, upon startup, the mover should delete all interface copies in the
host network namespace that do not have a holder pod associated with them.
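
A sketch of that startup reconciliation in Go, where the `csi-` naming convention for copies and
the `holderExistsFor` helper are hypothetical placeholders for however copies are tracked and
holder pods are queried.

```go
package mover

import (
	"strings"

	"github.com/vishvananda/netlink"
)

// Hypothetical naming convention: host-side copies carry a "csi-" prefix.
const copyPrefix = "csi-"

// holderExistsFor is a hypothetical helper that would query the Kubernetes
// API for a holder pod owning this interface; stubbed here.
func holderExistsFor(ifName string) (bool, error) { return false, nil }

// reconcileOrphanedCopies removes host-namespace interface copies whose
// holder pod no longer exists. Run once when the mover starts, to clean up
// after a holder deleted while no mover was running.
func reconcileOrphanedCopies() error {
	links, err := netlink.LinkList()
	if err != nil {
		return err
	}
	for _, link := range links {
		name := link.Attrs().Name
		if !strings.HasPrefix(name, copyPrefix) {
			continue // not one of our copies
		}
		exists, err := holderExistsFor(name)
		if err != nil {
			return err
		}
		if !exists {
			// No holder owns this copy; delete it before multus can hand
			// the address to another workload.
			if err := netlink.LinkDel(link); err != nil {
				return err
			}
		}
	}
	return nil
}
```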

If the mover container is stopped, it should delete all copied interfaces in the host's
network namespace under the assumption that the CSI plugin is being removed, possibly by a
CephCluster being deleted or the node going down for maintenance. It might be possible to optimize
for the case where only the mover pod is being restarted, but it is very difficult to detect the
case where the mover is merely being restarted versus when the holder is also being removed without
possible race conditions between which pod might be stopped first by the Kubernetes API server
during a drain event. Therefore, focus on the simplest working implementation (described above)
instead of risking leaving an interface copy in the host network namespace, which could cause issues.

If an error occurs in the mover during network migration, it will fail and retry the migration
until the operation succeeds. If necessary, the mover will try to remove a partially-copied
interface.
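
A minimal retry loop for this, reusing the `disableAndCopy` sketch above; `cleanupPartialCopy` and
the ten-second interval are assumptions of this sketch.

```go
package mover

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// cleanupPartialCopy is a hypothetical helper that removes a half-created
// host-side interface; stubbed here.
func cleanupPartialCopy(ifName string) {}

// migrateWithRetry retries the disable-and-copy migration until it
// succeeds, cleaning up any partial copy between attempts.
func migrateWithRetry(holderNsPath, ifName, hostParent string) {
	_ = wait.PollImmediateInfinite(10*time.Second, func() (bool, error) {
		if err := disableAndCopy(holderNsPath, ifName, hostParent); err != nil {
			cleanupPartialCopy(ifName)
			return false, nil // not done; retry after the interval
		}
		return true, nil
	})
}
```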

Restarting a node will cause the multus interface in the host namespace to go away. On restart, the
holder pod will get a new interface, and the mover will copy it into the host network namespace again.

When a new node is added, holder and mover pods are added to it by their DaemonSets, and the
move/copy process described above occurs on the node.

The holder and mover DaemonSets should be deleted when the CSI driver components are removed.
Termination of either the holder pods or the mover pods triggers the mover to remove the multus
interfaces from the host network namespace of a given node.

The initial implementation of this design will be limited to supporting a single CephCluster with
Multus until we can be sure that the CSI plugin can support multiple migrated interfaces as well as
interfaces that are added and removed dynamically. This limitation will be enforced by allowing only
a single instance of the holder DaemonSet. A (future) partial implementation may be possible by
restarting the CSI plugin Pods when network interfaces are added or removed.

**Known issue:** the Docker container runtime does not use Linux's native `/var/run/netns`
directory. This mitigation is known to work with the CRI-O runtime but not with Docker. Therefore,
this feature will be disabled by default and enabled optionally by setting
`ROOK_CSI_MULTUS_USE_HOLDER_MOVER_PATTERN=true` in the Rook-Ceph operator's config.
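
In the operator, the gate could be a simple environment variable check (a sketch; only the
variable name comes from this design):

```go
package operator

import "os"

// holderMoverEnabled reports whether the holder/mover mitigation should be
// deployed. It defaults to disabled because the pattern does not work with
// the Docker runtime's handling of network namespaces.
func holderMoverEnabled() bool {
	return os.Getenv("ROOK_CSI_MULTUS_USE_HOLDER_MOVER_PATTERN") == "true"
}
```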

A previous version of the CSI proposal had the holder Pods creating "setup" and "teardown"
Kubernetes Jobs for migrating/un-migrating the multus networks. This design was rejected since new
Jobs could not be created if the Kubernetes namespace were in the "Terminating" state.

## Accepted proposal

