Fencing Question for rbd and cephfs to recover from node lost #7954
Storage fencing (force detach) has been a proposal in the CSI spec. We would get the node-id (IP address or hostname) of the pod that needs to be fenced. When using host networking this should be stable, and we can imagine this functioning correctly. However, with non-host networking it becomes questionable (asked for guidance at container-storage-interface/spec#477 (comment)).
What is meant by "we can fence"? If blocklisting is what is meant, via what exactly?
If you don't blocklist entire IP addresses, you wouldn't run into this.
Also, what is a watcher? Are you talking about RADOS watches or something else?
After skimming through the linked issues, I see that it is RADOS watches. Please note that listing watchers and relying on that to tell what needs to be blocklisted is unreliable: an image may still be mapped somewhere (and be written to!) with no watchers present. And there are no watches in the CephFS case at all.
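For context, this is roughly how the watcher information under discussion is inspected (a sketch; the pool name, image name, and header object id below are made-up placeholders):

```shell
# Show an RBD image's status, including its RADOS watchers.
# Output typically contains lines like:
#   Watchers:
#     watcher=192.168.1.10:0/3544431426 client.4123 cookie=...
rbd status mypool/myimage

# Equivalent low-level view: list watches on the image's header object
# (the object id here is a placeholder).
rados -p mypool listwatchers rbd_header.10226b8b4567
```

As noted above, an empty watcher list does not prove the image is unmapped everywhere, which is why fencing decisions based only on this output are unreliable.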
We do check watchers at the CSI level before mapping an RWO (ReadWriteOnce) RBD PVC. Is there any way to block the client's read and write access to the RBD image from a particular node in a node-loss scenario?
Yes, for CephFS I was talking about manual client eviction: https://docs.ceph.com/en/latest/cephfs/eviction/#manual-client-eviction
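For reference, manual eviction as described in those docs looks roughly like this (the MDS rank and client id below are illustrative, and these need a live cluster):

```shell
# List the clients currently connected to an MDS (rank 0 as an example):
ceph tell mds.0 client ls

# Evict a client by the "id" field from the listing above. By default,
# eviction also blocklists the client's entity address in the OSD map:
ceph tell mds.0 client evict id=4305
```

This is the CephFS-side equivalent of fencing, since there are no RADOS watches to go by in the CephFS case.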
Yes, blocklisting. The issue is knowing what to blocklist. If you do the entire IP address, you run into IP address reuse issues.
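To make the two options concrete, here is a sketch of both forms of blocklisting (addresses are made up; on older Ceph releases the subcommand is spelled `osd blacklist`, and entries expire after a default duration unless one is given):

```shell
# Blocklist a single client instance by its full entity address
# ("<ip>:<port>/<nonce>") -- only that incarnation is fenced:
ceph osd blocklist add 192.168.1.10:0/3544431426

# Blocklist an entire IP -- fences everything from that address,
# including an unrelated client that later reuses it:
ceph osd blocklist add 192.168.1.10

# Inspect and undo:
ceph osd blocklist ls
ceph osd blocklist rm 192.168.1.10
```

The IP-reuse hazard mentioned above applies only to the second form.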
With individual client instance blocklisting, if the rbdplugin pod (whose pod IP might change) and the application pod are restarted, is there any chance they get read and write access again?
If the pod is restarted, it will be able to read and write again even if the IP address remains the same, because it will get a new Ceph entity address (same IP but different nonce).
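To illustrate the point about the nonce (the two entity addresses below are made up for the example):

```shell
# A Ceph entity address has the form "<ip>:<port>/<nonce>". A restarted
# client keeps the IP but gets a fresh nonce, so it is a new entity.
old="192.168.1.10:0/3544431426"   # entity address before the pod restart
new="192.168.1.10:0/98125354"     # entity address after the pod restart

ip_of()    { echo "${1%%:*}"; }   # everything before the first ':'
nonce_of() { echo "${1##*/}"; }   # everything after the last '/'

# Same IP, different nonce: a blocklist entry for $old does not
# affect the restarted client at $new.
[ "$(ip_of "$old")" = "$(ip_of "$new")" ] && echo "same ip"
[ "$(nonce_of "$old")" = "$(nonce_of "$new")" ] || echo "different nonce"
```

So blocklisting the old entity address fences exactly the dead incarnation, which is why the restarted pod regains access.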
Then would using pod networking for CSI be a problem? If we switch back to host networking and blocklist the IP, will that solve the problem? Or are there other solutions to fix this issue?
What I meant here was inconsistent data: the application might be writing some data from the disconnected node, which would lead to inconsistent data on the RBD/CephFS volume.
I believe the goal of this discussion is to determine:
If a pod is force deleted, after some timeout (~11 minutes) the pod will be allowed to start on a new node and mount the volume. In order for this to be safe, is it required that the original node be permanently offline, so the old volume doesn't write and cause corruption? If the admin can't guarantee the node is permanently offline, this is when they need to blocklist the node, right? And in this case, it seems like we need to permanently blocklist either the node's IP address (if using host networking) or the pod's IP address (if not using host networking). In the latter case, if the pod could be restarted with a different IP address, the volume could come back online and cause corruption, right? So don't we need to use host networking and recommend users blocklist the node's IP? Blocklisting anything else seems like it wouldn't protect against corruption from two writers, as long as the failed node has the possibility of coming back online.
@idryomov can you please provide your feedback?
So far I don't see why host networking vs pod networking would play a role here. If the original Ceph entity is blocklisted, that guarantees that the original pod wouldn't be able to come back online and cause corruption. Unless we are considering the failed node going rogue to the extent where it can restart the original pod on its own, I don't see how two writers can arise -- blocklisting the Ceph entity (which is per-pod, independent of the networking setup) should be sufficient. Please correct me if I am misunderstanding the "threat" model.
What I see (unverified) is that pod network IPs can be assigned to pods across nodes at will. If a pod's IP is blocklisted, that IP may later be reassigned to an unrelated pod on another node, which would then be denied access. The other case is: if the pod is restarted on the unavailable node for any reason (local kubelet action on the node, again unverified), then it may pick up a new pod IP and hence avoid the blocklist. In this situation we may end up with 2 writers. Using the node IP resolves both (possible) edge cases.
I was suggesting blocklisting a Ceph entity address instead of the entire IP address, precisely to avoid the potential denial of service scenario. But ...
... if this is something we need to guard against, blocklisting at the IP level is the only option. I have a hard time wrapping my head around this though. If workers can restart pods on their own without talking to the control plane, what is the procedure for adding a failed worker back to the cluster? When re-adding the node at the Kubernetes level, who/what would know to unblocklist it at the Ceph level? |
Two cases. If only the kubelet was dead or unreachable from the control plane, the node will get auto-added when it comes back. If the node was dead and the admin has to fix the problems manually, then AFAICT, based on the application logs or pod health, the admin needs to unblocklist it, as the blocklisting is a manual process.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation. |
Recently the cephcsi pods in Rook have been updated to move away from host networking. I don't know whether we can hit this problem with multus networking.
We are talking about recovery from node loss, as mentioned at #7282. Node loss can happen for many reasons (either the kubelet is dead, or the whole node is dead, etc.).
The below problem might exist for both cephfs and RBD.
As we are moving to pod networking, the pod IP of the rbdplugin and cephfsplugin pods will show up as the watcher address. We can fence that IP, but the problem is: what if the pod is simply restarted and gets a new IP? The pods running on the disconnected node would again get read and write access to the volumes and could corrupt the data.
Conversely, if a plugin pod on another node is restarted, there is a slight chance it gets an IP that is already fenced on the Ceph side, which would deny read/write access to the existing applications on that node.
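If fencing by IP is what ends up being done, the flow sketched in this issue would look roughly like this (the addresses and CIDR are illustrative; range blocklisting is only available on newer Ceph releases):

```shell
# Fence the lost node's address so its clients can no longer write:
ceph osd blocklist add 10.0.0.5

# Newer releases can fence a whole CIDR, e.g. the pod network range
# assigned to the lost node:
ceph osd blocklist range add 10.244.1.0/24

# Once the node is recovered or decommissioned, lift the fence:
ceph osd blocklist rm 10.0.0.5
ceph osd blocklist range rm 10.244.1.0/24
```

Range blocklisting would sidestep the "restarted pod gets a new IP" hole described above, at the cost of fencing every pod IP on that node.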
Ref #7282 (comment)
cc @idryomov Can you please confirm whether the above can cause an issue? Is there an alternative way to fence the client from accessing the RBD images?