
Fencing Question for rbd and cephfs to recover from node lost #7954

Closed
Madhu-1 opened this issue May 20, 2021 · 19 comments

@Madhu-1
Member

Madhu-1 commented May 20, 2021

Recently, the cephcsi pods in Rook were updated to move away from host networking. I don't know whether we can hit this problem with multus networking as well.

We are talking about recovery from node loss as mentioned in #7282. Node loss can happen for many reasons (either the kubelet is dead, or the whole node is dead, etc.).

The below problems might exist for both CephFS and RBD.

  • As we are moving to pod networking, the pod IP of the rbdplugin and the cephfsplugin will show up as the watcher. We can fence that IP, but what if the pod is just restarted and gets a new IP? The pods running on the disconnected node would again get read and write access to the volumes and could corrupt the data.

  • If a plugin pod on another node is restarted, there is a slight chance that it gets an IP that is already fenced on the Ceph side, which would lead to read/write access being denied to the existing applications on that node.

Ref #7282 (comment)

cc @idryomov Can you please confirm whether the above can cause an issue? Is there an alternative way to fence the client from accessing the rbd images?

Madhu-1 added the ceph, csi, and question labels May 20, 2021
@nixpanic
Contributor

Storage fencing (force detach) has been proposed in the CSI spec. We would get the node-id (IP address or hostname) of the pod that needs to be fenced. When using host networking this should be stable and we can imagine this functioning correctly; however, with non-host networking it becomes questionable (asked for guidance at container-storage-interface/spec#477 (comment)).

@idryomov

The below problems might exist for both CephFS and RBD.

  • As we are moving to pod networking, the pod IP of the rbdplugin and the cephfsplugin will show up as the watcher. We can fence that IP, but what if the pod is just restarted and gets a new IP? The pods running on the disconnected node would again get read and write access to the volumes and could corrupt the data.

What is meant by "we can fence"? If blocklisting via ceph osd blocklist add then you would be blocklisting a Ceph entity address (IP + unique nonce), not the entire IP address. Blocklisting the entire IP address is also possible, but not common because the purpose of OSD-level blocklisting is to deal with split-brain scenarios: prevent a particular client instance that is already connected to the cluster from flushing its buffered state that is no longer valid. If the client process is restarted (whether after rebooting the node or without the reboot), it will get a new Ceph entity address. No data corruption is possible because the restarted process wouldn't have any buffered state.

  • If a plugin pod on another node is restarted, there is a slight chance that it gets an IP that is already fenced on the Ceph side, which would lead to read/write access being denied to the existing applications on that node.

If you don't blocklist entire IP addresses, you wouldn't run into this.
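
For reference, a minimal CLI sketch of the two blocklisting granularities described above; the address and nonce are made-up examples, and older Ceph releases spell the command blacklist instead of blocklist:

  # Blocklist a single client instance by its Ceph entity address (IP:port/nonce);
  # only that instance is cut off, and a restarted client (new nonce) is not affected.
  ceph osd blocklist add 10.244.1.23:0/1234567890

  # Blocklist an entire IP address (every client instance coming from that IP).
  ceph osd blocklist add 10.244.1.23

  # Inspect the current blocklist entries.
  ceph osd blocklist ls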

@idryomov

Also, what is a watcher? Are you talking about RADOS watches or something else?

@idryomov

idryomov commented May 21, 2021

After skimming through linked issues, I see that it is RADOS watches. Please note that listing watchers and relying on that for telling what needs to be blocklisted is unreliable. An image may still be mapped somewhere (and be written to!) with no watchers in rbd status output.

And there are no watches in the CephFS case at all.
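
For reference, these are the commands typically used to list RBD watchers, which, per the above, are not a reliable basis for deciding what to blocklist; the pool name, image name, and image id are placeholders:

  # Show the watchers registered on an RBD image.
  rbd status replicapool/csi-vol-0001

  # Lower-level equivalent: list the watchers on the image header object directly.
  rados -p replicapool listwatchers rbd_header.<image-id>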

@Madhu-1
Member Author

Madhu-1 commented May 21, 2021

After skimming through linked issues, I see that it is RADOS watches. Please note that listing watchers and relying on that for telling what needs to be blocklisted is unreliable. An image may still be mapped somewhere (and be written to!) with no watchers in rbd status output.

We do check watchers at the CSI level before mapping an RWO (ReadWriteOnce) rbd PVC. Is there any way to block the client from read and write access to the rbd image from a particular node in a node-lost scenario?

And there are no watches in the CephFS case at all.

Yes, for CephFS I was talking about manual client eviction: https://docs.ceph.com/en/latest/cephfs/eviction/#manual-client-eviction
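
For reference, manual eviction from the linked doc looks roughly like this; the MDS rank and client id are made-up examples:

  # List the currently connected CephFS clients with their ids and addresses.
  ceph tell mds.0 client ls

  # Evict a specific client by id; by default the eviction also blocklists that client's address.
  ceph tell mds.0 client evict id=4305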

@idryomov

Is there any way to block the client from read and write access to the rbd image from a particular node in a node-lost scenario?

Yes, blocklisting. The issue is knowing what to blocklist. If you blocklist the entire IP address, you run into IP address reuse issues.
Doing individual client instances is much better, but you need to know their entity addresses in advance. Getting them right before blocklisting from rbd status or rados listwatchers is not reliable.

@Madhu-1
Member Author

Madhu-1 commented May 21, 2021

Is there any way to block the client from read and write access to the rbd image from a particular node in a node-lost scenario?

Yes, blocklisting. The issue is knowing what to blocklist. If you blocklist the entire IP address, you run into IP address reuse issues.
Doing individual client instances is much better, but you need to know their entity addresses in advance. Getting them right before blocklisting from rbd status or rados listwatchers is not reliable.

With individual client instance blocklisting, if the rbdplugin pod restarts (its pod IP might change) along with the application pod, is there any chance they get read and write access again?

@idryomov

With individual client instance blocklisting, if the rbdplugin pod restarts (its pod IP might change) along with the application pod, is there any chance they get read and write access again?

If the pod is restarted, it will be able to read and write again even if the IP address remains the same because it will get a new Ceph entity address (same IP but different nonce).

@Madhu-1
Member Author

Madhu-1 commented May 21, 2021

With individual client instance blocklisting, if the rbdplugin pod restarts (its pod IP might change) along with the application pod, is there any chance they get read and write access again?

If the pod is restarted, it will be able to read and write again even if the IP address remains the same because it will get a new Ceph entity address (same IP but different nonce).

Then would using pod networking for CSI be a problem? If we switch back to host networking and blocklist the IP, will that solve the problem? Or are there any other solutions to fix this issue?

@Madhu-1
Member Author

Madhu-1 commented May 21, 2021

The below problems might exist for both CephFS and RBD.

  • As we are moving to pod networking, the pod IP of the rbdplugin and the cephfsplugin will show up as the watcher. We can fence that IP, but what if the pod is just restarted and gets a new IP? The pods running on the disconnected node would again get read and write access to the volumes and could corrupt the data.

What is meant by "we can fence"? If blocklisting via ceph osd blocklist add then you would be blocklisting a Ceph entity address (IP + unique nonce), not the entire IP address. Blocklisting the entire IP address is also possible, but not common because the purpose of OSD-level blocklisting is to deal with split-brain scenarios: prevent a particular client instance that is already connected to the cluster from flushing its buffered state that is no longer valid. If the client process is restarted (whether after rebooting the node or without the reboot), it will get a new Ceph entity address. No data corruption is possible because the restarted process wouldn't have any buffered state.

What I meant here was inconsistent data: the application might still be writing some data from the disconnected node, which would lead to inconsistent data on the RBD/CephFS volume.

  • If a plugin pod on another node is restarted, there is a slight chance that it gets an IP that is already fenced on the Ceph side, which would lead to read/write access being denied to the existing applications on that node.

If you don't blocklist entire IP addresses, you wouldn't run into this.

@travisn
Member

travisn commented May 21, 2021

I believe the goal of this discussion is to determine:

  1. Is it OK to use pod networking, or do we need to use host networking as before?
  2. Is there any risk of corruption when using pod networking that we wouldn't see with host networking?
  3. What steps should we document for an admin to ensure data safety when a node is lost?

If a pod is force deleted, after some timeout (~11 minutes) the pod will be allowed to start on a new node and mount the volume. In order for this to be safe, is it required that the original node be permanently offline, so that writes from the old node don't cause corruption?

If the admin can't guarantee the node is permanently offline, this is when they need to blocklist the node, right? And in this case, it seems like we need to permanently blocklist either the node's IP address (if using host networking) or the pod's IP address (if not using host networking). In the latter case, if the pod could be restarted with a different IP address, the volume could come back online and cause corruption, right? So don't we need to use host networking and recommend users blocklist the node's IP? Blocklisting anything else seems like it wouldn't protect against corruption from two writers, as long as the failed node has the possibility of coming back online.
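
For reference, if the node-IP approach is taken with host networking, fencing the whole node might look like the sketch below; the node IP is a made-up example, and whether a bare IP (i.e. every client instance from that node) is the right granularity is exactly what is being discussed here:

  # Fence every Ceph client instance originating from the lost node's IP
  # (an optional expiry in seconds can be passed as a second argument).
  ceph osd blocklist add 192.168.121.52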

@Madhu-1
Member Author

Madhu-1 commented Jun 2, 2021

@idryomov can you please provide your feedback?

@idryomov

idryomov commented Jun 3, 2021

So far I don't see why host networking vs pod networking would play a role here.

If the original Ceph entity is blocklisted, that guarantees that the original pod wouldn't be able to come back online and cause corruption. Unless we are considering the failed node going rogue to the extent where it can restart the original pod on its own, I don't see how two writers can arise -- blocklisting the Ceph entity (which is per-pod independent of the networking setup) should be sufficient.

Please correct me if I am misunderstanding the "threat" model.

@ShyamsundarR
Contributor

So far I don't see why host networking vs pod networking would play a role here.

If the original Ceph entity is blocklisted, that guarantees that the original pod wouldn't be able to come back online and cause corruption. Unless we are considering the failed node going rogue to the extent where it can restart the original pod on its own, I don't see how two writers can arise -- blocklisting the Ceph entity (which is per-pod independent of the networking setup) should be sufficient.

Please correct me if I am misunderstanding the "threat" model.

What I see (unverified) is that pod network IPs can be assigned to pods across nodes at will. If a pod in a node is using IP1 and the node is deemed unavailable, we would blocklist IP1. Nothing prevents IP1 being reassigned to another pod in the cluster on another node. This is a denial of service rather than a potential data corruption issue.

If IP1 were a node IP instead, the above would not be feasible, and hence it would be safer once blocklisted.

The other case is: if the pod is restarted on the unavailable node for any reason (local kubelet action on the node, again unverified), it may pick up a new pod IP and hence avoid the blocklist. In this situation we may end up with two writers.

Using the node IP resolves both (possible) edge cases.

@idryomov

idryomov commented Jun 3, 2021

What I see (unverified) is that pod network IPs can be assigned to pods across nodes at will. If a pod in a node is using IP1 and the node is deemed unavailable, we would blocklist IP1. Nothing prevents IP1 being reassigned to another pod in the cluster on another node. This is a denial of service rather than a potential data corruption issue.

I was suggesting blocklisting a Ceph entity address instead of the entire IP address, precisely to avoid the potential denial of service scenario. But ...

The other case is: if the pod is restarted on the unavailable node for any reason (local kubelet action on the node, again unverified), it may pick up a new pod IP and hence avoid the blocklist. In this situation we may end up with two writers.

... if this is something we need to guard against, blocklisting at the IP level is the only option.

I have a hard time wrapping my head around this though. If workers can restart pods on their own without talking to the control plane, what is the procedure for adding a failed worker back to the cluster? When re-adding the node at the Kubernetes level, who/what would know to unblocklist it at the Ceph level?

@Madhu-1
Member Author

Madhu-1 commented Jun 4, 2021

What I see (unverified) is that pod network IPs can be assigned to pods across nodes at will. If a pod in a node is using IP1 and the node is deemed unavailable, we would blocklist IP1. Nothing prevents IP1 being reassigned to another pod in the cluster on another node. This is a denial of service rather than a potential data corruption issue.

I was suggesting blocklisting a Ceph entity address instead of the entire IP address, precisely to avoid the potential denial of service scenario. But ...

The other case is: if the pod is restarted on the unavailable node for any reason (local kubelet action on the node, again unverified), it may pick up a new pod IP and hence avoid the blocklist. In this situation we may end up with two writers.

... if this is something we need to guard against, blocklisting at the IP level is the only option.

I have a hard time wrapping my head around this though. If workers can restart pods on their own without talking to the control plane, what is the procedure for adding a failed worker back to the cluster? When re-adding the node at the Kubernetes level, who/what would know to unblocklist it at the Ceph level?

Two cases: if only the kubelet was dead/not reachable (the node lost contact with the control plane), the node will get auto-added when it comes back. If the node was dead and the admin has to fix the problems manually, then as far as I can tell, based on the application logs or pod health, the admin needs to unblocklist it, as the blocklisting is a manual process.
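
For reference, the manual un-blocklist step for a recovered node might look like this; the address is a placeholder:

  # Check what is currently blocklisted, then remove the entry for the recovered node/client.
  ceph osd blocklist ls
  ceph osd blocklist rm 192.168.121.52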

@github-actions

github-actions bot commented Sep 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
