[Question] How does nfs-ganesha avoid state reclaimed in edge conditions? #1130

zhitaoli-6 · 2024-05-10T14:20:07Z

According to NFS 4.1 RFC 8881, Section 8.4.3

When a network partition is combined with a server restart, there are edge conditions that place requirements on the server in order to avoid silent data corruption following the server restart. Two of these edge conditions are known, and are discussed below.
...

There are cases where an NFS client may reclaim a lock that was granted to another client while the first client could not communicate with the NFS server. The NFS server must handle these cases, either by rejecting all lock reclaims or by checking lock reclaims in a fine-grained way, which requires recording more information in stable storage.

I have not seen these considerations addressed in our nfs-ganesha implementation. nfs-ganesha handles OP_RECLAIM_COMPLETE very simply, without marking the co_ownerid in the RecoveryBackend. And when a clientid expires, nfs-ganesha doesn't mark the co_ownerid as expired in the RecoveryBackend, either.

So in order to avoid these cases, the FSAL has to record enough information to reject lock reclaims if those locks have been granted to others.

Could our community give some ideas about this issue?

ffilz · 2024-05-10T18:08:33Z

The server restart cases are handled by the fact that on restart, Ganesha reads the set of clients allowed to reclaim locks from the recovery database. It then empties the recovery database. Any client that is still partitioned, and thus unable to reclaim state, will not be present in the new recovery database and will not be allowed to reclaim locks after a subsequent restart.
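
As a rough illustration of the restart handling described above, here is a minimal in-memory sketch. The `RecoveryDB` and `Server` classes and their method names are hypothetical stand-ins, not Ganesha's actual recovery backend API; the point is only the "read the allowed set, then empty the database" step.

```python
class RecoveryDB:
    """Hypothetical stand-in for a recovery database kept in stable storage."""

    def __init__(self):
        self.stable = set()  # client ids that survive a server restart

    def record_client(self, clientid):
        self.stable.add(clientid)

    def load_and_clear(self):
        # Read the set of clients allowed to reclaim, then empty the database.
        allowed = set(self.stable)
        self.stable.clear()
        return allowed


class Server:
    """One server instance; constructing it models a restart."""

    def __init__(self, db):
        self.db = db
        self.allowed_to_reclaim = db.load_and_clear()

    def establish_client(self, clientid):
        # A client that reconnects is recorded in the new recovery database.
        self.db.record_client(clientid)

    def may_reclaim(self, clientid):
        return clientid in self.allowed_to_reclaim
```

With clients A and B recorded before the first restart, if only B reconnects, a second restart leaves B as the only client allowed to reclaim; A, still partitioned, is no longer in the database.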

Ganesha also supports courteous server semantics, where a client that doesn't renew in time doesn't lose its state. However, if any conflicting state is requested, the non-renewed client is immediately expired and all of its state is released.
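
The courteous behaviour can be sketched roughly as follows. This is a toy lock table, not Ganesha code: the class name, the single-owner-per-resource model, and the explicit `now` timestamps are all assumptions made for illustration.

```python
class CourteousLockTable:
    """Toy model: one exclusive lock per resource, leases tracked per client."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.last_renew = {}  # client -> time of last renewal
        self.locks = {}       # resource -> owning client

    def renew(self, client, now):
        self.last_renew[client] = now

    def _lease_expired(self, client, now):
        last = self.last_renew.get(client, float("-inf"))
        return now - last > self.lease_seconds

    def _expire(self, client):
        # Expiring a client drops all of its state, not just one lock.
        self.last_renew.pop(client, None)
        for res in [r for r, owner in self.locks.items() if owner == client]:
            del self.locks[res]

    def acquire(self, client, resource, now):
        self.renew(client, now)  # any request counts as a renewal here
        owner = self.locks.get(resource)
        if owner is not None and owner != client:
            if not self._lease_expired(owner, now):
                return False      # real conflict with a live client
            self._expire(owner)   # courtesy ends only when a conflict arrives
        self.locks[resource] = client
        return True
```

A client whose lease has lapsed keeps its locks until another client asks for a conflicting one; at that moment the stale client loses everything it held.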

zhitaoli-6 · 2024-05-11T06:59:52Z

It seems that the following case is not handled by nfs-ganesha.

The second known edge condition arises in situations such as the following:
1. Client A acquires one or more locks.
2. Server restarts.
3. Client A and server experience mutual network partition, such that client A is unable to reclaim all of its locks within the grace period.
4. Server's reclaim grace period ends. Client A has either no locks or an incomplete set of locks known to the server.
5. Client B acquires a lock that would have conflicted with a lock of client A that was not reclaimed.
6. Client B releases the lock.
7. Server restarts a second time.
8. Network partition between client A and server heals.
9. Client A connects to new server instance and finds out about server restart.
10. Client A reclaims its lock within the server's grace period.

ffilz · 2024-05-13T19:37:56Z

Ah, OK, you're concerned about the case where the client has started reclaiming (or at least done a new SETCLIENTID or EXCHANGE_ID) but not completed reclaim after the first server restart. In that case, the server records the clientid in the recovery database, which would allow it to reclaim after the second restart.

We do handle RECLAIM_COMPLETE so we could utilize that to only harden clientids established during grace period once RECLAIM_COMPLETE is processed.
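
To make the difference concrete, here is a small simulation of the second edge condition under both policies. Everything in it is hypothetical (the `Server` class, the `harden_on_reclaim_complete` flag, the plain `set` standing in for stable storage), and it deliberately ignores what happens if the server restarts again before grace ends.

```python
class Server:
    """One server instance; constructing it models a restart entering grace."""

    def __init__(self, stable, harden_on_reclaim_complete):
        self.stable = stable                # set shared across restarts
        self.harden_on_reclaim_complete = harden_on_reclaim_complete
        self.allowed = set(stable)          # clients allowed to reclaim
        stable.clear()
        self.in_grace = True

    def establish(self, cid):
        # Models SETCLIENTID / EXCHANGE_ID from a returning client.
        if not (self.in_grace and self.harden_on_reclaim_complete):
            self.stable.add(cid)            # hardened as soon as it reconnects

    def reclaim_complete(self, cid):
        self.stable.add(cid)                # hardened once reclaim has finished

    def end_grace(self):
        self.in_grace = False

    def may_reclaim(self, cid):
        return self.in_grace and cid in self.allowed


def client_a_can_reclaim_after_second_restart(harden_on_reclaim_complete):
    stable = {"A"}                          # A held locks before restart 1
    s1 = Server(stable, harden_on_reclaim_complete)
    s1.establish("A")                       # A reconnects, then is partitioned
    s1.end_grace()                          # B may now take A's old lock
    s2 = Server(stable, harden_on_reclaim_complete)  # second restart
    return s2.may_reclaim("A")
```

When clientids are hardened at establishment, the function returns `True`: client A is wrongly allowed to reclaim. When hardening waits for RECLAIM_COMPLETE, it returns `False`.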

zhitaoli-6 · 2024-05-14T03:28:23Z

Yeah. We need to enhance RECLAIM_COMPLETE.

zhitaoli-6 · 2024-05-17T09:26:33Z

Is there any plan to fix this issue?

We may fix this as follows:

On OP_RECLAIM_COMPLETE, we mark the clientid as reclaim_completed in the recovery backend through some new API.
When grace starts, we load cid_recover_tag, the reclaim_completed list, and whether the last grace period terminated (denoted by end_grace) from the recovery backend.

If cid_recover_tag matches and end_grace is true, but reclaim_completed is false, the clientid is not allowed to reclaim.

This design changes a lot in recovery_backend and requires adapting the recovery_backend implementations.
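
The proposed check can be written down as a small predicate. This is only a sketch: `RecoveryRecord` and `may_reclaim` are hypothetical names, and the handling of the cases the proposal does not spell out (an unknown tag is rejected; a record is accepted when end_grace is false) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRecord:
    cid_recover_tag: str
    reclaim_completed: bool


def may_reclaim(records, end_grace, tag):
    """Decide whether the client identified by `tag` may reclaim state.

    records:   dict mapping cid_recover_tag -> RecoveryRecord, as loaded
               from the recovery backend when grace starts
    end_grace: True if the previous grace period terminated
    """
    record = records.get(tag)
    if record is None:
        return False  # assumption: clients with no record cannot reclaim
    if end_grace and not record.reclaim_completed:
        return False  # the rule from the proposal
    return True       # assumption: all remaining cases are allowed
```

For example, a record with `reclaim_completed=False` is rejected when `end_grace` is true, but still accepted when the previous grace period never terminated.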

ffilz · 2024-05-17T17:27:24Z

Code contributions are always welcome. Otherwise, it's in my backlog.

zhitaoli-6 · 2024-05-20T07:05:28Z

> Ah, OK, you're concerned about the case where the client has started reclaiming (or at least done a new SETCLIENTID or EXCHANGE_ID) but not completed reclaim after the first server restart. In that case, the server records the clientid in the recovery database, which would allow it to reclaim after the second restart.
>
> We do handle RECLAIM_COMPLETE so we could utilize that to only harden clientids established during grace period once RECLAIM_COMPLETE is processed.

According to NFS 4.0 RFC 7530 Section 9.6.3.4 Edge Conditions:

When a network partition is combined with a server reboot, then both
the server and client have responsibilities to ensure that the client
does not reclaim a lock that it should no longer be able to access.
Briefly, those are:
Client's responsibility: A client MUST NOT attempt to reclaim any
locks that it did not hold at the end of its most recent
successfully established client lease.

If the NFS client fulfills its responsibility, it won't try to reclaim a lock again if it started reclaiming (via SETCLIENTID or EXCHANGE_ID) but failed to reclaim that lock. In that case, the scenario above cannot occur.

However, in my testing, the Linux NFS client (CentOS 8) doesn't fulfill this responsibility. I also can't find a description of the NFS client's responsibility in NFSv4.1 RFC 8881.

Could you share some information about NFS clients' responsibility in edge conditions?

ffilz · 2024-05-20T15:07:25Z

Preferably, the client would not try to reclaim if it detects an edge case, but we can't totally trust the client...

So we should take more care, and I think we can do a better job here. We just need someone with time to make the fix...

zhitaoli-6 · 2024-05-22T07:54:21Z

A patch has been submitted. It adds enforcement to reject reclaim in some edge conditions. No recovery_backend enables this feature yet. The patch focuses on the common mechanism; patches that enable the feature in specific backends can be submitted later as needed.

ffilz · 2024-05-22T15:56:02Z

I'm wondering exactly how this is going to work...

We need to track clients that started to reclaim, but we need to make an atomic switch at the point we end grace. Before we end grace, if we restart again, we still need to use the old recovery list since no client lost state whether it started reclaim at all or finished. At the point grace ends, any clients that didn't finish reclaim state are now out of luck. Or does the recovery system already handle that somehow?

Also, don't we need to do something here for NFSv4.0 clients?

In theory we also should do something for NFSv3 clients, but that's all in how statd works...

zhitaoli-6 · 2024-05-23T04:02:23Z

For now, let's focus on NFSv4.1+ clients. For each clientid added to the recovery_backend, we also record whether it completed reclaim and whether the last grace period terminated. We don't distinguish whether OP_RECLAIM_COMPLETE is called during or after the grace period. Once the op has been called, the client must not reclaim state anymore.

> We need to track clients that started to reclaim, but we need to make an atomic switch at the point we end grace. Before we end grace, if we restart again, we still need to use the old recovery list since no client lost state whether it started reclaim at all or finished. At the point grace ends, any clients that didn't finish reclaim state are now out of luck. Or does the recovery system already handle that somehow?

Indeed, our recovery_db is implemented much as described above.
There are two fields, reclaiming_clids and clientids.

On EVENT_TAKE_VIP, the VIP enters the grace period; clientids is loaded, and each new clientid is recorded in reclaiming_clids.
When end_grace is called, reclaiming_clids atomically replaces clientids and is then reset; from then on, any new clientid is recorded in clientids. So in our case, clientids contains exactly the clientids for which the grace period has terminated.

If a clientid in clientids has no OP_RECLAIM_COMPLETE recorded, it is not allowed to reclaim during the next grace period.
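
The two-field scheme can be sketched as below. The field names reclaiming_clids and clientids come from the description above; the class itself, its method names, and the choice to reset reclaiming_clids whenever grace starts are assumptions made for illustration, with the object standing in for stable storage.

```python
class RecoveryDB:
    """In-memory stand-in for a recovery_db with the two fields above."""

    def __init__(self):
        self.clientids = {}         # clientid -> reclaim_completed flag
        self.reclaiming_clids = {}  # clientids established during grace
        self.in_grace = False

    def start_grace(self):
        # Called on every (re)start. clientids is untouched until end_grace,
        # so a restart in the middle of grace still sees the old list.
        self.in_grace = True
        self.reclaiming_clids = {}
        return {cid for cid, done in self.clientids.items() if done}

    def record_client(self, cid):
        target = self.reclaiming_clids if self.in_grace else self.clientids
        target[cid] = False

    def reclaim_complete(self, cid):
        target = self.reclaiming_clids if self.in_grace else self.clientids
        if cid in target:
            target[cid] = True

    def end_grace(self):
        # The atomic switch: the list built during grace replaces the old one.
        self.clientids = self.reclaiming_clids
        self.reclaiming_clids = {}
        self.in_grace = False
```

`start_grace()` returns the set of clientids allowed to reclaim. A client that reconnected during grace but never sent OP_RECLAIM_COMPLETE drops out of that set only after `end_grace()` has run; if the server restarts before then, the old list is still in effect.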

zhitaoli-6 · 2024-05-23T04:36:10Z

It is not easy to determine whether the last grace period terminated. Can we simplify the solution by checking only whether the clientid completed reclaim? If end_grace() was not called, the old clientids from before the grace period MAY (depending on the implementation) still exist in recovery_db. In that case the NFS client is still allowed to reclaim, even though its new clientid without reclaim complete is not.

zhitaoli-6 · 2024-05-30T08:21:27Z

What's your opinion? If there is anything wrong, please point it out.

zhitaoli-6 changed the title ~~How does nfs-ganesha avoid state reclaimed in edge conditions?~~ [Question] How does nfs-ganesha avoid state reclaimed in edge conditions? May 10, 2024

ffilz added the question label May 10, 2024

ffilz added bug and removed question labels May 13, 2024
