Delegations for userland cephfs
===============================

To properly implement NFSv4 delegations in ganesha, we need something that
operates a little like Linux's fcntl(..., F_SETLEASE, ...). Ganesha needs to
be able to set a lease on a file, and then be issued a callback when the
lease is being revoked due to conflicting access by other clients.

Cephfs already has a facility for recallable state -- the caps system. I
think we can map NFSv4 delegation semantics on top of that.

At a high level, what I'm envisioning is something like this as an
application interface:

    typedef uint32_t ceph_deleg_t;                       // delegation token
    typedef int (*ceph_deleg_cb_t)(ceph_deleg_t deleg);  // recall callback

    int ceph_ll_request_deleg(struct ceph_mount *cmount, Fh *fh, unsigned type,
                              ceph_deleg_cb_t cb, ceph_deleg_t *deleg);
    int ceph_ll_return_deleg(struct ceph_mount *cmount, ceph_deleg_t deleg);

The type field would be something like READ or WRITE here, but we could make
it more granular too (CEPH_STATX mask maybe?).

The callback is a call into the application that will cue it to return the
delegations within a certain period of time. That period of time would need
to be determined (and probably be tunable), but you'd ideally want something
longer than two NFSv4 lease periods, so the server can give the clients ample
time to return them.

ceph_deleg_t is just an opaque token that we'd give the application to
represent a delegation. We could hand out pointers to the object here, but I
think we want to be able to vet return requests coming from the application.

When someone requests a deleg, we'd use get_caps to take references to a set
of caps and record them inside the container, if they're available. You'd
then attach the container to a list in the Inode and a list in the open Fh.
If the caps can't be acquired, return some distinct error code and don't set
the returned delegation token. When the Fh is closed, you'd drop any
delegations attached to it.
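
The token vetting described above could be sketched as follows. This is only
an illustration under stated assumptions, not the libcephfs implementation:
the fixed-size table, the generation/slot token layout, and the names
deleg_table, deleg_alloc and deleg_release are all hypothetical. It leaves out
the get_caps call and the Inode/Fh lists, and shows only how an integer token
lets the library reject stale or bogus return requests.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ceph_deleg_t;      /* opaque token, as in the proposal */

#define DELEG_SLOTS 64              /* hypothetical fixed table size */

struct deleg_entry {
    int      in_use;                /* slot currently holds a delegation */
    uint16_t gen;                   /* bumped each time the slot is released */
    unsigned type;                  /* READ or WRITE, as requested */
};

static struct deleg_entry deleg_table[DELEG_SLOTS];

/* Assumed token layout: high 16 bits = generation, low 16 bits = slot. */
static ceph_deleg_t make_token(uint16_t gen, uint16_t slot)
{
    return ((ceph_deleg_t)gen << 16) | slot;
}

/* Claim a free slot and hand back its token; -ENOSPC if the table is full.
 * On failure *out is left untouched, matching "don't set the token". */
static int deleg_alloc(unsigned type, ceph_deleg_t *out)
{
    for (uint16_t i = 0; i < DELEG_SLOTS; i++) {
        if (!deleg_table[i].in_use) {
            deleg_table[i].in_use = 1;
            deleg_table[i].type = type;
            *out = make_token(deleg_table[i].gen, i);
            return 0;
        }
    }
    return -ENOSPC;
}

/* Vet a token coming from the application, then release its slot. */
static int deleg_release(ceph_deleg_t tok)
{
    uint16_t slot = (uint16_t)(tok & 0xffff);
    uint16_t gen = (uint16_t)(tok >> 16);

    if (slot >= DELEG_SLOTS)
        return -EINVAL;             /* token cannot have come from us */
    if (!deleg_table[slot].in_use || deleg_table[slot].gen != gen)
        return -ENOENT;             /* stale or already-returned token */
    deleg_table[slot].in_use = 0;
    deleg_table[slot].gen++;        /* invalidates any copy of the old token */
    return 0;
}
```

With this layout, returning the same token twice fails with -ENOENT instead of
touching a freed object, which is the main argument for a token over a raw
pointer.
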
The application can always return them voluntarily at any time with
ceph_ll_return_deleg.

When the MDS wants to recall the caps, we'll issue the callback to the
application -- ganesha in this case. ganesha would then issue an NFS
CB_RECALL to the client(s) and eventually return the delegation once all of
the clients have returned their delegations. If that doesn't occur in a
certain amount of time (usually two NFSv4 lease periods -- 90s or so),
ganesha should drop all of the client's state, and return the deleg
unconditionally.

If the application doesn't do this, then you'd probably want libcephfs to
call abort() to forcibly kill off the client application. That should only be
a last resort, however, and applications should be coded to enforce timeouts
on their own so as to avoid being killed.

In order to do that, you would want libcephfs to set a timer when a recall is
performed, and tear it down when all delegations have been returned. (POSIX
timers would probably work well here, but maybe there's some better way in
ceph-land.)
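
The recall timer could look roughly like this with POSIX timers. This is a
sketch under assumptions rather than a proposed implementation: struct
recall_state and the recall_* functions are hypothetical names, the expiry
handler only sets a flag where a real client might call abort(), and the
locking a multi-threaded client would need around the counters is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

struct recall_state {
    timer_t    timer;        /* deadline for the outstanding recall */
    int        armed;        /* nonzero while the timer exists */
    int        outstanding;  /* delegations not yet returned */
    atomic_int expired;      /* set by the timer thread on timeout */
};

/* Runs in a helper thread once the deadline passes. */
static void recall_expired(union sigval sv)
{
    struct recall_state *rs = sv.sival_ptr;
    atomic_store(&rs->expired, 1);   /* last resort would go here */
}

/* Arm a one-shot deadline when a recall is issued for `count` delegations. */
static int recall_start(struct recall_state *rs, int count, long timeout_ms)
{
    struct sigevent sev;
    struct itimerspec its;

    if (count <= 0 || timeout_ms <= 0)
        return -1;                   /* a zero it_value would disarm the timer */

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = recall_expired;
    sev.sigev_value.sival_ptr = rs;
    if (timer_create(CLOCK_MONOTONIC, &sev, &rs->timer) != 0)
        return -1;

    /* Publish state before arming so an early expiry is not overwritten. */
    rs->outstanding = count;
    atomic_store(&rs->expired, 0);

    memset(&its, 0, sizeof(its));    /* it_interval stays 0: one-shot */
    its.it_value.tv_sec = timeout_ms / 1000;
    its.it_value.tv_nsec = (timeout_ms % 1000) * 1000000L;
    if (timer_settime(rs->timer, 0, &its, NULL) != 0) {
        timer_delete(rs->timer);
        return -1;
    }
    rs->armed = 1;
    return 0;
}

/* Called once per returned delegation; the last one tears the timer down. */
static void recall_deleg_returned(struct recall_state *rs)
{
    if (rs->armed && --rs->outstanding == 0) {
        timer_delete(rs->timer);
        rs->armed = 0;
    }
}
```

CLOCK_MONOTONIC is used here so that wall-clock adjustments do not shorten or
lengthen the grace period. On older glibc versions the timer functions may
require linking with -lrt.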