Delegations for userland cephfs
===============================

To properly implement NFSv4 delegations in ganesha, we need something
that operates a little like Linux's fcntl(..., F_SETLEASE, ...). Ganesha
needs to be able to set a lease on a file, and then be issued a callback
when the lease is being revoked due to conflicting access by other
clients.

Cephfs already has a facility for recallable state -- the caps system. I
think we can map NFSv4 delegation semantics on top of that.

At a high level, what I'm envisioning is something like this as an
application interface:

    typedef uint32_t ceph_deleg_t;                      // delegation token
    typedef int (*ceph_deleg_cb_t)(ceph_deleg_t deleg); // recall callback

    // On success, *deleg receives the token for the new delegation.
    int ceph_ll_request_deleg(struct ceph_mount *cmount, Fh *fh, unsigned type,
                              ceph_deleg_cb_t cb, ceph_deleg_t *deleg);
    int ceph_ll_return_deleg(struct ceph_mount *cmount, ceph_deleg_t deleg);

The type field would be something like READ or WRITE here, but we could
make it more granular too (CEPH_STATX mask maybe?).

The callback is a call into the application that will cue it to return
the delegations within a certain period of time. That period of time
would need to be determined (and probably be tunable), but you'd
ideally want something longer than two NFSv4 lease periods, so the
server can give the clients ample time to return them.
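
As a rough illustration of that relationship, here is a minimal sketch in C.
The function name and the margin parameter are made up for this example and
aren't part of the proposed interface; the only thing taken from the text
above is that the timeout should come out longer than two lease periods.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive the libcephfs recall timeout from the
 * NFSv4 lease period configured on the server. "margin_secs" is an
 * assumed tunable; it is forced to at least 1 so the result is always
 * strictly longer than two lease periods.
 */
uint32_t deleg_recall_timeout(uint32_t lease_secs, uint32_t margin_secs)
{
    if (margin_secs == 0)
        margin_secs = 1;
    return 2 * lease_secs + margin_secs;
}
```

Whether the margin should be a fixed number of seconds or a fraction of the
lease period is an open question; the sketch just picks the simpler option.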

ceph_deleg_t is just an opaque token that we'd give the application to
represent a delegation. We could hand out pointers to the object here,
but I think we want to be able to vet return requests coming from the
application.

When someone requests a deleg, we'd use get_caps to take references to
a set of caps, if they're available, and record them inside the
delegation object (the container that the token refers to). You'd then
attach that object to a list in the Inode and a list in the open Fh.

If the caps can't be obtained, return some distinct error code and don't
set the return delegation token.

When the Fh is closed, you'd drop any delegations. The application can
always return them voluntarily at any time with ceph_ll_return_deleg.
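
To make the token idea a bit more concrete, here's a minimal sketch in C of
how tokens could be issued and vetted. Everything in it other than
ceph_deleg_t is hypothetical: the fixed-size table, the function names, the
caps_available flag standing in for the result of get_caps, and the choice
of -EAGAIN as the "distinct error code". It leaves out the Inode and Fh
lists entirely.

```c
#include <errno.h>
#include <stdint.h>

typedef uint32_t ceph_deleg_t;      /* opaque token, as proposed above */

#define MAX_DELEGS 64               /* arbitrary size for this sketch */

struct deleg_slot {
    int          in_use;
    ceph_deleg_t token;             /* token handed to the application */
    unsigned     type;              /* READ/WRITE or similar */
};

static struct deleg_slot deleg_table[MAX_DELEGS];
static ceph_deleg_t next_token = 1; /* wraparound is ignored in this sketch */

/* Issue a token. caps_available stands in for the outcome of get_caps. */
int deleg_request(int caps_available, unsigned type, ceph_deleg_t *out)
{
    if (!caps_available)
        return -EAGAIN;             /* assumed error code; *out is untouched */
    for (int i = 0; i < MAX_DELEGS; i++) {
        if (!deleg_table[i].in_use) {
            deleg_table[i].in_use = 1;
            deleg_table[i].token = next_token++;
            deleg_table[i].type = type;
            *out = deleg_table[i].token;
            return 0;
        }
    }
    return -ENOSPC;                 /* table full */
}

/* Return a delegation, rejecting tokens that were never issued or are stale. */
int deleg_return(ceph_deleg_t token)
{
    for (int i = 0; i < MAX_DELEGS; i++) {
        if (deleg_table[i].in_use && deleg_table[i].token == token) {
            deleg_table[i].in_use = 0;
            return 0;
        }
    }
    return -ENOENT;
}
```

The point of the lookup in deleg_return is that a bogus or already-returned
token gets an error back instead of being dereferenced, which is what handing
out raw pointers would not give us.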

When the MDS wants to recall the caps, we'll issue the callback to the
application -- ganesha in this case. Ganesha would then issue an NFS
CB_RECALL to the client(s) and eventually return the delegation once all
of the clients have returned their delegations.

If that doesn't occur in a certain amount of time (usually two NFSv4 lease
periods -- 90s or so), ganesha should drop all of the client's state, and
return the deleg unconditionally.
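
The application side of that could be tracked with something like the
following sketch. This is not ganesha code; the struct, the enum and the
function names are all invented for illustration, and the current time is
passed in as a plain number of seconds so that the logic doesn't depend on
any particular clock or timer facility.

```c
#include <stdint.h>

enum recall_action {
    RECALL_WAIT,              /* clients still hold delegations; keep waiting */
    RECALL_RETURN,            /* all clients returned; hand the deleg back */
    RECALL_REVOKE_AND_RETURN  /* timed out: drop client state, then return */
};

struct recall_state {
    unsigned outstanding;     /* NFS clients that haven't returned yet */
    uint64_t deadline;        /* absolute time, in seconds */
};

/* Called from the recall callback, once CB_RECALL has been sent out. */
void recall_start(struct recall_state *rs, unsigned nclients,
                  uint64_t now, uint64_t lease_secs)
{
    rs->outstanding = nclients;
    rs->deadline = now + 2 * lease_secs;   /* two lease periods */
}

/* Called when one NFS client returns its delegation. */
void recall_client_returned(struct recall_state *rs)
{
    if (rs->outstanding > 0)
        rs->outstanding--;
}

/* Decide what the application should do at time "now". */
enum recall_action recall_poll(const struct recall_state *rs, uint64_t now)
{
    if (rs->outstanding == 0)
        return RECALL_RETURN;
    if (now >= rs->deadline)
        return RECALL_REVOKE_AND_RETURN;
    return RECALL_WAIT;
}
```

On RECALL_RETURN or RECALL_REVOKE_AND_RETURN the application would call
ceph_ll_return_deleg; in the second case it would first throw away the
unresponsive client's state.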

If the application doesn't do this, then you'd probably want libcephfs
to call abort() to forcibly kill off the client application. That should
only be a last-resort sort of thing however, and applications should be
coded to enforce timeouts on their own so as to avoid being killed.

In order to do that, you would want libcephfs to set a timer when a
recall is performed, and tear it down when all delegations have been
returned. (POSIX timers would probably work well here, but maybe there's
some better way in ceph-land.)
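
The bookkeeping for that timer might look roughly like the sketch below. It
deliberately leaves out the timer mechanism itself (POSIX timers or
otherwise) and takes the current time as a parameter; the struct and
function names are hypothetical, and a real version would need locking.

```c
#include <stdint.h>

struct deleg_timer {
    int      armed;     /* nonzero while a recall is outstanding */
    unsigned pending;   /* recalled delegations not yet returned */
    uint64_t expires;   /* absolute expiry time, in seconds */
};

/* A recall was issued: count it, and arm the timer if it isn't running. */
void deleg_timer_recall(struct deleg_timer *t, uint64_t now, uint64_t timeout)
{
    t->pending++;
    if (!t->armed) {
        t->armed = 1;
        t->expires = now + timeout;
    }
}

/* A recalled delegation came back: tear the timer down once none remain. */
void deleg_timer_returned(struct deleg_timer *t)
{
    if (t->pending > 0 && --t->pending == 0)
        t->armed = 0;
}

/* Nonzero if the application overran the timeout; the caller would abort(). */
int deleg_timer_expired(const struct deleg_timer *t, uint64_t now)
{
    return t->armed && now >= t->expires;
}
```

One assumption baked in here is that overlapping recalls share a single
timer that keeps the earliest deadline. A timer per delegation would be
more precise, at the cost of more state to manage.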