Feature #18490 » ceph-deleg.txt

Ceph delegation design document - Jeff Layton, 06/01/2017 03:04 PM

 
Delegations for userland cephfs
===============================

To properly implement NFSv4 delegations in ganesha, we need something
that operates a little like Linux's fcntl(..., F_SETLEASE, ...). Ganesha
needs to be able to set a lease on a file, and then be issued a callback
when the lease is being revoked due to conflicting access by other
clients.

Cephfs already has a facility for recallable state -- the caps system. I
think we can map NFSv4 delegation semantics on top of that.

At a high level, what I'm envisioning is something like this as an
application interface:

typedef uint32_t ceph_deleg_t; // delegation token
typedef int (*ceph_deleg_cb_t)(ceph_deleg_t deleg); // recall callback
// On success, the new delegation token is stored in *deleg.
int ceph_ll_request_deleg(struct ceph_mount *cmount, Fh *fh, unsigned type,
                          ceph_deleg_cb_t cb, ceph_deleg_t *deleg);
int ceph_ll_return_deleg(struct ceph_mount *cmount, ceph_deleg_t deleg);

The type field would be something like READ or WRITE here, but we could
make it more granular too (CEPH_STATX mask maybe?).

The callback is a call into the application that will cue it to return
the delegations within a certain period of time. That period of time
would need to be determined (and probably be tunable), but you'd
ideally want something longer than two NFSv4 lease periods, so the
server can give the clients ample time to return them.

ceph_deleg_t is just an opaque token that we'd give the application to
represent a delegation. We could hand out pointers to the object here,
but I think we want to be able to vet return requests coming from the
application.
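
As a rough illustration of why a token is easier to vet than a raw
pointer, here is a minimal sketch. The table, its size, the token layout
(generation in the high 16 bits, slot index in the low 16) and all of the
deleg_* helper names are assumptions made up for this example, not part
of the proposal:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ceph_deleg_t;            /* opaque token, as in the proposal */

#define DELEG_SLOTS 16                    /* hypothetical table size */

struct deleg_slot {
    bool     in_use;
    uint16_t generation;                  /* bumped each time the slot is reused */
    unsigned type;                        /* e.g. READ or WRITE */
};

static struct deleg_slot deleg_table[DELEG_SLOTS];

/* Assumed token layout: high 16 bits = generation, low 16 bits = slot index. */
static ceph_deleg_t make_token(uint16_t gen, uint16_t idx)
{
    assert(idx < DELEG_SLOTS);
    return ((uint32_t)gen << 16) | idx;
}

/* Allocate a slot and hand back a token; returns -1 if the table is full. */
static int deleg_alloc(unsigned type, ceph_deleg_t *out)
{
    for (uint16_t i = 0; i < DELEG_SLOTS; i++) {
        struct deleg_slot *s = &deleg_table[i];
        if (!s->in_use) {
            s->in_use = true;
            s->generation++;
            s->type = type;
            *out = make_token(s->generation, i);
            return 0;
        }
    }
    return -1;
}

/* Vet a token coming from the application; NULL means stale or bogus. */
static struct deleg_slot *deleg_lookup(ceph_deleg_t token)
{
    uint16_t idx = token & 0xffff;
    uint16_t gen = token >> 16;

    if (idx >= DELEG_SLOTS)
        return NULL;
    struct deleg_slot *s = &deleg_table[idx];
    if (!s->in_use || s->generation != gen)
        return NULL;
    return s;
}

/* Release the slot behind a token; -1 if the token does not check out. */
static int deleg_release(ceph_deleg_t token)
{
    struct deleg_slot *s = deleg_lookup(token);
    if (!s)
        return -1;
    s->in_use = false;
    return 0;
}
```

With something like this, a double return or a return of a token that was
never issued fails the lookup instead of dereferencing a dangling pointer.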

When someone requests a deleg, we'd use get_caps to take references to
a set of caps and record them inside the container, if they're available.
You'd then attach the thing to a list in the Inode and a list in the
open Fh.

If the caps aren't available, return a distinct error code and leave the
delegation token unset.

When the Fh is closed, you'd drop any delegations. The application can
always return them voluntarily at any time with ceph_ll_return_deleg.
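
To make the request/return/close flow concrete, here is a small mock of
it. Everything named mock_* is a hypothetical stand-in for the real Inode
and Fh objects, the caps check is reduced to a bitmask test rather than a
real get_caps call, -EAGAIN is only an assumed choice for the "distinct
error code", and the sketch keeps a single list on the Fh where the text
above proposes one on the Inode as well:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t ceph_deleg_t;

struct mock_deleg {
    ceph_deleg_t       token;
    unsigned           caps;       /* caps referenced on behalf of the holder */
    struct mock_deleg *next;
};

struct mock_inode {
    unsigned issued_caps;          /* caps that can be had without waiting */
    int      ndelegs;              /* delegations outstanding on this inode */
};

struct mock_fh {
    struct mock_inode *inode;
    struct mock_deleg *delegs;     /* delegations attached to this open Fh */
};

static ceph_deleg_t next_token = 1;

/* Returns 0 and sets *out on success. On failure *out is left untouched. */
static int mock_request_deleg(struct mock_fh *fh, unsigned want,
                              ceph_deleg_t *out)
{
    struct mock_inode *in = fh->inode;

    if ((in->issued_caps & want) != want)
        return -EAGAIN;            /* assumed error code for "not available" */

    struct mock_deleg *d = malloc(sizeof(*d));
    if (!d)
        return -ENOMEM;
    d->token = next_token++;
    d->caps = want;
    d->next = fh->delegs;
    fh->delegs = d;
    in->ndelegs++;
    *out = d->token;
    return 0;
}

/* Voluntary return by the application; unknown tokens are rejected. */
static int mock_return_deleg(struct mock_fh *fh, ceph_deleg_t token)
{
    for (struct mock_deleg **pp = &fh->delegs; *pp; pp = &(*pp)->next) {
        if ((*pp)->token == token) {
            struct mock_deleg *d = *pp;
            *pp = d->next;
            fh->inode->ndelegs--;
            free(d);
            return 0;
        }
    }
    return -ENOENT;
}

/* Closing the Fh drops whatever delegations are still attached to it. */
static void mock_close(struct mock_fh *fh)
{
    while (fh->delegs)
        mock_return_deleg(fh, fh->delegs->token);
    assert(fh->delegs == NULL);
}
```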

When the MDS wants to recall the caps, we'll issue the callback to the
application -- ganesha in this case. Ganesha would then issue an NFS
CB_RECALL to the client(s) and eventually return the delegation once all
of the clients have returned their delegations.

If that doesn't occur in a certain amount of time (usually two NFSv4 lease
periods -- 90s or so), ganesha should drop all of the client's state, and
return the deleg unconditionally.
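
On the application side, the decision logic could look roughly like the
sketch below. The recall_* names and the action enum are hypothetical and
the lease length is passed in rather than assumed. The only thing taken
from the text above is the rule: return once every client has handed its
delegation back, or revoke and return anyway once two lease periods have
passed.

```c
#include <assert.h>
#include <time.h>

enum recall_action {
    RECALL_WAIT,               /* still waiting on NFS clients */
    RECALL_RETURN,             /* all clients done: return the deleg */
    RECALL_REVOKE_AND_RETURN   /* deadline hit: drop client state, then return */
};

struct recall_state {
    int    outstanding;        /* clients that have not returned their delegation */
    time_t deadline;           /* absolute time by which the deleg must go back */
};

/* Called from the recall callback, after CB_RECALL has been sent out. */
static void recall_begin(struct recall_state *rs, int nclients,
                         time_t now, time_t lease_secs)
{
    assert(nclients >= 0 && lease_secs > 0);
    rs->outstanding = nclients;
    rs->deadline = now + 2 * lease_secs;   /* two lease periods */
}

/* Called each time an NFS client returns its delegation. */
static void recall_client_returned(struct recall_state *rs)
{
    if (rs->outstanding > 0)
        rs->outstanding--;
}

/* Decide what to do next; meant to be polled or run from a timer. */
static enum recall_action recall_poll(const struct recall_state *rs, time_t now)
{
    if (rs->outstanding == 0)
        return RECALL_RETURN;
    if (now >= rs->deadline)
        return RECALL_REVOKE_AND_RETURN;
    return RECALL_WAIT;
}
```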

If the application doesn't do this, then you'd probably want libcephfs
to call abort() to forcibly kill off the client application. That should
only be a last-resort sort of thing however, and applications should be
coded to enforce timeouts on their own so as to avoid being killed.

In order to do that, you would want libcephfs to set a timer when a
recall is performed, and tear it down when all delegations have been
returned. (POSIX timers would probably work well here, but maybe there's
some better way in ceph-land.)
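
For reference, a minimal sketch of the POSIX timer variant. The recall_*
function names and the flag are made up for this example. The calls
themselves (timer_create, timer_settime, timer_delete with SIGEV_THREAD
notification) are standard POSIX. The expiry handler only sets a flag so
the sketch is safe to run. In the scheme described above, this is where
the abort() would go.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

static atomic_int recall_expired;      /* set to 1 when the timer fires */

/* Runs in a helper thread on expiry (SIGEV_THREAD notification). */
static void recall_timeout(union sigval sv)
{
    (void)sv;
    atomic_store(&recall_expired, 1);  /* stand-in for the last-resort abort() */
}

/* Arm a one-shot timer at the point where the recall callback is issued. */
static int recall_timer_start(timer_t *t, long timeout_ms)
{
    struct sigevent sev;
    struct itimerspec its;

    assert(timeout_ms > 0);
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = recall_timeout;
    if (timer_create(CLOCK_MONOTONIC, &sev, t) != 0)
        return -1;

    memset(&its, 0, sizeof(its));      /* it_interval stays 0: fire once */
    its.it_value.tv_sec = timeout_ms / 1000;
    its.it_value.tv_nsec = (timeout_ms % 1000) * 1000000L;
    if (timer_settime(*t, 0, &its, NULL) != 0) {
        timer_delete(*t);
        return -1;
    }
    return 0;
}

/* Tear the timer down once all delegations have been returned. */
static int recall_timer_stop(timer_t t)
{
    return timer_delete(t);
}
```

Older glibc releases may need -lrt at link time for the timer calls. Note
that teardown can race with an expiry that is already in flight, so a real
implementation would need to handle the handler running after the last
delegation has come back.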