Feature #18490

client: implement delegation support in userland cephfs

Added by Jeff Layton about 2 years ago. Updated 12 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: nfs-ganesha
Target version: -
Start date: 01/11/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Reviewed:
Affected Versions:
Component(FS): Client, Ganesha FSAL
Labels (FS):
Pull request ID:

Description

To properly implement NFSv4 delegations in ganesha, we need something that operates a little like Linux's fcntl(..., F_SETLEASE, ...). Ganesha needs to be able to set a lease on a file, and then be issued a callback when the lease is being revoked.

Fortunately, ceph already has a facility for recallable state -- the caps system. I think we can map the semantics we need on top of that.

At a high level, what I'm envisioning is something like this:

int ceph_ll_setlease(struct ceph_mount *cmount, Fh *fh, int cmd, unsigned mask, setlease_callback_t cb);

This function would create a "ceph_lease" object to hang off the inode with a CEPH_STATX_* mask that indicates what attributes we want to get a lease on. That object would use get_caps to get references to the required caps and then hold them there. When the MDS wants to recall the caps, we'll issue the callback to the application (ganesha in this case).

ganesha would then issue an NFS CB_RECALL and eventually drop the lease via another ceph_ll_setlease call once the client returns the delegation. If that doesn't occur in a certain amount of time (usually two NFSv4 lease periods of 90s or so each), we'll drop the lease unconditionally (and maybe abort() the program?).
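To make the proposed flow concrete, here is a minimal, self-contained sketch of a lease object that records a mask and a callback and fires the callback when a conflicting recall arrives. Every name in it (toy_lease, toy_setlease, toy_recall, toy_unlease, count_recall) is hypothetical and exists only for illustration; it is not the libcephfs interface, and a plain bitmask stands in for real cap handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are libcephfs names. */
typedef void (*lease_recall_cb)(void *priv);

struct toy_lease {
    unsigned mask;        /* attributes/caps the lease pins */
    lease_recall_cb cb;   /* called when the pinned caps are being recalled */
    void *priv;           /* opaque pointer handed back to the callback */
    int active;           /* 1 while the application holds the lease */
};

/* Take a lease: record the mask and callback, then mark it active. */
int toy_setlease(struct toy_lease *l, unsigned mask,
                 lease_recall_cb cb, void *priv)
{
    if (l == NULL || cb == NULL)
        return -1;
    l->mask = mask;
    l->cb = cb;
    l->priv = priv;
    l->active = 1;
    return 0;
}

/*
 * Simulate a recall of the caps in 'wanted'. The holder is notified only
 * if the lease is active and pins at least one of those caps.
 * Returns 1 if the callback was issued, 0 otherwise.
 */
int toy_recall(struct toy_lease *l, unsigned wanted)
{
    if (!l->active || (l->mask & wanted) == 0)
        return 0;
    l->cb(l->priv);
    return 1;
}

/* Return the lease, as the application would after the NFS client complies. */
void toy_unlease(struct toy_lease *l)
{
    l->active = 0;
}

/* Example callback: counts how many recalls were delivered. */
void count_recall(void *priv)
{
    ++*(int *)priv;
}
```

In this sketch the callback is only a notification; the application is still expected to return the lease itself, which matches the description above.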

One nice thing here is that this shouldn't require any MDS changes (though we may need to work out how to ensure that the client doesn't get evicted).

I think that this mechanism would also be suitable for implementing cluster coherent oplocks for samba as well.

ceph-deleg.txt - Ceph delegation design document (2.96 KB) Jeff Layton, 06/01/2017 03:04 PM

ceph-delegation.pcap.pcapng - v4.0 read delegation and recall (9.25 KB) Jeff Layton, 08/31/2017 01:03 PM


Related issues

Copied to fs - Backport #22407: luminous: client: implement delegation support in userland cephfs Resolved

History

#1 Updated by John Spray about 2 years ago

  • Category set to nfs-ganesha

I've created an nfs-ganesha category to match our Samba category.

#2 Updated by Greg Farnum about 2 years ago

This is basically what we've discussed previously in this area. My main concern is just designing an interface that can be used effectively by the external clients without letting Ganesha bugs break access to the CephFS system. I'm thinking we need to do lease breaking internally (inside our Client) on timeouts, rather than relying on Ganesha code to be correct. Similarly we'll need to make sure Clients can keep sessions alive while they are waiting; I'm not sure how hard it'll be to block progress on something like cap recalls with an external blocker.

#3 Updated by Jeff Layton about 2 years ago

Matt B. also had some upcall/invalidate work that may be relevant here that he has in these branches:

https://github.com/linuxbox2/nfs-ganesha/tree/ceph-invalidates
https://github.com/linuxbox2/ceph/tree/invalidate

#4 Updated by John Spray about 2 years ago

Ah, I think I had (incorrectly) assumed that the work Matt did on invalidations before had been merged, but if that's not the case we'll need to progress that. Created http://tracker.ceph.com/issues/18537

#5 Updated by John Spray almost 2 years ago

  • Tracker changed from Bug to Feature

#6 Updated by Jeff Layton almost 2 years ago

I started taking a look at this. One thing we have to solve first is that I don't think there is any automatic resolution when the MDS revokes caps from a client and the client decides not to return them in a timely fashion. I think we'll need to implement that first.

I think we need to have the MDS set a timer when caps are first revoked, and then do this if they are not returned before the timer pops:

http://docs.ceph.com/docs/master/cephfs/eviction/

Only after that should we hand out potentially conflicting caps. Even better would be to only evict the client when we have a request that conflicts with the caps it holds, and the timer has expired.

We'll probably also want to allow an interface for setting that timer, or fetching its value. For ganesha, we'd want to set it on the order of 180s (2 lease periods).

#7 Updated by John Spray almost 2 years ago

So the "client completely unresponsive but only evict it when someone else wants its caps" case is http://tracker.ceph.com/issues/17854

I think this is a more specific "client is responsive but failing to give up certain caps..." case, right? That would be the condition that we currently detect as a health warning (MDS_HEALTH_CLIENT_LATE_RELEASE), promoted to evicting clients instead of just warning.

However, we don't really want to evict the cephfs client (ganesha) in this case, right? I was thinking we should be calling up to ganesha to say "hey, can you get rid of the NFS client who is holding onto this resource?" (in the most hand-wavy way possible)

#8 Updated by Jeff Layton almost 2 years ago

John Spray wrote:

So the "client completely unresponsive but only evict it when someone else wants its caps" case is http://tracker.ceph.com/issues/17854

Good, and yeah, there is little reason to evict a client unless there is a conflict. So that is sort of built into the design here. If it can come back into the good graces of the MDS at some point, so much the better.

I think this is a more specific "client is responsive but failing to give up certain caps..." case, right? That would be the condition that we currently detect as a health warning (MDS_HEALTH_CLIENT_LATE_RELEASE), promoted to evicting clients instead of just warning.

Yes, that's a case I'm concerned with. Thanks for the MDS_HEALTH_CLIENT_LATE_RELEASE pointer, I'll plan to look at that soon.

However, we don't really want to evict the cephfs client (ganesha) in this case, right? I was thinking we should be calling up to ganesha to say "hey, can you get rid of the NFS client who is holding onto this resource?" (in the most hand-wavy way possible)

tl;dr: Yes, we do want to evict the client in this case.

Longer explanation:

When the application requests a delegation or lease (which is just a container that holds cap references), it passes in a function pointer to be called when the caps are being revoked. That function just acts as a notification that the state is being recalled.

At that point, the application should get a reasonable period of time to return those caps, after which we'll evict it if it doesn't return them.

With ganesha, we'll have it request a delegation from ceph when a file is opened (and there is no contention for it). If it gets the delegation, then it should hand it out to the NFS client. When the callback is called (someone needs conflicting caps), ganesha should issue an NFSv4 CB_RECALL to recall the delegation from the NFS client.

The NFS client will eventually return it, and ganesha will give it back to ceph. If the client doesn't return the delegation within 2 lease periods, then ganesha should revoke the NFS client's lease and all of its state. At that point, the delegation will be returned to ceph and the cap references released.

Still...we can't fully rely on that -- we must be able to deal with applications that aren't cooperating. So we do still need to evict ganesha from the ceph cluster if that fails to occur within a certain time period. That should really only happen in the case of application bugs though, and it's part of the contract here for getting the lease in the first place.

The tricky bit is how to sort out the different timeouts involved. The timeout to evict a ceph client must be longer than the timeout that ganesha uses to evict NFS clients. That may mean that we need tunables, or some way for applications to know what the ceph eviction timeout is. NFS pretty much demands that you wait at least 2 lease periods before evicting the client, and a typical lease period is 90s. ISTR that the timeout for smb oplocks is lower than that, but I don't recall exactly.

So, the upshot there is that ideally we'd like the ceph cap timeout to be >180s. Maybe 210s or so just to be on the safe side?
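The arithmetic behind those numbers can be written down as a small sketch. The function names and the idea of a separate "margin" parameter are assumptions made for illustration; no such tunables are defined in this ticket.

```c
#include <assert.h>

/*
 * Hypothetical helper: time the NFS server waits before revoking an
 * unreturned delegation, taken here as two NFSv4 lease periods.
 */
unsigned ex_nfs_revoke_after(unsigned lease_period_s)
{
    return 2 * lease_period_s;
}

/*
 * Hypothetical helper: the ceph-side eviction timeout must be longer than
 * the NFS-side one, so add a safety margin on top of it.
 */
unsigned ex_ceph_evict_after(unsigned lease_period_s, unsigned margin_s)
{
    return ex_nfs_revoke_after(lease_period_s) + margin_s;
}
```

With a 90s lease period, ex_nfs_revoke_after(90) gives 180 and a 30s margin gives the 210s figure suggested above.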

#9 Updated by Jeff Layton almost 2 years ago

Zheng asked a pointed question about this today, so to be clear...

This would be 100% an opportunistic thing. You only want to hand out a delegation if there is no existing conflicting access. So, you only give out a read delegation if you know that no one has it open for write, and only give out a write delegation if no one else has it open at all.

In knfsd we even wait a little while (~30s or so) before handing one out for the same filehandle after one was recalled. We'll probably want to do the same here, but my current thinking is to handle that in ganesha or samba, and allow ceph to hand them out as long as there aren't any conflicts.

I think we'd only want to hand out a write delegation if we can get CEPH_CAP_FILE_BUFFER, and a read delegation if we can get CEPH_CAP_FILE_CACHE.
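The opportunistic grant rule described above (read delegation only when nobody has the file open for write, write delegation only when nobody else has it open at all) could be sketched as follows. The types and the function name are hypothetical, and the sketch tracks open counts rather than real caps.

```c
#include <assert.h>

/* Hypothetical types for illustration only. */
enum ex_deleg_type { EX_DELEG_NONE, EX_DELEG_READ, EX_DELEG_WRITE };

/* Opens held by anyone other than the requester. */
struct ex_open_counts {
    unsigned readers;
    unsigned writers;
};

/* Returns 1 if a delegation of type 'want' may be handed out, else 0. */
int ex_may_grant(enum ex_deleg_type want, const struct ex_open_counts *others)
{
    switch (want) {
    case EX_DELEG_READ:
        /* Other readers are fine; any writer is a conflict. */
        return others->writers == 0;
    case EX_DELEG_WRITE:
        /* Any other open at all is a conflict. */
        return others->writers == 0 && others->readers == 0;
    default:
        return 0;
    }
}
```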

BTW: CEPH_CAP_FILE_BUFFER does also imply CEPH_CAP_FILE_CACHE, doesn't it?

#10 Updated by Greg Farnum almost 2 years ago

BTW: CEPH_CAP_FILE_BUFFER does also imply CEPH_CAP_FILE_CACHE, doesn't it?

No, I don't think it does. In practice getting BUFFER without CACHE is pretty unlikely and there may be some hook in the code that prevents it going out independently, but I don't think that's part of the wire protocol.

In fact if you have client.A doing something like opening a file for write while you have a conflicting reader client.B who goes away, I think you'd start out with B=Fr,A=Fw; then get granted A=Fwb when B goes away; and then move to A=Fw,C=Fr when client.C comes along to do some reading.

On a different subject, while we definitely do need to do server-side client eviction, I think we also want the client library to do some cap return policing of its own when you're using these cap extension interfaces (do we have a good name for them yet?). That way the client could disallow only certain misbehaving file handles instead of getting the whole system booted.

#11 Updated by Jeff Layton almost 2 years ago

Greg Farnum wrote:

BTW: CEPH_CAP_FILE_BUFFER does also imply CEPH_CAP_FILE_CACHE, doesn't it?

No, I don't think it does. In practice getting BUFFER without CACHE is pretty unlikely and there may be some hook in the code that prevents it going out independently, but I don't think that's part of the wire protocol.

In fact if you have client.A doing something like opening a file for write while you have a conflicting reader client.B who goes away, I think you'd start out with B=Fr,A=Fw; then get granted A=Fwb when B goes away; and then move to A=Fw,C=Fr when client.C comes along to do some reading.

Ok, I probably didn't phrase the question right...

I know that Fb means that you can buffer writes and Fc means you can cache reads, but logically, if you have Fb caps then you should also be able to cache reads?

On a different subject, while we definitely do need to do server-side client eviction, I think we also want the client library to do some cap return policing of its own when you're using these cap extension interfaces (do we have a good name for them yet?). That way the client could disallow only certain misbehaving file handles instead of getting the whole system booted.

Maybe. I'm not sure how well that will work in practice:

Suppose we have given an application (e.g. ganesha) a delegation. Ceph issues the callback to the application, but it never returns it. What do we do at that point? We could shut down the Fh such that any attempt to use it gives you a -EBADF, but realistically the application is going to be sort of hosed at that point anyway.

The way I see it is that when the application doesn't return the delegation within the expected amount of time, then it has violated the API "contract", and shouldn't be allowed to do anything further without some sort of administrative intervention.

I think it may be best to just evict the client since it's clearly not behaving correctly.

#12 Updated by Greg Farnum almost 2 years ago

I guess I'm not sure what you're going for with the Fb versus Fc here. Sure, if you have Fwb and then get an Fr read capability, I'd expect you to get Fc along with it since obviously you've got a certain level of exclusivity going on. But again, I don't think the protocol promises that behavior.

In terms of the client, I think having it do enforcement of cap timeouts means you can fail more gracefully. If it hits a timeout, the Client can tell the MDS it's quitting and just start returning EIO (or something) on all calls; that's not nicer to the local storage daemon but it's a lot better for the cluster as a whole.

If we were ambitious we could set up recovery interfaces, so that yes — it starts returning EIO or EBADF on the files which failed to return caps quickly enough, but lets others continue and allows a "refresh" on those which were marked bad. But I agree that'd be a lot of work for a situation we oughtn't run into (presumably we can trust the storage daemon plugging into these interfaces).

#13 Updated by Jeff Layton almost 2 years ago

Greg Farnum wrote:

I guess I'm not sure what you're going for with the Fb versus Fc here. Sure, if you have Fwb and then get an Fr read capability, I'd expect you to get Fc along with it since obviously you've got a certain level of exclusivity going on. But again, I don't think the protocol promises that behavior.

I guess my point is that Frwb is effectively equivalent to Frwbc. If you have the ability to buffer writes, then you effectively already have the ability to cache reads as well. I don't see how you could grant the ability to buffer writes but not allow the client to perform reads from that cache.

In terms of the client, I think having it do enforcement of cap timeouts means you can fail more gracefully. If it hits a timeout, the Client can tell the MDS it's quitting and just start returning EIO (or something) on all calls; that's not nicer to the local storage daemon but it's a lot better for the cluster as a whole.

If we were ambitious we could set up recovery interfaces, so that yes — it starts returning EIO or EBADF on the files which failed to return caps quickly enough, but lets others continue and allows a "refresh" on those which were marked bad. But I agree that'd be a lot of work for a situation we oughtn't run into (presumably we can trust the storage daemon plugging into these interfaces).

It should only happen because of a bug in the program, realistically.

Yeah, ok...you do have a good point there. Having to ask the cluster admin to unblacklist your client because your application had a bug in it is burdensome.

Hmmm... maybe it would be best to just have the client abort() when this occurs? If your program crashes, tough break -- fix your bug. Could even allow some mechanism to override that as well (config file option maybe) and just let the cluster evict the client when it occurs.

#14 Updated by Jeff Layton over 1 year ago

I have a couple of patches to start implementing this, but I've not had the time to really do a good job of it. The patches are in the "deleg" branch on my ceph git tree:

https://github.com/jtlayton/ceph/commits/deleg

This is all client-side code. It's still very skeletal and not at all tested. It adds a basic "Delegation" object and has some new interfaces to request and return a delegation. If the client has the right caps, then it will grant the delegation, if not, then you don't get one.

There's still quite a bit of work to be done:

  • probably ought not pass Delegation pointers to the client. Should we hash them and hand the client opaque tokens? That would make it harder for the application to screw things up.
  • when we get a cap recall from the MDS, we need to scan the list of delegations attached to the inode, and issue the callback for each. Probably that should be done asynchronously in some other thread context so as not to squat on the client mutex for too long. Callbacks into applications are a bit of a box of chocolates...
  • what I have so far is for cephfs, but I do wonder if we might be able to add something similar for RGW? Delegations would likely help there as well.
  • once we have a sane interface for cephfs (and maybe RGW) we'll need to teach ganesha how to use it. Last I looked, ganesha's delegation infrastructure could use some love.
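As a rough illustration of the second item in the list above (scanning the delegations attached to an inode on a cap recall and issuing the callback for each), here is a single-threaded sketch. The struct and function names are hypothetical and are not taken from the "deleg" branch; the sketch also leaves out the asynchronous thread context and locking that the real code would need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the Delegation object from the patches. */
typedef void (*toy_deleg_cb)(void *priv);

struct toy_deleg {
    toy_deleg_cb cb;         /* application callback */
    void *priv;              /* opaque argument for the callback */
    int recalled;            /* set once the callback has been issued */
    struct toy_deleg *next;  /* next delegation attached to the same inode */
};

/*
 * Walk the inode's delegation list and notify every holder that has not
 * been notified yet. Returns the number of callbacks issued, so a repeated
 * recall does not call the application a second time.
 */
int toy_recall_all(struct toy_deleg *head)
{
    int issued = 0;

    for (struct toy_deleg *d = head; d != NULL; d = d->next) {
        if (d->recalled)
            continue;
        d->recalled = 1;
        d->cb(d->priv);
        issued++;
    }
    return issued;
}

/* Example callback: counts how many notifications were delivered. */
void toy_bump(void *priv)
{
    ++*(int *)priv;
}
```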

#15 Updated by Jeff Layton over 1 year ago

Brief writeup of one way to implement this.

#16 Updated by Jeff Layton over 1 year ago

I've been working on this for the last week or so, so this is a good place to pause and provide an update:

I have a rough draft of this done that does the basic functionality. You can get a r/o or r/w delegation and the appropriate conflicting open access will cause it to be recalled. For local access, the "breaker" waits on the delegation to be returned before proceeding. It works when the conflicting access is via the same client or a different client. I have a testcase that does the same tests in both configurations.

The main work to be done at this point is handling clients that don't return the delegation in a timely fashion. I think I'll probably just create a new sort of SafeTimer Context object and use that to arm a timer that will run a callback if the delegation hasn't been returned.
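The timer logic can be sketched independently of SafeTimer. The following is only a model of the deadline check, using plain second counters that are assumed to be monotonic; the struct and function names are hypothetical, and what to do once the deadline passes is deliberately left to the caller.

```c
#include <assert.h>

/* Hypothetical model of a recall deadline; not the SafeTimer API. */
struct ex_recall_timer {
    int armed;                /* 1 after the recall callback was issued */
    unsigned long start_s;    /* when the recall was issued, in seconds */
    unsigned long timeout_s;  /* how long the holder gets to return it */
};

/* Arm the timer when the recall callback is sent to the application. */
void ex_recall_timer_arm(struct ex_recall_timer *t,
                         unsigned long now_s, unsigned long timeout_s)
{
    t->armed = 1;
    t->start_s = now_s;
    t->timeout_s = timeout_s;
}

/* Cancel the timer when the delegation is returned in time. */
void ex_recall_timer_cancel(struct ex_recall_timer *t)
{
    t->armed = 0;
}

/*
 * Returns 1 if the delegation is overdue. Assumes now_s never goes
 * backwards relative to start_s (a monotonic clock).
 */
int ex_recall_timer_expired(const struct ex_recall_timer *t,
                            unsigned long now_s)
{
    return t->armed && (now_s - t->start_s) >= t->timeout_s;
}
```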

The big question is what to do in that event. My initial thinking was to just SIGABRT (or maybe just call abort()), but that's potentially rather nasty. Killing the client means that its caps will have to time out. It'd be nicer to allow the client to return everything and shut down cleanly to avoid that. Other options:

  • we could shut down the whole mount -- do an immediate ceph_unmount. I think we need to do some cleanup in this area anyway. Most of the wrappers in libcephfs.cc call cmount->is_mounted and return an immediate error if not. Many of the ceph_ll_* ops are missing those checks though. (On a related note, those checks are done outside the client_lock so they are also racy. I think we need to move them inside the mutex regardless.)
  • we could shut down the Fh on which the delegation was acquired, and ensure that any operation that involves it later gets back -EBADF or something. Client would need to close and reopen the Fh to get access again.
  • silently drop the delegation unconditionally. It has been warned, after all, and applications ignore that warning at their own (data corruption) peril

I'm currently weighing these options, and scoping out what we'd need to do to implement each of them.

#17 Updated by Patrick Donnelly over 1 year ago

Jeff Layton wrote:

The main work to be done at this point is handling clients that don't return the delegation in a timely fashion.

Here "client" means Ganesha. How does Ganesha handle its own client not releasing a delegation? Or are we just talking about our response to that failure trickling down to the FSAL?

... I think I'll probably just create a new sort of SafeTimer Context object and use that to arm a timer that will run a callback if the delegation hasn't been returned.

The big question is what to do in that event. My initial thinking was to just SIGABRT (or maybe just call abort()), but that's potentially rather nasty. Killing the client means that its caps will have to time out. It'd be nicer to allow the client to return everything and shut down cleanly to avoid that. Other options:

  • we could shut down the whole mount -- do an immediate ceph_unmount. I think we need to do some cleanup in this area anyway. Most of the wrappers in libcephfs.cc call cmount->is_mounted and return an immediate error if not. Many of the ceph_ll_* ops are missing those checks though. (On a related note, those checks are done outside the client_lock so they are also racy. I think we need to move them inside the mutex regardless.)

Please fork an issue for those checks being outside the client_lock so we don't forget.

  • we could shut down the Fh on which the delegation was acquired, and ensure that any operation that involves it later gets back -EBADF or something. Client would need to close and reopen the Fh to get access again.

I don't really like this solution. I'm in favor of all-or-nothing.

  • silently drop the delegation unconditionally. It has been warned, after all, and applications ignore that warning at their own (data corruption) peril

I'm in favor of unmounting everything and doing a clean (as possible) shutdown.

#18 Updated by Jeff Layton over 1 year ago

Patrick Donnelly wrote:

Here "client" means Ganesha. How does Ganesha handle its own client not releasing a delegation? Or are we just talking about our response to that failure trickling down to the FSAL?

Right. Ganesha is the "client" from the ceph standpoint.

Ganesha should also handle NFS clients that fail to return delegations by basically evicting them -- invalidating some or all of their state and letting them know they are in violation when they try to renew their lease. In order to handle this right, we'll need the ceph lease timeout to be longer than the one ganesha will use. Maybe we can add a programmatic interface to allow setting that timer (and maybe to allow overriding the function that gets called when it pops?).

(On a related note, those checks are done outside the client_lock so they are also racy. I think we need to move them inside the mutex regardless.)

Please fork an issue for those checks being outside the client_lock so we don't forget.

Will do. It's actually a bit more complicated. I think we need the existing checks in order to ensure that the ceph_mount_info->client pointer is valid, but I think we also need to check that the client is in a mounted state after we take the mutex. I have a patch that does this, but it needs more testing.

I don't really like this solution. I'm in favor of all-or-nothing.

Fair enough. That way will make it more evident when you're in violation.

I think this is reasonably easy to implement too.

#19 Updated by John Spray over 1 year ago

For the clean-ish shutdown case, it would be neat to have a common code path with the -EBLACKLISTED handling (see Client::blacklisted).

I'm not sure how the MDS behaves if a client tries to end its session while it still has some requests waiting (e.g. for locks) -- if existing code paths don't handle that, I'd be inclined to just add a "kill me now!" session code for MClientSession and have the MDS blacklist+evict any client that asks for that.

#20 Updated by Jeff Layton over 1 year ago

The latest set has timeout support that basically does a client->unmount() on the thing. With the patches for this bug, that seems to be enough to cause the application to get an error back on any subsequent access:

http://tracker.ceph.com/issues/21025

I'm open to having the client perform other behaviors here, but this seems like it should do the right thing, and doesn't require any protocol or server-side changes.

#21 Updated by Jeff Layton over 1 year ago

I made some progress today. I got ganesha over ceph to hand out a read delegation. Once I tried to force a recall (by writing to the file from another ceph client), ganesha crashed due to some internal NFSv4 state handling problem. I'm still looking over that piece.

#22 Updated by Jeff Layton over 1 year ago

I was able to get ganesha to hand out a v4.0 delegation today and recall it properly. So, PoC is successful!
There still remains quite a bit of work though:

  • CB_RECALL was never implemented for v4.1. A lot of the machinery we need for them is already there to support CB_LAYOUTRECALL. I'm working on refactoring that code to support both callback types. The ganesha callback code also needs a bit of thread-safety work as it's possible for channel teardowns to race in while we're doing a callback.
  • general cleanup of both the ganesha and ceph series. The patches are pretty ad-hoc today.
  • once we get all of that in place, we should also wire up samba oplock and lease support. That should be fairly simple to do as well.
  • also need to consider write delegations. I think it should be possible, but I'm not yet certain we can make the necessary guarantees in recovery situations.

#23 Updated by Jeff Layton over 1 year ago

Here's a capture showing the delegation grant and recall (what can I say, I'm a proud parent). The delegation was revoked in this case due to me running "echo foo > foo" via a ceph-fuse mount. The open call was blocked until the delegation was returned by the NFS client.

#24 Updated by Jeff Layton over 1 year ago

Patrick, Greg and others have been kind enough to give me some good review so far, so I've been working to address those comments. One thing I've noticed though is a subtle difference in how I assumed cap revokes worked and how they actually do.

It turns out that open calls are generally not blocked by a different client holding conflicting caps. In that situation, the MDS will start recalling those caps from the client but it then goes ahead and responds to the open. It will only block when there is actual conflicting access for those caps.

This is problematic for delegation/oplock support for several reasons. What I think we may need to add is some way for the ceph client to request that, when there are conflicting caps held by another client, we don't allow the open to succeed and instead return some retryable error (-EAGAIN maybe). Then ganesha could look for that and return NFS4ERR_DELAY back to the NFS client, so that it can redrive the OPEN from scratch.
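The error translation on the ganesha side could look roughly like this. The function name is hypothetical and this is not ganesha's actual error-mapping code; the status values are the NFSv4 codes from RFC 7530, and collapsing every other failure to NFS4ERR_IO is a simplification made only for the sketch.

```c
#include <assert.h>
#include <errno.h>

/* NFSv4 status codes (values per RFC 7530), prefixed to avoid clashes. */
enum {
    EX_NFS4_OK       = 0,
    EX_NFS4ERR_IO    = 5,
    EX_NFS4ERR_DELAY = 10008
};

/*
 * Hypothetical mapping from the return code of an open call to an NFSv4
 * status. A retryable -EAGAIN becomes NFS4ERR_DELAY so that the NFS client
 * redrives the OPEN instead of a server thread stalling on it.
 */
int ex_open_rc_to_nfs4(int rc)
{
    if (rc >= 0)
        return EX_NFS4_OK;
    if (rc == -EAGAIN)
        return EX_NFS4ERR_DELAY;
    return EX_NFS4ERR_IO;  /* simplification: all other errors */
}
```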

Unfortunately, that means a protocol change (though it may be as simple as a new CEPH_O_DELEG_SYNC flag for the open routines). I also need some way to answer the question in the MDS:

"Given a CInode, do any other clients hold caps that would conflict with the ones that this open requires?"

I think that's CInode::get_caps_issued and pass it the cap mask from ceph_caps_for_mode ?
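The shape of that question can be sketched without touching the MDS code. In the sketch below the cap bits are placeholders and not the real CEPH_CAP_* values, the per-client array stands in for whatever the CInode tracks, and the function name is hypothetical; it only shows the "any other client holds a bit in the conflict mask" test.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder bits for illustration; NOT the real CEPH_CAP_* values. */
#define EX_CAP_RD     0x1u
#define EX_CAP_WR     0x2u
#define EX_CAP_CACHE  0x4u
#define EX_CAP_BUFFER 0x8u

/*
 * issued[i] holds the caps currently issued to client i. Returns 1 if any
 * client other than 'self' holds at least one cap in 'conflict_mask'.
 */
int ex_others_hold_conflicting(const unsigned *issued, size_t nclients,
                               size_t self, unsigned conflict_mask)
{
    for (size_t i = 0; i < nclients; i++) {
        if (i != self && (issued[i] & conflict_mask) != 0)
            return 1;
    }
    return 0;
}
```

Which caps belong in the conflict mask for a given open mode is exactly the open question in the comment above, so the caller supplies the mask here.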

#25 Updated by Patrick Donnelly over 1 year ago

Jeff Layton wrote:

Can you list them? Is it just we want Ganesha to return NFS4ERR_DELAY?

Yes, that's the big one. When we can't satisfy a call right away, it's generally better to let the client re-drive it rather than stalling a thread on the server. We'll probably eventually want to do the same thing for some namespace ops too -- renames, unlinks, etc.

Also share/deny modes would be a bit hard to handle too.

I also worry that we'll end up with cache coherency problems but I haven't yet crafted a scenario where this would truly break anything.

Perhaps a solution is to have Ganesha do an fstat() after open?

I don't think that'll work, as I don't think an fstat would conflict with a read delegation. They both only deal with shared caps.

Also, I don't think we are interested in giving the client extra caps if there is conflicting access out there. It's possible for ganesha to open a file and never do a bit of I/O to it. I think we need a way for the client to ask the MDS to ensure that no other client holds conflicting caps, and to just return -EAGAIN or whatever until that is the case.

#26 Updated by Jeff Layton about 1 year ago

  • Status changed from New to Resolved

Patches merged into both ceph and ganesha for this.

#27 Updated by Patrick Donnelly about 1 year ago

  • Subject changed from implement delegation support in userland cephfs to client: implement delegation support in userland cephfs
  • Status changed from Resolved to Pending Backport
  • Backport set to luminous

Thanks for remembering to update this ticket Jeff. We need to backport this for Luminous as this is needed for 3.0.

Merged PR: https://github.com/ceph/ceph/pull/18274

#28 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #22407: luminous: client: implement delegation support in userland cephfs added

#29 Updated by Nathan Cutler 12 months ago

  • Status changed from Pending Backport to Resolved
