Bug #24802

open

races with nfs-ganesha reboots and delegation handling

Added by Jeff Layton almost 6 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, Ganesha FSAL, libcephfs, mgr/nfs
Labels (FS): NFS-cluster, task(hard)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

So I've come up with a thought experiment that I think could be problematic for ganesha with delegations enabled. This scenario assumes current behavior where we do not "drain off" in-progress NFS RPCs before reporting that the grace period is being enforced:

Ganesha 1               Ganesha 2
---------               ---------
get delegation
                        block trying to get caps covered by delegation (either in client or on MDS)
crash and restart
                        start enforcing grace, set enforcing flag
startup proceeds
kill off old state
                        blocked operation proceeds since caps are now freed

The last step (the blocked operation proceeding) occurs while a client of server 1 still technically holds a delegation, which violates the basic guarantee a delegation is supposed to provide.
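To make the ordering problem concrete, here is a toy model of the timeline above. It is only a sketch: `ToyCluster` and its methods are made-up names for illustration, not Ceph or ganesha code, and it assumes that killing off the old state frees the caps without the NFS client having given up its delegation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCluster:
    """Hypothetical model of the timeline; not real Ceph or ganesha code."""
    client_holds_delegation: bool = False  # the NFS client's view
    caps_held_by: Optional[str] = None     # which ganesha's Ceph session holds the caps

    def grant_delegation(self, server: str) -> None:
        # Handing out a delegation means the server's Ceph session holds the caps.
        self.client_holds_delegation = True
        self.caps_held_by = server

    def kill_old_state(self, server: str) -> None:
        # Assumption: tearing down the restarted server's old session frees
        # its caps, but the NFS client still believes it holds the delegation.
        if self.caps_held_by == server:
            self.caps_held_by = None

    def blocked_op_can_proceed(self) -> bool:
        return self.caps_held_by is None


def run_timeline() -> bool:
    """Replay the steps and report whether the delegation guarantee is broken."""
    cluster = ToyCluster()
    cluster.grant_delegation("ganesha-1")
    assert not cluster.blocked_op_can_proceed()  # ganesha-2 blocks here
    # ganesha-1 crashes and restarts; ganesha-2 sets its enforcing flag.
    cluster.kill_old_state("ganesha-1")
    proceeds = cluster.blocked_op_can_proceed()
    # Conflict: the op on ganesha-2 runs while the delegation is still outstanding.
    return proceeds and cluster.client_holds_delegation
```

In this model `run_timeline()` reports a conflict, because nothing ties the release of the caps to the client actually losing its delegation.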

We could fix this by "draining off" in-progress RPCs before we report that we're enforcing, but that would just result in a deadlock in this situation: Ganesha 2 could not report that it is enforcing until its blocked operation completes, and that operation is waiting on caps held by Ganesha 1's old state. Ganesha 1 would be stuck at startup and never release its state. I think we probably do want to implement some sort of call draining like that, but we need to resolve the potential for deadlock here too.

What we'd like to happen here is for ganesha to return a retryable error (i.e. NFS4ERR_GRACE or NFS4ERR_DELAY) on operations that would otherwise block waiting for a delegation to be returned. The client can then wait a bit and retry the operation. We don't have support for this in Ceph though, and we'd need it.
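As a rough sketch of the behavior we'd want from the server side: the operation handler catches a "would block on caps" condition and turns it into a retryable status instead of sleeping. Everything here is hypothetical (`WouldBlockOnCaps`, `handle_op`, and the choice of GRACE during the grace period versus DELAY otherwise are assumptions for illustration); only the numeric status values come from the NFSv4 spec.

```python
import enum


class Nfs4Status(enum.IntEnum):
    # Numeric values as defined for NFSv4 in RFC 7530.
    NFS4_OK = 0
    NFS4ERR_DELAY = 10008
    NFS4ERR_GRACE = 10013


class WouldBlockOnCaps(Exception):
    """Hypothetical signal that the op would have to wait for conflicting caps."""


def handle_op(do_op, in_grace: bool) -> Nfs4Status:
    """Run an operation; map a would-block condition to a retryable error.

    Assumption: report GRACE while the grace period is in force and DELAY
    otherwise. Either way the NFS client is expected to back off and retry.
    """
    try:
        do_op()
    except WouldBlockOnCaps:
        return Nfs4Status.NFS4ERR_GRACE if in_grace else Nfs4Status.NFS4ERR_DELAY
    return Nfs4Status.NFS4_OK


def op_that_would_block():
    # Stand-in for an operation that conflicts with an outstanding delegation.
    raise WouldBlockOnCaps()
```

The important property is that the handler returns promptly in both cases, so an RPC never sits in progress while it waits on state owned by a server that is restarting.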

One idea: define a new set of operations (or set some field in the session) that says that we won't block for "too long" waiting on caps. If we can't get the caps, give up and carry on elsewhere. That would probably fix cases where the Ceph client is blocked on a CapGet, but the MDS can also end up blocked gathering caps. I'm not sure what we can do there.
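The "don't block for too long" idea could look something like the following on the client side. This is a minimal sketch using a plain condition variable, not the libcephfs API: `CapWaiter`, `release`, and `acquire` are invented names, and a single boolean stands in for the real cap state. It also says nothing about the case where the MDS itself is the one blocked gathering caps.

```python
import threading


class CapWaiter:
    """Hypothetical bounded wait for caps; not the libcephfs API."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._available = False  # stand-in for "the caps we need are free"

    def release(self) -> None:
        # Called when the conflicting holder gives the caps back.
        with self._cond:
            self._available = True
            self._cond.notify_all()

    def acquire(self, timeout: float) -> bool:
        """Wait up to `timeout` seconds for the caps.

        Returns True if the caps were obtained, False if we gave up. The
        caller can turn False into a retryable error instead of blocking.
        """
        with self._cond:
            got = self._cond.wait_for(lambda: self._available, timeout=timeout)
            if got:
                self._available = False  # the caps are now ours
            return got
```

A caller would try `acquire()` with a short timeout and, on `False`, give up and carry on elsewhere rather than leaving the RPC parked on the cap wait.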

Thoughts?
