Feature #9755

Fence late clients during reconnect timeout

Added by John Spray over 9 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

During reconnect, the MDS terminates the sessions of any clients that fail to reconnect within the window. Because terminating a client session can hand off capabilities that client held to other clients, the client whose session was terminated must be fenced. The MDS should fence clients that fail to reconnect, using the mechanism from #9754.
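
For illustration, a minimal sketch of the intended flow (in Python purely to keep it short; the real logic lives in the C++ MDS, and names such as blacklist_client_addr and kill_session are hypothetical here):

    # Sketch only: hypothetical helpers, not the actual MDS code paths.
    import time

    RECONNECT_TIMEOUT = 60  # seconds; the default window mentioned below

    def handle_reconnect_phase(sessions, blacklist_client_addr, kill_session):
        """Fence and then evict every session that missed the reconnect window."""
        deadline = time.time() + RECONNECT_TIMEOUT
        while time.time() < deadline and any(not s.reconnected for s in sessions):
            time.sleep(1)  # wait for reconnect messages to arrive

        for s in sessions:
            if not s.reconnected:
                # Fence first (OSD blacklist, i.e. the #9754 mechanism) so a
                # stale client cannot keep writing under capabilities that are
                # about to be handed to someone else, then drop the session.
                blacklist_client_addr(s.client_addr)
                kill_session(s)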


Related issues

Related to CephFS - Feature #9940: uclient: be more robust when dealing with outstanding RADOS IO and stale caps (New, 10/29/2014)

History

#1 Updated by Greg Farnum over 9 years ago

Hmm, I like the basic thrust of this, but I'm a little concerned as well — we have other tickets to let clients reconnect outside of the window. (I hope the reasons for desiring that are obvious — forcing disconnects without allowing some kind of reconnect is a heck of a user problem.) If we're blacklisting the clients this becomes a lot more difficult, because they won't be able to continue OSD ops without reconnecting and losing their identity.

#2 Updated by John Spray over 9 years ago

There are certain cases where a client can safely reconnect after being evicted, e.g. if:

  • the client didn't hold any write capabilities: it can safely be permitted to rejoin the filesystem, although in the interim it might have shown bogus data to userspace if it thought it had a valid read cache of something that we had actually granted elsewhere.
  • the caps held by this client were not in demand by anyone else, so we were able to hold onto them and subsequently re-issue them

Those are special cases though: the general case is that clients who miss the reconnect window must be fenced. The default reconnect timeout in the MDS is rather low (60s iirc), and we may want to revise that upwards when putting a more aggressive client eviction in place.
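
To make that distinction concrete, a rough policy sketch (fields such as cap.is_writable and cap.claimed_by_other_client are hypothetical, not the real MDS data structures):

    def may_rejoin_without_fencing(session):
        # Sketch of the two special cases above; everything else gets fenced.
        held_write_caps = any(cap.is_writable for cap in session.caps)
        reissued_elsewhere = any(cap.claimed_by_other_client for cap in session.caps)

        if not held_write_caps:
            # Read-only client: safe to let back in, though it may have served
            # stale reads from its cache in the interim.
            return True
        if not reissued_elsewhere:
            # Nobody else wanted its caps, so they were retained and can simply
            # be re-issued to the returning client.
            return True
        # General case: the client missed the window and must be fenced.
        return False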

There is also the question of how a client should respond to being blacklisted: to help users out, we might want a mechanism for client mounts to discard their existing global ID and acquire a new one when they are blacklisted, and then continue to operate normally after cancelling all buffers (effectively as if they had been remounted).

#3 Updated by Greg Farnum over 8 years ago

  • Assignee set to John Spray

Didn't this get done when the epoch barrier stuff did? (If not, please unassign.)

#4 Updated by John Spray over 8 years ago

  • Assignee deleted (John Spray)

Nope -- the machinery went in to barrier on OSD epoch after blacklisting a client, but the actual act of blacklisting remains the user's problem.

Funnily enough, I was just thinking about where to implement the CLI piece for doing an all-MDS client eviction. It's not an especially complicated series of 'tell's and mon commands, and it could be implemented as a piece of python code, although I'm not sure we necessarily want to burden the existing ceph.in code with such logic. It might make sense to have a client management tool of some kind that knows how to translate user requests about clients into the appropriate commands to multiple MDSs, so that users could also e.g. see the stats for a given client across MDSs, which MDSs it has a session on, etc.
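
For what it's worth, a rough Python sketch of what such a helper might look like (the exact 'tell' subcommands, the mds dump output fields, and the session-listing format are assumptions here, not a confirmed interface):

    #!/usr/bin/env python
    # Sketch of an all-MDS eviction helper: evict the client's session on every
    # active MDS, then blacklist its address on the OSDs via a mon command.
    import json
    import subprocess

    def ceph(*args):
        return subprocess.check_output(["ceph"] + list(args))

    def evict_client_everywhere(client_id):
        # Ask the mons which MDS daemons are active (field names assumed).
        mdsmap = json.loads(ceph("mds", "dump", "--format=json"))
        active = [i["name"] for i in mdsmap.get("info", {}).values()]

        addr = None
        for name in active:
            # Hypothetical per-MDS commands: list sessions, evict the match.
            sessions = json.loads(ceph("tell", "mds.%s" % name, "session", "ls"))
            for s in sessions:
                if s.get("id") == client_id:
                    addr = s["inst"].split()[-1]
                    ceph("tell", "mds.%s" % name, "session", "evict", str(client_id))

        if addr is not None:
            # Fence the client on the OSDs so it cannot complete in-flight writes.
            ceph("osd", "blacklist", "add", addr)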

#5 Updated by Greg Farnum over 7 years ago

  • Category set to Correctness/Safety

#6 Updated by Patrick Donnelly over 5 years ago

  • Status changed from New to Resolved

This has been corrected but this issue was never closed.
