Feature #9755
closed
Fence late clients during reconnect timeout
0%
Description
During reconnect, MDSs terminate the sessions of any clients that fail to reconnect within the window. Because terminating a client session can hand the capabilities it held to other clients, we must fence the client whose session we terminated. MDSs should fence clients that fail to reconnect, using the mechanism from #9754.
Updated by Greg Farnum over 9 years ago
Hmm, I like the basic thrust of this, but I'm a little concerned as well — we have other tickets to let clients reconnect outside of the window. (I hope the reasons for desiring that are obvious — forcing disconnects without allowing some kind of reconnect is a heck of a user problem.) If we're blacklisting the clients this becomes a lot more difficult, because they won't be able to continue OSD ops without reconnecting and losing their identity.
Updated by John Spray over 9 years ago
There can be certain cases where a client can reconnect after being evicted, e.g. if:
- the client didn't hold any write capabilities: it can safely be permitted to rejoin the filesystem, although in the interim it might have shown bogus data to userspace if it thought it had a valid read cache of something that we had actually granted elsewhere.
- the caps held by this client were not in demand by anyone else, so we were able to hold onto them and subsequently re-issue them
Those are special cases though: the general case is that clients who miss the reconnect window must be fenced. The default reconnect timeout in the MDS is rather low (60s iirc), and we may want to revise that upwards when putting a more aggressive client eviction in place.
There is also the question of how a client should respond to being blacklisted: to help users out, we might want a mechanism for client mounts to discard their existing global ID and acquire a new one when they are blacklisted, and then continue to operate normally after cancelling all buffers (effectively as if they had been remounted).
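The rejoin conditions in this comment can be sketched as a small decision function. This is only an illustration of the reasoning, not MDS code: the `LateClient` type, its fields, and `must_fence` are hypothetical names, and a real MDS would derive this information from its session and capability state.

```python
from dataclasses import dataclass


@dataclass
class LateClient:
    """Hypothetical summary of a client that missed the reconnect window."""
    client_id: int
    held_write_caps: bool        # did it hold any write capabilities?
    caps_handed_elsewhere: bool  # were any of its caps granted to other clients?


def must_fence(client: LateClient) -> bool:
    """Return True if the late client has to be fenced (the general case)."""
    # Special case 1: no write caps, so it cannot have dirty data in flight.
    # It may still have shown stale reads to userspace in the interim.
    if not client.held_write_caps:
        return False
    # Special case 2: nobody else wanted its caps, so they can be re-issued.
    if not client.caps_handed_elsewhere:
        return False
    # General case: it held write caps that have since gone to someone else.
    return True
```

Anything that does not match one of the two special cases falls through to fencing, which mirrors the "general case" wording above.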
Updated by Greg Farnum almost 9 years ago
- Assignee set to John Spray
Didn't this get done when the epoch barrier stuff did? (If not, please unassign.)
Updated by John Spray almost 9 years ago
- Assignee deleted (John Spray)
Nope -- the machinery went in to barrier on OSD epoch after blacklisting a client, but the actual act of blacklisting remains the user's problem.
Funnily enough I was just thinking about where to implement the CLI thing for doing an all-MDS client eviction. It's not an especially complicated series of 'tell's and mon commands that could be implemented as a piece of Python code, although I'm not sure we necessarily want to burden the existing ceph.in code with such logic. Might make sense to have a client management tool of some kind that knows how to relate user requests about clients to the appropriate commands to multiple MDSs, so that they could also e.g. see the stats for a given client from multiple MDSs, which MDSs it has a session on, etc.
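As a rough sketch of what such a tool might do for eviction, the function below only builds the list of commands and does not run them. The function name, its parameters, and the ordering (blacklist first, then evict on each MDS) are assumptions for illustration, and the exact command spellings are also assumed and differ between Ceph releases.

```python
from typing import List


def plan_eviction(client_id: int, client_addr: str,
                  mds_names: List[str]) -> List[List[str]]:
    """Build (but do not run) the commands for an all-MDS client eviction.

    Hypothetical helper: the command spellings below are assumptions and
    should be checked against the release in use.
    """
    # Mon command: blacklist the client's address so its OSD ops are refused.
    plan = [["ceph", "osd", "blacklist", "add", client_addr]]
    # One 'tell' per MDS, since the client may hold a session on several.
    for name in mds_names:
        plan.append(["ceph", "tell", f"mds.{name}",
                     "session", "evict", str(client_id)])
    return plan
```

Returning the plan as data rather than executing it keeps the sketch testable and leaves room for a dry-run mode, which seems useful for a command that cuts a client off from the cluster.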
Updated by Patrick Donnelly over 5 years ago
- Status changed from New to Resolved
This has been corrected but this issue was never closed.