Bug #38326: mds: evict stale client when one of its write caps are stolen - CephFS - Ceph

Bug #38326

closed

mds: evict stale client when one of its write caps are stolen

Added by Patrick Donnelly about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: Zheng Yan
Category: Correctness/Safety
Target version: Ceph - v15.0.0
% Done:
Source: Development
Tags:
Backport: nautilus,mimic
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): task(hard)
Pull request ID: 26737
Crash signature (v1):
Crash signature (v2):

Description

IIUC: After mdsmap.session_timeout, the current behavior is that the issued set of each of a stale session's caps is revoked and reduced to CEPH_CAP_PIN. The MDS allows that stale session to come back later and "resume" by updating its cap.want set, which causes it to obtain new caps in the normal fashion.
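
The current behavior described above can be sketched as a toy model. This is only an illustration, not MDS code: the `Cap` flags, the `Session` class, the `tick` and `resume` helpers, and the 60-second `SESSION_TIMEOUT` are all hypothetical names and values chosen for the example, and the bit values do not match Ceph's actual cap encoding.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Cap(Flag):
    # Toy capability bits (hypothetical; not Ceph's real encoding).
    PIN = auto()
    RD = auto()
    CACHE = auto()
    WR = auto()
    BUFFER = auto()


SESSION_TIMEOUT = 60.0  # assumption: timeout in seconds


@dataclass
class Session:
    name: str
    issued: Cap = Cap.PIN
    wanted: Cap = Cap.PIN
    last_seen: float = 0.0
    stale: bool = False


def tick(session: Session, now: float) -> None:
    """Mark an unresponsive session stale and reduce its issued set to PIN."""
    if not session.stale and now - session.last_seen > SESSION_TIMEOUT:
        session.stale = True
        session.issued = Cap.PIN


def resume(session: Session, wanted: Cap, now: float) -> None:
    """A stale session comes back and updates its wanted set.

    The toy model grants whatever is wanted; a real MDS would run its
    normal lock and conflict checks before issuing anything.
    """
    session.stale = False
    session.last_seen = now
    session.wanted = wanted
    session.issued = wanted | Cap.PIN
```

Nothing in this model stops a second client from obtaining write caps while the first session is stale, which is the gap this issue describes.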

One issue with this is that a client may be writing to a file and become unresponsive, after which another client successfully begins (buffered) writing to that file concurrently. The only correct thing to do when another client comes along wanting that write cap is to evict the unresponsive client. Eviction is absolutely necessary because (a) we don't know when or if the client is coming back and (b) the client may still be connected to RADOS and writing bytes to the file while unable to receive or process messages from the MDS.

I would tentatively propose the following new behavior for a stale session:

(a) Mark the session stale and check whether any locks are blocked by the newly "stale" caps. Ideally, we shouldn't invalidate a cap unnecessarily. (Why would we want to? The client would then need to get the cap reissued, which is expensive.)
(b) If a client comes along trying to obtain a conflicting WR/BUFFER/EXCL cap, evict the stale session immediately, wait for the osdmap update, then issue the cap.
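
A minimal sketch of the proposed policy in (a) and (b), under stated assumptions: `ToyMDS`, `Session`, `Cap`, `request_caps`, and the log strings are hypothetical names invented for this example, and the osdmap wait is represented only by a log entry. It is not the implementation in the pull request.

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class Cap(Flag):
    # Toy capability bits (hypothetical; not Ceph's real encoding).
    PIN = auto()
    RD = auto()
    WR = auto()
    BUFFER = auto()
    EXCL = auto()


# Caps that must not be held by two sessions on the same file.
CONFLICTING = Cap.WR | Cap.BUFFER | Cap.EXCL


@dataclass
class Session:
    name: str
    stale: bool = False
    evicted: bool = False
    caps: dict = field(default_factory=dict)  # path -> Cap


class ToyMDS:
    def __init__(self):
        self.sessions = {}
        self.log = []

    def open_session(self, name):
        self.sessions[name] = Session(name)

    def mark_stale(self, name):
        # (a) Mark the session stale but leave its caps in place.
        self.sessions[name].stale = True

    def request_caps(self, name, path, wanted):
        """Return True if the caps were issued, False if the caller must wait."""
        if wanted & CONFLICTING:
            for other in self.sessions.values():
                if other.name == name or other.evicted:
                    continue
                held = other.caps.get(path)
                if held is None or not (held & CONFLICTING):
                    continue
                if not other.stale:
                    # A live holder would be asked to release its caps; not modeled.
                    return False
                # (b) Evict the stale holder, then wait before issuing.
                other.evicted = True
                other.caps.clear()
                self.log.append(f"evict {other.name}")
                self.log.append("wait for osdmap update")
        session = self.sessions[name]
        session.caps[path] = session.caps.get(path, Cap.PIN) | wanted
        self.log.append(f"issue {name} {path}")
        return True
```

With two sessions, `client1` holding `Cap.WR | Cap.BUFFER` on `foo/bar` and then marked stale, a later `request_caps("client2", "foo/bar", Cap.WR)` evicts `client1` before issuing the cap, instead of letting both clients write.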

If a stale session comes back, we can reissue most caps it already had because no other session has stolen its write caps. An exception is CEPH_CAP_GCACHE which may have been lost by an intervening write by another client.
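
The reissue rule for a returning stale session could look roughly like the sketch below. It assumes, purely for illustration, that the MDS can compare a per-file version recorded when the session went stale with the current one to detect an intervening write; `Cap`, `caps_on_resume`, and both version parameters are hypothetical.

```python
from enum import Flag, auto


class Cap(Flag):
    # Toy capability bits (hypothetical; not Ceph's real encoding).
    PIN = auto()
    RD = auto()
    CACHE = auto()
    WR = auto()
    BUFFER = auto()


def caps_on_resume(held: Cap, version_when_stale: int, version_now: int) -> Cap:
    """Return the caps to reissue to a stale session that has come back.

    Everything previously held is reissued, except the cache cap when the
    file changed while the session was stale.
    """
    if version_now != version_when_stale:
        return held & ~Cap.CACHE
    return held
```

For example, a session that held `Cap.PIN | Cap.RD | Cap.CACHE` gets all three back if the file is unchanged, and only `Cap.PIN | Cap.RD` if another client wrote to it in the meantime.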

Simple reproducer of the original problem with two clients:

[client 1] mkdir foo && pv -L 1K < /dev/urandom > foo/bar
kill -STOP <client1>
[client 2] pv -L 1K < /dev/urandom > foo/bar # client 2 blocks for 60s then gets write caps!
kill -CONT <client1> # both writes continue without buffer cap

Related issues 2 (0 open — 2 closed)

Updated by Patrick Donnelly about 5 years ago

Description updated (diff)

Updated by Zheng Yan about 5 years ago

Status changed from 12 to Fix Under Review
Pull request ID set to 26737

Updated by Patrick Donnelly about 5 years ago

Target version changed from v14.0.0 to v15.0.0

Updated by Patrick Donnelly about 5 years ago

Subject changed from mds: evict stale client when one of its write caps are stole to mds: evict stale client when one of its write caps are stolen
Priority changed from Normal to Urgent
Backport set to nautilus,mimic

Updated by Patrick Donnelly almost 5 years ago

Status changed from Fix Under Review to Pending Backport

Zheng, any issues backporting this?

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #40326: nautilus: mds: evict stale client when one of its write caps are stolen added

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #40327: mimic: mds: evict stale client when one of its write caps are stolen added

Updated by Zheng Yan almost 5 years ago

Status changed from Pending Backport to Fix Under Review

Incremental patches: https://github.com/ceph/ceph/pull/28642

Updated by Patrick Donnelly almost 5 years ago

Status changed from Fix Under Review to Pending Backport

Updated by Nathan Cutler over 4 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
