Bug #9216: mds may regard active clients as stale due to slow pg recovery - CephFS - Ceph

Added by Alexandre Oliva over 9 years ago. Updated almost 8 years ago.

Status: New
Priority: Low
Category: Correctness/Safety
Source: other
Severity: 5 - suggestion
Component(FS): Common/Protocol, MDS

Description

I occasionally get fuse and ceph.ko mounts into weird states, and I can generally trace them to the mds deciding that those clients were stale even though they were not. Most often, the mds crashes shortly afterwards, and the stale-but-not-really clients sometimes succeed in reconnecting before the reconnect window closes, which makes them all right most of the time. However, I recently observed a situation in which the mds survived, and the ceph.ko clients then attempted to reconnect and were denied because the mds was already active.

Anyway, the primary cause of all this pain appears to be the slow recovery of metadata PGs after an osd times out or some such, and more importantly the fact that the mds does not appear to take into account pending messages and its own stuck-waiting-for-PGs status before regarding a client session as stale. I think the mds should extend the stale-session time-out counter when it is itself laggy or failing to journal any progress.
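To make the suggestion concrete, here is a toy model of a stale-session check that discounts the time the server itself spent stuck. It is only a sketch and not Ceph code: the class `StaleTracker`, its methods, and the idea of marking stalls with explicit `stall_begin`/`stall_end` calls are all hypothetical names and assumptions made for illustration.

```python
class StaleTracker:
    """Toy stale-session check (hypothetical; not the actual mds logic)."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout  # seconds of silence allowed
        self.last_seen = {}            # client id -> time of last renewal
        self.stalled_at_renewal = {}   # client id -> server stall total at that renewal
        self.stalled_total = 0.0       # cumulative seconds the server was stuck
        self.stall_started = None      # start time of an ongoing stall, if any

    def _stalled_until(self, now):
        # Completed stalls plus the portion of any stall still in progress.
        ongoing = 0.0 if self.stall_started is None else now - self.stall_started
        return self.stalled_total + ongoing

    def stall_begin(self, now):
        # Assumed hook: called when the server notices it is laggy or
        # cannot make journal progress (e.g. waiting on PGs).
        if self.stall_started is None:
            self.stall_started = now

    def stall_end(self, now):
        if self.stall_started is not None:
            self.stalled_total += now - self.stall_started
            self.stall_started = None

    def renew(self, client, now):
        self.last_seen[client] = now
        self.stalled_at_renewal[client] = self._stalled_until(now)

    def is_stale(self, client, now):
        silence = now - self.last_seen[client]
        # Time the server was stuck since this client's last renewal does
        # not count against the client.
        excused = self._stalled_until(now) - self.stalled_at_renewal[client]
        return silence - excused > self.session_timeout


tracker = StaleTracker(session_timeout=60)
tracker.renew("client-a", now=0)
tracker.stall_begin(now=10)   # server starts waiting on slow PG recovery
tracker.stall_end(now=100)    # recovery finishes 90 seconds later
print(tracker.is_stale("client-a", now=110))
```

In this model the client has been silent for 110 seconds, but 90 of those are attributed to the server's own stall, so only 20 seconds count against the 60-second timeout. Without the stall bookkeeping, the same client would be declared stale.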

Updated by Greg Farnum over 9 years ago

Updated by Alexandre Oliva over 9 years ago

Updated by Zheng Yan over 9 years ago

Updated by Greg Farnum almost 8 years ago