Bug #9216

open

mds may regard active clients as stale due to slow pg recovery

Added by Alexandre Oliva over 9 years ago. Updated almost 8 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
Severity:
5 - suggestion
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Common/Protocol, MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I occasionally see fuse and ceph.ko mounts get into weird states, and I can generally trace them to the mds deciding that those clients were stale even though they were not. Most often the mds crashes shortly after that; sometimes the stale-but-not-really clients succeed in reconnecting before the reconnect window closes, which makes them all right most of the time. However, I recently observed a situation in which the mds survived, and the ceph.ko clients then attempted to reconnect and were denied because the mds was already active.

Anyway, the primary cause of all this pain appears to be the slow recovery of metadata PGs after an osd times out or some such, and more importantly the fact that the mds does not appear to take into account pending messages and its own stuck-waiting-for-PGs status before regarding a client session as stale. I think the mds should extend the stale-session time-out counter when it is itself laggy or failing to journal any progress.
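
The suggestion above can be illustrated with a minimal sketch. This is Python written for illustration only, not Ceph code: `Session`, `note_mds_stall`, `note_client_message`, `is_stale` and the 60-second `SESSION_TIMEOUT` are all hypothetical names and values. The idea it shows is that any interval during which the mds itself could not make progress is credited back to every session before the staleness check.

```python
from dataclasses import dataclass

# Assumed stale threshold in seconds; not taken from any real configuration.
SESSION_TIMEOUT = 60.0


@dataclass
class Session:
    last_heard: float              # time the last client message was processed
    grace_extension: float = 0.0   # time the mds itself was stalled since then


def note_mds_stall(sessions, stalled_for):
    """Credit every session with the time the mds could not make progress,
    for example while it was laggy or waiting on metadata PGs."""
    for session in sessions:
        session.grace_extension += stalled_for


def note_client_message(session, now):
    """Record that a client message was actually processed at time `now`."""
    session.last_heard = now
    session.grace_extension = 0.0


def is_stale(session, now, timeout=SESSION_TIMEOUT):
    """A session is stale only if it has been silent for longer than the
    timeout plus whatever time the mds itself was unable to listen."""
    return now - session.last_heard > timeout + session.grace_extension
```

With this accounting, a client that kept sending messages while the mds was stuck is not penalized for the stall, whereas a client that stays silent after the mds recovers still times out once the extended deadline passes.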
