Bug #634

closed

Kernel client takes too long to recover after an MDS restart

Added by Ravi Pinjala over 13 years ago. Updated over 13 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Description

[208292.940934] libceph: mds0 192.168.1.11:6800 socket closed
[208293.050282] libceph: mds0 192.168.1.11:6800 connection failed
[208343.050057] ceph: mds0 caps stale
[208358.050075] ceph: mds0 caps stale
[208545.126700] ceph: mds0 reconnect start
[208545.280853] ceph: mds0 reconnect success
[208546.581244] ceph: mds0 recovery completed

This is after restarting the MDS; the daemon came back up within a few seconds. Note the timestamps: about 252 seconds (roughly four minutes) passed between the "connection failed" message and "reconnect start". During that time all I/O operations hung; once the client reconnected, everything worked again.
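
For reference, the gaps can be worked out directly from the log above. The following is only a small sketch for doing that arithmetic; the helper names (`parse`, `first_time`) are made up for this example and are not part of any Ceph tooling.

```python
import re

# Kernel log excerpt copied from this report.
LOG = """\
[208292.940934] libceph: mds0 192.168.1.11:6800 socket closed
[208293.050282] libceph: mds0 192.168.1.11:6800 connection failed
[208343.050057] ceph: mds0 caps stale
[208358.050075] ceph: mds0 caps stale
[208545.126700] ceph: mds0 reconnect start
[208545.280853] ceph: mds0 reconnect success
[208546.581244] ceph: mds0 recovery completed
"""


def parse(log):
    """Return (timestamp_seconds, message) pairs from dmesg-style lines."""
    events = []
    for line in log.splitlines():
        m = re.match(r"\[\s*(\d+\.\d+)\]\s+(.*)", line)
        if m:
            events.append((float(m.group(1)), m.group(2)))
    return events


def first_time(events, needle):
    """Timestamp of the first message containing `needle` (hypothetical helper)."""
    for t, msg in events:
        if needle in msg:
            return t
    raise ValueError(f"no log line contains {needle!r}")


events = parse(LOG)
failed = first_time(events, "connection failed")
start = first_time(events, "reconnect start")
done = first_time(events, "recovery completed")

wait = start - failed      # time spent before the client tried to reconnect
recovery = done - start    # time the reconnect itself took

print(f"waited {wait:.1f}s before reconnecting; recovery took {recovery:.1f}s")
# prints: waited 252.1s before reconnecting; recovery took 1.5s
```

So the reconnect itself takes under two seconds; almost all of the outage is the wait before the first reconnect attempt.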

I guess we don't want to barrage the server with connections if something more permanent happens to the MDS, so some kind of bounded exponential backoff might be appropriate here.
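
To illustrate what is meant by bounded exponential backoff, here is a minimal sketch. It is written in Python for brevity rather than as kernel code, and it does not reflect how the kernel client actually schedules reconnects. The function name and the parameters `base`, `cap`, `factor`, and `attempts` are hypothetical, and the values are only illustrative.

```python
def backoff_delays(base=1.0, cap=60.0, factor=2.0, attempts=8):
    """Yield the wait (in seconds) before each reconnect attempt.

    The delay grows by `factor` after every failed attempt but never
    exceeds `cap`, so a long MDS outage is not met with a flood of
    connection attempts, while a quick restart is noticed early.
    All parameter names and defaults are assumptions for illustration.
    """
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay = min(delay * factor, cap)


delays = list(backoff_delays())
print(delays)
# prints: [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With numbers like these, an MDS that comes back after a few seconds would be retried within the first handful of attempts, and the retry rate against an MDS that stays down would settle at one attempt per `cap` seconds.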
