Bug #634

closed

Kernel client takes too long to recover after an MDS restart

Added by Ravi Pinjala over 13 years ago. Updated over 13 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

[208292.940934] libceph: mds0 192.168.1.11:6800 socket closed
[208293.050282] libceph: mds0 192.168.1.11:6800 connection failed
[208343.050057] ceph: mds0 caps stale
[208358.050075] ceph: mds0 caps stale
[208545.126700] ceph: mds0 reconnect start
[208545.280853] ceph: mds0 reconnect success
[208546.581244] ceph: mds0 recovery completed

This is after restarting the MDS; the daemon came back up after a few seconds. Note the timestamps: the kernel client waited roughly four minutes (about 252 seconds between "connection failed" and "reconnect start") before attempting a reconnect. During that time all I/O operations hung; once it reconnected, everything worked again.
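To make the gaps in the log above concrete, here is a minimal Python sketch that parses the kernel timestamps and measures the time between events. The log text is copied from this report; the helper names `parse` and `seconds_between` are made up for illustration and are not part of any Ceph tooling.

```python
import re

# Kernel log lines copied verbatim from the report above.
LOG = """\
[208292.940934] libceph: mds0 192.168.1.11:6800 socket closed
[208293.050282] libceph: mds0 192.168.1.11:6800 connection failed
[208343.050057] ceph: mds0 caps stale
[208358.050075] ceph: mds0 caps stale
[208545.126700] ceph: mds0 reconnect start
[208545.280853] ceph: mds0 reconnect success
[208546.581244] ceph: mds0 recovery completed
"""

# Matches "[seconds.micros] message" as printed by dmesg.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse(log):
    """Return a list of (timestamp_seconds, message) tuples."""
    events = []
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def seconds_between(events, first, second):
    """Seconds from the first message containing `first` to the next
    message at or after it containing `second`."""
    start = next(t for t, msg in events if first in msg)
    end = next(t for t, msg in events if second in msg and t >= start)
    return end - start


events = parse(LOG)
wait = seconds_between(events, "connection failed", "reconnect start")
total = seconds_between(events, "socket closed", "recovery completed")
print(f"connection failed -> reconnect start: {wait:.1f} s")
print(f"socket closed -> recovery completed: {total:.1f} s")
```

Almost all of the outage is spent before "reconnect start"; the reconnect and recovery steps themselves finish in under two seconds.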

I guess we don't want to barrage the server with connections if something more permanent happens to the MDS, so some kind of bounded exponential backoff might be appropriate here.
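As an illustration of what "bounded exponential backoff" means here, the following is a minimal Python sketch of a retry-delay schedule. It is not the kernel client's code and does not describe its actual behavior; the function name `backoff_delays` and the parameter values (1 second base, doubling, 60 second cap) are assumptions chosen for the example.

```python
import random
from itertools import islice


def backoff_delays(base=1.0, factor=2.0, cap=60.0, jitter=0.0, rng=None):
    """Yield retry delays in seconds: base, base*factor, ... up to cap.

    All parameters are illustrative assumptions. `jitter` is a fraction
    (0 <= jitter < 1) of the current delay used to randomize each wait,
    so that many clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delay = base
    while True:
        if jitter:
            spread = delay * jitter
            # Never exceed the cap, even after adding jitter.
            yield min(cap, delay + rng.uniform(-spread, spread))
        else:
            yield delay
        # Grow the delay geometrically, but keep it bounded.
        delay = min(delay * factor, cap)


# First eight delays without jitter.
print(list(islice(backoff_delays(), 8)))
```

With a cap in place, a client retries quickly after a short outage, and the server sees at most one attempt per `cap` seconds from each client if the outage turns out to be long.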

Actions #1

Updated by Sage Weil over 13 years ago

The client doesn't 'reconnect' until the MDS reaches the up:reconnect state. That's preceded by up:replay (journal replay), which may take many seconds, and up:resolve (which should be very fast). You might check the mdsmap timestamps (ceph mds dump -o - [epoch #]) on past epochs and look at the mtime to see how long the replay and resolve stages took.

Actions #2

Updated by Greg Farnum over 13 years ago

It's also possible (though unlikely) that the client isn't getting an updated MDSMap quickly enough or that the MDS timeouts got broken somehow.
I mention this just because in my experience MDS reconnects take closer to 90 seconds than 4 minutes, so let's figure out where all the time is going!

Actions #3

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce