Bug #634
Status: Closed
Kernel client takes too long to recover after an MDS restart
Description
[208292.940934] libceph: mds0 192.168.1.11:6800 socket closed
[208293.050282] libceph: mds0 192.168.1.11:6800 connection failed
[208343.050057] ceph: mds0 caps stale
[208358.050075] ceph: mds0 caps stale
[208545.126700] ceph: mds0 reconnect start
[208545.280853] ceph: mds0 reconnect success
[208546.581244] ceph: mds0 recovery completed
This is after restarting the MDS (the daemon came back up within a few seconds). Note the timestamps: the kernel client waited over four minutes before attempting a reconnect. During that time all I/O operations hung; once it reconnected, everything worked again.
I guess we don't want to barrage the server with connection attempts if something more permanent has happened to the MDS, so some kind of bounded exponential backoff might be appropriate here.
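A minimal sketch of what bounded exponential backoff could look like for reconnect attempts (this is an illustration of the idea, not the kernel client's actual retry logic; the function name and parameters are hypothetical):

```python
def backoff_delays(base=1.0, cap=30.0, factor=2.0, attempts=6):
    """Return the delay (in seconds) before each reconnect attempt.

    The delay doubles on every failed attempt but is capped, so a
    long MDS outage never pushes the wait beyond `cap` seconds.
    Hypothetical values; a real implementation would also add jitter.
    """
    delay = base
    delays = []
    for _ in range(attempts):
        delays.append(delay)
        delay = min(cap, delay * factor)
    return delays

print(backoff_delays())  # 1s, 2s, 4s, 8s, 16s, then capped at 30s
```

The cap keeps a client from waiting arbitrarily long between retries once the MDS is back, while the exponential growth keeps it from hammering a server that stays down.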
Updated by Sage Weil over 13 years ago
The client doesn't 'reconnect' until the MDS reaches the up:reconnect state. That's preceded by up:replay (journal replay), which may take many seconds, and up:resolve (which should be very fast). You might check the mdsmap timestamps (ceph mds dump -o - [epoch #]) on past epochs and look at the mtime to see how long the replay and resolve stages took.
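To see where the time went, the per-epoch mtimes from successive dumps can be turned into per-state durations. A small sketch, assuming you have already parsed (state, mtime) pairs out of the ceph mds dump output; the timestamps below are illustrative, not real cluster data:

```python
from datetime import datetime

# Hypothetical epoch records parsed from successive `ceph mds dump` outputs.
epochs = [
    ("up:replay",    "2011-01-10 12:00:05"),
    ("up:resolve",   "2011-01-10 12:03:40"),
    ("up:reconnect", "2011-01-10 12:03:42"),
    ("up:active",    "2011-01-10 12:03:55"),
]

def stage_durations(epochs):
    """Seconds spent in each MDS state before it moved to the next one."""
    times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for _, t in epochs]
    return {epochs[i][0]: (times[i + 1] - times[i]).total_seconds()
            for i in range(len(epochs) - 1)}

print(stage_durations(epochs))
```

With numbers like these, a long up:replay stage (here 215 seconds) would account for most of the observed reconnect delay.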
Updated by Greg Farnum over 13 years ago
It's also possible (though unlikely) that the client isn't getting an updated MDSMap quickly enough or that the MDS timeouts got broken somehow.
I mention this just because in my experience MDS reconnects take closer to 90 seconds than 4 minutes, so let's figure out where all the time is going!
Updated by Sage Weil over 13 years ago
- Status changed from New to Can't reproduce