Bug #634
Status: Closed
Kernel client takes too long to recover after an MDS restart
Description
[208292.940934] libceph: mds0 192.168.1.11:6800 socket closed
[208293.050282] libceph: mds0 192.168.1.11:6800 connection failed
[208343.050057] ceph: mds0 caps stale
[208358.050075] ceph: mds0 caps stale
[208545.126700] ceph: mds0 reconnect start
[208545.280853] ceph: mds0 reconnect success
[208546.581244] ceph: mds0 recovery completed
This is after restarting the MDS (the daemon came back up within a few seconds). Note the timestamps: the kernel client waited over four minutes before attempting a reconnect. During that time all I/O operations hung; once it reconnected, everything worked again.
I guess we don't want to barrage the server with connection attempts if something more permanent has happened to the MDS, so some kind of bounded exponential backoff might be appropriate here.
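A minimal sketch of what bounded exponential backoff could look like for reconnect attempts (this is an illustration of the idea, not the kernel client's actual retry logic; the function name and parameters are hypothetical):

```python
def backoff_delays(base=1.0, cap=30.0, factor=2.0, attempts=6):
    """Return the delay (in seconds) before each reconnect attempt.

    The delay doubles on every failed attempt but is capped, so a
    long MDS outage never pushes the wait beyond `cap` seconds.
    Hypothetical values; a real implementation would also add jitter.
    """
    delay = base
    delays = []
    for _ in range(attempts):
        delays.append(delay)
        delay = min(cap, delay * factor)
    return delays

print(backoff_delays())  # 1s, 2s, 4s, 8s, 16s, then capped at 30s
```

The cap keeps a client from waiting arbitrarily long between retries once the MDS is back, while the exponential growth keeps it from hammering a server that stays down.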
Updated by Sage Weil over 13 years ago
The client doesn't 'reconnect' until the MDS reaches the up:reconnect state. That's preceded by up:replay (journal replay), which may take many seconds, and up:resolve (which should be very fast). You might check the mdsmap timestamps (ceph mds dump -o - [epoch #]) on past epochs and look at the mtime to see how long the replay and resolve stages took.
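To see where the time went, the per-epoch mtimes from successive dumps can be turned into per-state durations. A small sketch, assuming you have already parsed (state, mtime) pairs out of the ceph mds dump output; the timestamps below are illustrative, not real cluster data:

```python
from datetime import datetime

# Hypothetical epoch records parsed from successive `ceph mds dump` outputs.
epochs = [
    ("up:replay",    "2011-01-10 12:00:05"),
    ("up:resolve",   "2011-01-10 12:03:40"),
    ("up:reconnect", "2011-01-10 12:03:42"),
    ("up:active",    "2011-01-10 12:03:55"),
]

def stage_durations(epochs):
    """Seconds spent in each MDS state before it moved to the next one."""
    times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for _, t in epochs]
    return {epochs[i][0]: (times[i + 1] - times[i]).total_seconds()
            for i in range(len(epochs) - 1)}

print(stage_durations(epochs))
```

With numbers like these, a long up:replay stage (here 215 seconds) would account for most of the observed reconnect delay.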
Updated by Greg Farnum over 13 years ago
It's also possible (though unlikely) that the client isn't getting an updated MDSMap quickly enough or that the MDS timeouts got broken somehow.
I mention this just because in my experience MDS reconnects take closer to 90 seconds than 4 minutes, so let's figure out where all the time is going!
Updated by Sage Weil over 13 years ago
- Status changed from New to Can't reproduce