Bug #634
closedKernel client takes too long to recover after a MDS restart
0%
Description
[208292.940934] libceph: mds0 192.168.1.11:6800 socket closed
[208293.050282] libceph: mds0 192.168.1.11:6800 connection failed
[208343.050057] ceph: mds0 caps stale
[208358.050075] ceph: mds0 caps stale
[208545.126700] ceph: mds0 reconnect start
[208545.280853] ceph: mds0 reconnect success
[208546.581244] ceph: mds0 recovery completed
This is after restarting the MDS (so that the daemon came back up after a few seconds). Note the timestamps - the kernel client waited several minutes to attempt a reconnect. During this time, all I/O operations were hanging, until it reconnected (at which point everything worked).
I guess we don't want to barrage the server with connections if something more permanent happens to the MDS, so some kind of bounded exponential backoff might be appropriate here.