mds dies on ESHUTDOWN under too-high mon load (time-outs?)
It's relatively common for the active mds to die while I run backups from/to the filesystems that hold the mon data in btrfs subvolumes.
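Some debugging shows that the result read in osdc/Objecter.cc:
739 int rc = m->get_result();
is -108 (ESHUTDOWN; see r=-108 in frame #16 below), which then trips the assertion in osdc/Filer.cc:
48 assert(r == 0);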
#15 0x00000000006d5f34 in Filer::C_Probe::finish (this=<value optimized out>,
r=<value optimized out>) at osdc/Filer.cc:48
#16 0x00000000006d3a3e in Objecter::C_Stat::finish (this=0x169e740, r=-108)
#17 0x00000000006b7c57 in Objecter::handle_osd_op_reply (this=0x15d0240,
m=0x16908c0) at osdc/Objecter.cc:796
#18 0x00000000004c70ff in MDS::handle_core_message (this=0x15f8500,
m=0x16908c0) at mds/MDS.cc:1688
I've been experiencing this for a while, but the above is on 0.30, and this is the first time I've looked into why the mds crashed. No further debugging so far.
#2 Updated by Greg Farnum over 8 years ago
- Status changed from New to Won't Fix
In the Ceph project, ESHUTDOWN doubles as EBLACKLISTED. So the MDS is timing out on its heartbeats and the mon is killing (blacklisting) it, and the MDS finds out when it talks to an OSD and gets rejected.
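To make the errno aliasing concrete, here is a minimal sketch of how the r=-108 in the backtrace could be recognized as a blacklist rejection. The `EBLACKLISTED` define below only mirrors the aliasing described in this comment and is not copied from the Ceph tree; `is_blacklisted` and `describe_result` are hypothetical helpers for illustration.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Assumption: mirrors the "ESHUTDOWN is also EBLACKLISTED" aliasing described
// above. This define is written for illustration, not taken from Ceph.
#ifndef EBLACKLISTED
#define EBLACKLISTED ESHUTDOWN
#endif

// Hypothetical helper: completion callbacks receive negated errno values,
// which is why the backtrace shows r=-108 (ESHUTDOWN is 108 on Linux).
inline bool is_blacklisted(int r) {
    return r == -EBLACKLISTED;
}

// Hypothetical helper: turn a callback result into a readable message.
inline std::string describe_result(int r) {
    if (r == 0) {
        return "ok";
    }
    if (is_blacklisted(r)) {
        return "blacklisted (ESHUTDOWN)";
    }
    return std::strerror(-r);
}
```

With this, `describe_result(-ESHUTDOWN)` yields the blacklist message rather than the generic "cannot send after transport endpoint shutdown" text, which is closer to what is actually happening here.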
If you think there's something going on here that shouldn't be (the mon blacklisting incorrectly, timeouts caused by a bug rather than real load, etc.), give us more information and we can look into it. But without something more, I think the system's doing what it should. :)
(You could also turn up the timeout limit -- the MDS defaults to sending a beacon every 4 seconds, and the monitor will kill it if it doesn't get a beacon for 15 seconds. You might try extending that period and see if it works better for you.)
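As a rough illustration of how little slack those defaults leave, the sketch below models the beacon timing. It is a simplification that ignores network latency and mon-side processing; `BeaconPolicy` and both functions are hypothetical names, and treating the limit as strictly "more than 15 seconds" is an assumption.

```cpp
#include <chrono>

using Seconds = std::chrono::seconds;

// Values quoted in the comment above; the struct itself is hypothetical.
struct BeaconPolicy {
    Seconds interval{4};  // the MDS sends a beacon this often
    Seconds grace{15};    // the mon gives up after this long without one
};

// True if, at time `now`, the mon has gone longer than the grace period
// without a beacon (assumption: the comparison is strict).
inline bool mon_would_fail_mds(const BeaconPolicy& p, Seconds last_beacon,
                               Seconds now) {
    return now - last_beacon > p.grace;
}

// Largest number of consecutive beacons that can be lost while the next
// one still arrives within the grace period. After k losses the next beacon
// arrives at (k + 1) * interval, so k <= grace / interval - 1.
inline long max_consecutive_missed_beacons(const BeaconPolicy& p) {
    if (p.interval.count() <= 0) {
        return 0;  // guard against a nonsensical interval
    }
    const long k = p.grace.count() / p.interval.count() - 1;
    return k < 0 ? 0 : k;
}
```

Under these assumptions the defaults tolerate only two consecutive lost beacons (the third would have to arrive at 16 seconds, past the 15-second limit), while a 60-second limit would tolerate fourteen. If I remember correctly the corresponding ceph.conf options are `mds beacon interval` and `mds beacon grace`, but check the documentation for your version before relying on those names.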