Bug #1256

mds dies on ESHUTDOWN under too-high mon load (time-outs?)

Added by Alexandre Oliva about 8 years ago. Updated about 8 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 07/04/2011
Due date:
% Done: 0%
Spent time:
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

It's relatively common for the active mds to die while I run backups from/to the filesystems that hold the mon data in btrfs subvolumes.

Some debugging shows that this line in osdc/Objecter.cc:

739 int rc = m->get_result();

stores -108 (-ESHUTDOWN) in rc at stack frame #17. The value is then passed down to frames #16 and #15, where it fails an assertion in osdc/Filer.cc:

48 assert(r == 0);

#15 0x00000000006d5f34 in Filer::C_Probe::finish (this=<value optimized out>, r=<value optimized out>) at osdc/Filer.cc:48
#16 0x00000000006d3a3e in Objecter::C_Stat::finish (this=0x169e740, r=-108) at osdc/Objecter.h:367
#17 0x00000000006b7c57 in Objecter::handle_osd_op_reply (this=0x15d0240, m=0x16908c0) at osdc/Objecter.cc:796
#18 0x00000000004c70ff in MDS::handle_core_message (this=0x15f8500, m=0x16908c0) at mds/MDS.cc:1688

I've been experiencing this for a while. The trace above is from 0.30, and this is the first time I looked into why the mds crashed. I have not debugged any further so far.

History

#1 Updated by Sage Weil about 8 years ago

  • Target version set to v0.32

#2 Updated by Greg Farnum about 8 years ago

  • Status changed from New to Won't Fix

ESHUTDOWN is also EBLACKLISTED for the Ceph project. So the MDS is timing out on its heartbeats and the mon is killing it, and the MDS is finding out by communicating with an OSD and getting rejected.
If you think there's something going on here that shouldn't be (mon incorrectly blacklisting, timeouts aren't because of real load but a bug, etc), give us more information and we can look into it. But without something more, I think the system's doing what it should. :)
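
To illustrate the aliasing described above, here is a minimal C++ sketch of how a completion callback could tell "blacklisted" apart from other failures instead of asserting r == 0. It assumes EBLACKLISTED is just another name for ESHUTDOWN, as stated in this comment. The ProbeOutcome type and classify_probe_result function are hypothetical names made up for this example and are not Ceph code.

```cpp
#include <cassert>
#include <cerrno>

// Assumption: Ceph reuses the ESHUTDOWN errno value to mean "blacklisted".
// The macro is defined here only so the sketch is self-contained.
#ifndef EBLACKLISTED
#define EBLACKLISTED ESHUTDOWN
#endif

// Hypothetical classification of an OSD op result; not a Ceph type.
enum class ProbeOutcome { Ok, Blacklisted, OtherError };

// Map a negative-errno result code to an outcome instead of asserting r == 0.
inline ProbeOutcome classify_probe_result(int r) {
  if (r == 0) return ProbeOutcome::Ok;
  if (r == -EBLACKLISTED) return ProbeOutcome::Blacklisted;
  return ProbeOutcome::OtherError;
}
```

With this shape, a caller that receives the -108 from the backtrace on Linux would see ProbeOutcome::Blacklisted and could shut down cleanly rather than hit an assertion.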

(You could also turn up the timeout limit -- the MDS defaults to sending a beacon every 4 seconds, and the monitor will kill it if it doesn't get a beacon for 15. You might try extending that period and see if it works better for you.)
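
The timing above can be sketched as a small C++ model. It only uses the two numbers quoted in this comment (a beacon every 4 seconds, a 15 second limit). The BeaconPolicy struct, its field names, and the strict "greater than" comparison are assumptions made for illustration and do not reflect the monitor's actual code. In Ceph these values likely correspond to the "mds beacon interval" and "mds beacon grace" settings, but check the documentation for your version.

```cpp
#include <cassert>

// Hypothetical model of the beacon timing; values are the defaults quoted above.
struct BeaconPolicy {
  double interval_s = 4.0;  // how often the MDS sends a beacon
  double grace_s = 15.0;    // how long the monitor waits without a beacon
};

// True if, at time now_s, the monitor would consider the MDS timed out.
// Assumption: the limit is exceeded only when the gap is strictly greater.
inline bool beacon_timed_out(const BeaconPolicy& p, double last_beacon_s,
                             double now_s) {
  return (now_s - last_beacon_s) > p.grace_s;
}

// Number of whole beacon intervals that fit inside the grace window.
inline int beacons_within_grace(const BeaconPolicy& p) {
  return static_cast<int>(p.grace_s / p.interval_s);
}
```

With the defaults, only three whole beacon intervals fit in the grace window, so a stall of a bit more than 15 seconds on a loaded machine is enough to trip the timeout. Raising grace_s in this model (for example to 60) widens that window accordingly.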

#3 Updated by Greg Farnum about 8 years ago

That said, I did make some adjustments, so hopefully it won't produce a core dump anymore in the latest unstable.
