Bug #3593: MDS crash in MDCache.cc _recovered() - Ceph

Bug #3593

closed

MDS crash in MDCache.cc _recovered()

Added by Matthew Via over 11 years ago. Updated almost 11 years ago.

Status:

Can't reproduce

Priority:

Normal

Source:

Development

Description

While rsyncing to cephfs, the active MDS frequently crashes. Attached is the tail of the log file from one of the crashed MDS daemons.

Files

alpha.log (55 KB), Matthew Via, 12/08/2012 10:40 AM
ceph.conf (4.52 KB), config file for ceph, Matthew Via, 12/08/2012 10:42 AM

Updated by Matthew Via over 11 years ago

File ceph.conf added

Updated by Sam Lang over 11 years ago

This looks like the objecter is trying to send and getting the ESHUTDOWN error code, because the mds tries to reconnect after going past the mds_beacon_grace period of 15 seconds. The ESHUTDOWN is propagating back up into the mds recovery code before hitting the assertion. Could this be related to the recent pipe changes?

Updated by Greg Farnum over 11 years ago

This has nothing to do with the pipe changes. It's getting ESHUTDOWN back from the OSD, not out of the local pipe. We use ESHUTDOWN to mean EBLACKLISTED; i.e., the MDS didn't heartbeat the monitors, so they shut it down and the OSDs can't talk to it any more.
I see the MDS did try to suicide (so it's recognizing the error correctly and responding as it should) and I thought that was a semi-clean shutdown, but maybe it isn't. The assert and core dump are a little rude, but it's going to turn off anyway. We should make it handle this more politely, but it's not something that we need to make happen before Bobtail.

As suggested, if this is happening repeatedly then the right thing to do is increase mds_beacon_grace, which increases how long the MDS can go without sending a beacon before it's considered to have failed. Its inability to send those beacons under load is another thing we need to look into.
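
To illustrate what handling this "more politely" could look like, here is a minimal Python sketch. It is not Ceph's code (the real logic lives in C++ in MDCache.cc); the names `Blacklisted`, `check_osd_result`, and `on_recovered` are hypothetical. The only thing taken from this thread is the convention that a negative ESHUTDOWN result from an OSD means the daemon has been blacklisted.

```python
import errno
import os


class Blacklisted(Exception):
    """The cluster has blacklisted this daemon; it must stop, not retry."""


def check_osd_result(r: int) -> int:
    """Pass through success codes; turn negative errno values into exceptions."""
    if r >= 0:
        return r
    if -r == errno.ESHUTDOWN:
        # Per the comment above, ESHUTDOWN from an OSD means "blacklisted".
        raise Blacklisted("OSD returned ESHUTDOWN; daemon is blacklisted")
    raise OSError(-r, os.strerror(-r))


def on_recovered(r: int) -> str:
    """Hypothetical recovery callback: decide what to do instead of asserting r == 0."""
    try:
        check_osd_result(r)
    except Blacklisted:
        return "shutdown"  # exit cleanly; no assert, no core dump
    return "continue"
```

Any other negative result still surfaces as an `OSError`, so only the blacklist case is special-cased into a clean shutdown.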

Updated by Matthew Via over 11 years ago

At the moment I can't do any more to debug this (until this evening), but I have set the beacon grace to 120 seconds and still run into this problem with about the same frequency.
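
For reference, a setting like the one described above might look roughly like this in ceph.conf. This is a sketch, not the reporter's actual file: the value 120 comes from the comment above, while placing it under `[global]` is an assumption, made so that the MDS daemons and the monitors both read the same value.

```ini
[global]
    ; Assumed placement: [global] so both MDS and monitors see it.
    ; Seconds an MDS may go without a beacon before it is considered failed.
    mds beacon grace = 120
```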

Updated by Greg Farnum over 11 years ago

Oh, I see. It does indeed start with a bunch of broken pipe messages between the MDS and monitor. Can you post more of the log before the end? Is it possible that you've actually got network issues between the boxes?

Updated by Matthew Via over 11 years ago

Here is a piece of log from an mds dying with debug ms=1 and debug mds=20: https://pastee.org/hf3cy
Here's another from another mds dying shortly after the first: https://pastee.org/z64hz

There are no networking issues that I can tell. One issue I had a few days ago was that I'd allowed only 6800:6820 through the firewall, and the daemons died so frequently that the port numbers they bound to grew beyond that range, but that is no longer the case.
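
The firewall problem described above can be checked with a small Python sketch. The helper names and the list of bound ports are hypothetical; only the `6800:6820` range comes from this comment. It flags any port a daemon has bound to that falls outside the allowed range.

```python
def parse_port_range(spec: str) -> range:
    """Parse an inclusive 'low:high' port range such as '6800:6820'."""
    low, high = (int(part) for part in spec.split(":"))
    return range(low, high + 1)


def blocked_ports(bound_ports, allowed_spec):
    """Return the bound ports that the firewall range does not cover."""
    allowed = parse_port_range(allowed_spec)
    return [port for port in bound_ports if port not in allowed]


# Hypothetical ports for daemons that restarted several times.
print(blocked_ports([6800, 6812, 6821, 6835], "6800:6820"))  # [6821, 6835]
```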

Updated by Greg Farnum over 11 years ago

Both of these logs show a respawn because the MDS got removed from the map (generally, for not heartbeating). That again is not just expected but polite behavior. If you can get me a full log (not just the tail) we can look at when it's sending out beacons and check if that's correct.
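
As a rough illustration of the beacon check described above, the following Python sketch takes a list of beacon send times (in seconds) and reports any gap longer than the grace period. Extracting those timestamps from a full MDS log is left out, since the log line format is not shown in this thread; the function name and the sample times are hypothetical.

```python
from typing import List, Tuple


def find_beacon_gaps(send_times: List[float], grace: float) -> List[Tuple[float, float]]:
    """Return (previous, next) send-time pairs whose spacing exceeds grace seconds."""
    ordered = sorted(send_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > grace]


# Hypothetical send times: a 22-second silence between t=8 and t=30.
print(find_beacon_gaps([0, 4, 8, 30, 34], grace=15))  # [(8, 30)]
```

A gap longer than the grace period would point at the MDS failing to send beacons, while regular sends with no gaps would point toward the network or the monitors.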

Updated by Sage Weil almost 11 years ago

Status changed from New to Can't reproduce
