Bug #3593
closed
MDS crash in MDCache.cc _recovered()
Description
While rsyncing to cephfs, the active mds frequently crashes. Attached is the tail of the logfile of one of them.
Updated by Sam Lang over 11 years ago
This looks like the objecter is trying to send and getting the ESHUTDOWN error code, because the mds tries to reconnect after going past the mds_beacon_grace period of 15 seconds. The ESHUTDOWN is propagating back up into the mds recovery code before hitting the assertion. Could this be related to the recent pipe changes?
Updated by Greg Farnum over 11 years ago
This has nothing to do with the pipe changes. It's getting ESHUTDOWN back from the OSD, not out of the local pipe. We use ESHUTDOWN to mean EBLACKLISTED; i.e., the MDS didn't heartbeat the monitors, so they shut it down and the OSDs can't talk to it any more.
I see the MDS did try to suicide (so it's recognizing the error correctly and responding as it should) and I thought that was a semi-clean shutdown, but maybe it isn't. The assert and core dump are a little rude, but it's going to turn off anyway. We should make it handle this more politely, but it's not something that we need to make happen before Bobtail.
As suggested, if this is happening repeatedly then the right thing to do is increase mds_beacon_grace, which extends how long the MDS can go without beaconing before it's considered to have failed. Its inability to send those beacons under load is another thing we need to look into.
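To make the grace-period idea concrete, here is a minimal sketch of a beacon liveness check. It is a toy model, not Ceph's monitor code: the class name BeaconTracker and its methods are hypothetical, and only the 15-second default and the 120-second value come from this thread.

```python
from dataclasses import dataclass


@dataclass
class BeaconTracker:
    """Toy model of a grace-period liveness check (hypothetical, not Ceph code)."""

    grace: float = 15.0        # seconds; stands in for mds_beacon_grace
    last_beacon: float = 0.0   # time of the most recent beacon, in seconds

    def record_beacon(self, now: float) -> None:
        # Remember when the daemon last reported in.
        self.last_beacon = now

    def is_failed(self, now: float) -> bool:
        # The daemon counts as failed once more than `grace` seconds
        # have passed since its last beacon.
        return (now - self.last_beacon) > self.grace
```

In this model, a daemon that stalls for 20 seconds is marked failed with a grace of 15 but survives with a grace of 120, which is the effect that raising the setting is meant to have.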
Updated by Matthew Via over 11 years ago
At the moment I can't do any more to debug this (until this evening), but I have set the beacon grace to 120 seconds and still run into this problem with about the same frequency.
Updated by Greg Farnum over 11 years ago
Oh, I see. It does indeed start with a bunch of broken pipe messages between the MDS and monitor. Can you post more of the log before the end? Is it possible that you've actually got network issues between the boxes?
Updated by Matthew Via over 11 years ago
Here is a piece of log from an mds dying with ms=1 and mds=20: https://pastee.org/hf3cy
Here's another from another mds dying shortly after the first: https://pastee.org/z64hz
There are no networking issues that I can tell. One issue I had a few days ago was that I'd allowed 6800:6820 through the firewall, but the daemons died so frequently that the port numbers they bound to grew beyond that range. That is no longer the case.
Updated by Greg Farnum over 11 years ago
Both of these logs show a respawn because the MDS got removed from the map (generally, for not heartbeating). That again is not just expected but polite behavior. If you can get me a full log (not just the tail) we can look at when it's sending out beacons and check if that's correct.
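Once beacon send times have been pulled out of a full log, checking them against the grace period could look roughly like the sketch below. It assumes the timestamps have already been extracted as seconds; the function name find_beacon_gaps is hypothetical, and no particular log format is implied.

```python
from typing import List, Tuple


def find_beacon_gaps(send_times: List[float], grace: float) -> List[Tuple[float, float]]:
    """Return (previous, next) pairs of beacon times more than `grace` seconds apart."""
    ordered = sorted(send_times)
    # Compare each beacon time with the one that follows it.
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > grace]


# Example: one 22-second silence with a 15-second grace period.
print(find_beacon_gaps([0.0, 4.0, 8.0, 30.0, 34.0], grace=15.0))  # → [(8.0, 30.0)]
```

Any pair it reports marks a window in which the MDS went silent for long enough to be removed from the map.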
Updated by Sage Weil almost 11 years ago
- Status changed from New to Can't reproduce