Bug #10382

closed

mds/MDS.cc: In function 'void MDS::heartbeat_reset()'

Added by Wido den Hollander over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While running an Active/Standby pair of MDSes, I see this quite often when stopping the Active MDS:

     0> 2014-12-18 17:49:16.663686 7f62b03ce700 -1 mds/MDS.cc: In function 'void MDS::heartbeat_reset()' thread 7f62b03ce700 time 2014-12-18 17:49:16.660035
mds/MDS.cc: 2694: FAILED assert(hb != __null)

 ceph version 0.89 (68fdc0f68e6a04e283d2c5140832a3175b4f9840)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x91d80b]
 2: /usr/bin/ceph-mds() [0x58f602]
 3: (MDS::ms_dispatch(Message*)+0x2d) [0x5a72dd]
 4: (DispatchQueue::entry()+0x649) [0x9fb589]
 5: (DispatchQueue::DispatchThread::entry()+0xd) [0x90751d]
 6: (()+0x8182) [0x7f62b594d182]
 7: (clone()+0x6d) [0x7f62b40bcefd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Looking at the code I see this:

void MDS::heartbeat_reset()
{
  assert(hb != NULL);
  // NB not enabling suicide grace, because the mon takes care of killing us
  // (by blacklisting us) when we fail to send beacons, and it's simpler to
  // only have one way of dying.
  cct->get_heartbeat_map()->reset_timeout(hb, g_conf->mds_beacon_grace, 0);
}

The comment says the monitor should blacklist the MDS.

In this case the rest of the cluster is running v0.87 and only the MDS is running v0.89. Could that be the issue?
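
One possible reading of the backtrace, not confirmed anywhere in this report, is that ms_dispatch() still delivers a message after shutdown has already released hb, so heartbeat_reset() hits the assert. The sketch below models that ordering with a toy class. ToyMDS, HeartbeatHandle, shutdown(), dispatch() and both heartbeat_reset_* variants are hypothetical names for illustration and are not Ceph code; the tolerant variant is only one way such a late call could be handled.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the heartbeat handle; not Ceph's real type.
struct HeartbeatHandle {
  int resets = 0;  // counts how many times the timeout was reset
};

// Hypothetical, heavily simplified single-threaded model of the daemon.
class ToyMDS {
 public:
  ToyMDS() : hb_(new HeartbeatHandle) {}

  // Models shutdown: the handle is released while messages may still arrive.
  void shutdown() { hb_.reset(); }

  // Mirrors the shape of the reported function: assumes hb is always valid.
  void heartbeat_reset_strict() {
    assert(hb_ != nullptr);  // aborts if called after shutdown()
    ++hb_->resets;
  }

  // Tolerant variant: a call that arrives after shutdown is a no-op.
  // Returns true only if the heartbeat was actually reset.
  bool heartbeat_reset_tolerant() {
    if (!hb_) return false;
    ++hb_->resets;
    return true;
  }

  // Models message dispatch: every incoming message resets the heartbeat.
  bool dispatch() { return heartbeat_reset_tolerant(); }

  // Number of resets so far, or -1 once the handle has been released.
  int resets() const { return hb_ ? hb_->resets : -1; }

 private:
  std::unique_ptr<HeartbeatHandle> hb_;
};
```

In this model, dispatch() returns true before shutdown() and false afterwards instead of aborting, whereas heartbeat_reset_strict() would trip the assert in the same situation.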
