Bug #10382 (closed)

mds/MDS.cc: In function 'void MDS::heartbeat_reset()'

Added by Wido den Hollander over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
% Done: 0%
Source: other
Severity: 3 - minor

Description

While running an Active/Standby set of MDSes, I see this happen quite often when stopping the active MDS:

     0> 2014-12-18 17:49:16.663686 7f62b03ce700 -1 mds/MDS.cc: In function 'void MDS::heartbeat_reset()' thread 7f62b03ce700 time 2014-12-18 17:49:16.660035
mds/MDS.cc: 2694: FAILED assert(hb != __null)

 ceph version 0.89 (68fdc0f68e6a04e283d2c5140832a3175b4f9840)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x91d80b]
 2: /usr/bin/ceph-mds() [0x58f602]
 3: (MDS::ms_dispatch(Message*)+0x2d) [0x5a72dd]
 4: (DispatchQueue::entry()+0x649) [0x9fb589]
 5: (DispatchQueue::DispatchThread::entry()+0xd) [0x90751d]
 6: (()+0x8182) [0x7f62b594d182]
 7: (clone()+0x6d) [0x7f62b40bcefd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Looking at the code I see this:

void MDS::heartbeat_reset()
{
  assert(hb != NULL);
  // NB not enabling suicide grace, because the mon takes care of killing us
  // (by blacklisting us) when we fail to send beacons, and it's simpler to
  // only have one way of dying.
  cct->get_heartbeat_map()->reset_timeout(hb, g_conf->mds_beacon_grace, 0);
}

The comment says the monitor should blacklist the MDS.

In this case the whole cluster is running v0.87, but only the MDS is running v0.89. Could that be the issue?
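For reference, the backtrace suggests that a message can still reach MDS::ms_dispatch() (and therefore heartbeat_reset()) after the heartbeat handle hb has been torn down, for example during shutdown of the active MDS. A minimal sketch of one possible guard (an illustration only, not necessarily the fix that was merged upstream) would be to bail out instead of asserting when hb is NULL:

void MDS::heartbeat_reset()
{
  // Sketch only: if the heartbeat handle has already been released (for
  // example while the MDS is shutting down), there is nothing to reset,
  // so return instead of hitting the assert.
  if (hb == NULL) {
    return;
  }

  // NB not enabling suicide grace, because the mon takes care of killing us
  // (by blacklisting us) when we fail to send beacons, and it's simpler to
  // only have one way of dying.
  cct->get_heartbeat_map()->reset_timeout(hb, g_conf->mds_beacon_grace, 0);
}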

Actions #1

Updated by John Spray over 9 years ago

  • Status changed from New to In Progress
  • Assignee set to John Spray
Actions #2

Updated by Wido den Hollander over 9 years ago

I tried to reproduce this today with debug_mds set to 10 and then 20, but I wasn't able to trigger the crash.

I no longer have access to the cluster where this happened, so I can't test it any further.

Actions #3

Updated by Samuel Just over 9 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
Actions #4

Updated by John Spray over 9 years ago

  • Status changed from In Progress to Fix Under Review
Actions #5

Updated by John Spray over 9 years ago

This will need a backport to giant.

Actions #6

Updated by Greg Farnum about 9 years ago

  • Status changed from Fix Under Review to Pending Backport

Can you prepare a backport branch please?

Actions #8

Updated by Greg Farnum about 9 years ago

  • Status changed from Pending Backport to Resolved

Thanks!
