Bug #10382
closed
mds/MDS.cc: In function 'void MDS::heartbeat_reset()'
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
While running an Active/Standby set of MDSes, I see this happen quite often when stopping the Active MDS:
     0> 2014-12-18 17:49:16.663686 7f62b03ce700 -1 mds/MDS.cc: In function 'void MDS::heartbeat_reset()' thread 7f62b03ce700 time 2014-12-18 17:49:16.660035
mds/MDS.cc: 2694: FAILED assert(hb != __null)
 ceph version 0.89 (68fdc0f68e6a04e283d2c5140832a3175b4f9840)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x91d80b]
 2: /usr/bin/ceph-mds() [0x58f602]
 3: (MDS::ms_dispatch(Message*)+0x2d) [0x5a72dd]
 4: (DispatchQueue::entry()+0x649) [0x9fb589]
 5: (DispatchQueue::DispatchThread::entry()+0xd) [0x90751d]
 6: (()+0x8182) [0x7f62b594d182]
 7: (clone()+0x6d) [0x7f62b40bcefd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Looking at the code I see this:
void MDS::heartbeat_reset()
{
  assert(hb != NULL);
  // NB not enabling suicide grace, because the mon takes care of killing us
  // (by blacklisting us) when we fail to send beacons, and it's simpler to
  // only have one way of dying.
  cct->get_heartbeat_map()->reset_timeout(hb, g_conf->mds_beacon_grace, 0);
}
The comment says the monitor should blacklist the MDS.
In this case the rest of the cluster is running v0.87 and only the MDS is running v0.89. Could that be the issue?
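For illustration, the backtrace shows heartbeat_reset() being reached from ms_dispatch() with hb already null. Below is a minimal, self-contained sketch of how that could happen, assuming (this is not confirmed anywhere in this report) that the handle is released during shutdown while the dispatch thread can still deliver a message. ToyHandle, ToyMDS, shutdown() and the stopping flag are hypothetical stand-ins and not Ceph code. The guarded heartbeat_reset() is one possible way to tolerate the ordering, not necessarily how it is or should be handled upstream.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the heartbeat handle; not the Ceph type.
struct ToyHandle {
  int resets = 0;  // counts how often the timeout was reset
};

// Hypothetical toy model of the daemon, used only to show the ordering problem.
struct ToyMDS {
  std::unique_ptr<ToyHandle> hb{new ToyHandle()};
  bool stopping = false;

  // Assumed shutdown path: mark the daemon as stopping and free the handle.
  void shutdown() {
    stopping = true;
    hb.reset();
  }

  // Guarded variant: a missing handle is tolerated, but only during shutdown.
  // Returns true if the timeout was actually reset.
  bool heartbeat_reset() {
    if (!hb) {
      assert(stopping);  // a null handle is only expected while stopping
      return false;
    }
    ++hb->resets;
    return true;
  }

  // Mirrors the call chain in the backtrace: dispatch resets the heartbeat.
  bool ms_dispatch() {
    return heartbeat_reset();
  }
};
```

In this toy model, calling ms_dispatch() after shutdown() returns false instead of aborting, whereas an unconditional assert(hb != NULL) would fail at that point, as in the log above.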