Bug #10382 (closed)

mds/MDS.cc: In function 'void MDS::heartbeat_reset()'

Added by Wido den Hollander over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
% Done: 0%
Source: other
Severity: 3 - minor

Description

While running an Active/Standby set of MDSes, I see this happen quite often when stopping the active MDS:

     0> 2014-12-18 17:49:16.663686 7f62b03ce700 -1 mds/MDS.cc: In function 'void MDS::heartbeat_reset()' thread 7f62b03ce700 time 2014-12-18 17:49:16.660035
mds/MDS.cc: 2694: FAILED assert(hb != __null)

 ceph version 0.89 (68fdc0f68e6a04e283d2c5140832a3175b4f9840)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x91d80b]
 2: /usr/bin/ceph-mds() [0x58f602]
 3: (MDS::ms_dispatch(Message*)+0x2d) [0x5a72dd]
 4: (DispatchQueue::entry()+0x649) [0x9fb589]
 5: (DispatchQueue::DispatchThread::entry()+0xd) [0x90751d]
 6: (()+0x8182) [0x7f62b594d182]
 7: (clone()+0x6d) [0x7f62b40bcefd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Looking at the code I see this:

void MDS::heartbeat_reset()
{
  assert(hb != NULL);
  // NB not enabling suicide grace, because the mon takes care of killing us
  // (by blacklisting us) when we fail to send beacons, and it's simpler to
  // only have one way of dying.
  cct->get_heartbeat_map()->reset_timeout(hb, g_conf->mds_beacon_grace, 0);
}

The comment says the monitor should blacklist the MDS.

In this case the whole cluster is running v0.87, but only the MDS is running v0.89. Could that be the issue?
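For reference, the backtrace suggests that a message can still reach MDS::ms_dispatch() (and therefore heartbeat_reset()) after the heartbeat handle hb has been torn down, for example during shutdown of the active MDS. A minimal sketch of one possible guard (an illustration only, not necessarily the fix that was merged upstream) would be to bail out instead of asserting when hb is NULL:

void MDS::heartbeat_reset()
{
  // Sketch only: if the heartbeat handle has already been released (for
  // example while the MDS is shutting down), there is nothing to reset,
  // so return instead of hitting the assert.
  if (hb == NULL) {
    return;
  }

  // NB not enabling suicide grace, because the mon takes care of killing us
  // (by blacklisting us) when we fail to send beacons, and it's simpler to
  // only have one way of dying.
  cct->get_heartbeat_map()->reset_timeout(hb, g_conf->mds_beacon_grace, 0);
}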

Actions #1

Updated by John Spray over 9 years ago

  • Status changed from New to In Progress
  • Assignee set to John Spray
Actions #2

Updated by Wido den Hollander over 9 years ago

I tried to reproduce this today with debug_mds set to 10 and then 20, but I wasn't able to trigger the crash.

I no longer have access to the cluster where this happened, so I can't test it any further.

Actions #3

Updated by Samuel Just over 9 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
Actions #4

Updated by John Spray over 9 years ago

  • Status changed from In Progress to Fix Under Review
Actions #5

Updated by John Spray over 9 years ago

This will need a backport to giant.

Actions #6

Updated by Greg Farnum about 9 years ago

  • Status changed from Fix Under Review to Pending Backport

Can you prepare a backport branch please?

Actions #8

Updated by Greg Farnum about 9 years ago

  • Status changed from Pending Backport to Resolved

Thanks!
