Bug #11218

closed

Assertion on MDS rank `in` but without instance

Added by John Spray about 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Unintended consequence of an MDS being in the 'damaged' state: a peer in a multi-MDS environment hits an assertion because the damaged rank is 'in' but does not have an associated daemon. Exposed by test_journal_repair, which will also need updating so that it no longer expects a crash.

2015-03-23 16:37:28.125925 7f92d77a9700 -1 mds/MDSMap.h: In function 'const entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f92d77a9700 time 2015-03-23 16:37:28.122743
mds/MDSMap.h: 559: FAILED assert(up.count(m))

 ceph version 0.93-776-g0a3e47d (0a3e47d778b457ae878024f95f610b0a8c2fb490)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x97db2f]
 2: (MDBalancer::send_heartbeat()+0x163f) [0x7641ef]
 3: (MDBalancer::tick()+0x22a) [0x76cbaa]
 4: (MDS::tick()+0x364) [0x5b4444]
 5: (MDSInternalContextBase::complete(int)+0x1db) [0x7f1b6b]
 6: (SafeTimer::timer_thread()+0x3e5) [0x96f615]
 7: (SafeTimerThread::entry()+0xd) [0x9701ad]
 8: (()+0x7e9a) [0x7f92df97ae9a]
 9: (clone()+0x6d) [0x7f92de3412ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #1

Updated by Greg Farnum about 9 years ago

Hmm. Do we need to keep the damaged ranks as members of the "up" set, or could we do something simple to remove them from that? I like these asserts since a damaged MDS isn't participating and so isn't really "up".

Actions #2

Updated by John Spray about 9 years ago

  • Status changed from In Progress to Fix Under Review

It turns out it was already fine for an MDS to be 'in' but have no inst (that is the case when we do "ceph mds fail"). It was supposed to be impossible in this function because of an is_degraded() check at the start, but that check wasn't accounting for damaged ranks.

https://github.com/ceph/ceph/pull/4192
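
To illustrate the failure mode described above, here is a minimal toy model, not the actual Ceph code or the contents of the pull request. It assumes the fix amounts to making the degraded check also account for damaged ranks. ToyMDSMap, rank_t, is_degraded_old and the string "address" are hypothetical stand-ins; only get_inst, is_degraded and send_heartbeat echo names that appear in this report.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-in for the MDS map. All types here are illustrative assumptions,
// not Ceph's real data structures.
using rank_t = int;

struct ToyMDSMap {
    std::set<rank_t> in;               // ranks that are part of the cluster
    std::map<rank_t, std::string> up;  // ranks with a running daemon (rank -> address)
    std::set<rank_t> failed;           // ranks left without a daemon, e.g. after "ceph mds fail"
    std::set<rank_t> damaged;          // ranks marked damaged

    // Mirrors the assertion in the backtrace: only valid for ranks that have a daemon.
    const std::string& get_inst(rank_t m) const {
        assert(up.count(m));
        return up.at(m);
    }

    // Hypothetical "before" check: damaged ranks are not considered.
    bool is_degraded_old() const { return !failed.empty(); }

    // Hypothetical "after" check: damaged ranks also make the cluster degraded.
    bool is_degraded() const { return !failed.empty() || !damaged.empty(); }
};

// Sends a heartbeat to every other 'in' rank and returns how many were sent.
// The early return is what keeps get_inst() from being called on a rank
// that is 'in' but has no daemon.
int send_heartbeat(const ToyMDSMap& m, rank_t self) {
    if (m.is_degraded())
        return 0;
    int sent = 0;
    for (rank_t r : m.in) {
        if (r == self)
            continue;
        (void)m.get_inst(r);  // safe only because the map is not degraded
        ++sent;
    }
    return sent;
}
```

With rank 1 'in' and 'damaged' but absent from 'up', is_degraded_old() returns false, so a guard built on it would let the loop reach get_inst(1) and trip the assertion. The guard using is_degraded() skips the heartbeat instead.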

Actions #3

Updated by Greg Farnum about 9 years ago

  • Status changed from Fix Under Review to Resolved

Merged to master in commit:bd1d11f6eb8225c996bfc7ca00a2083cb9423b51
