Bug #4390

closed

mds: zapping named mds causes client assertion

Added by Sam Lang about 11 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hit the following assertion on the client with backtrace testing:

../../src/mds/MDSMap.h: In function 'const entity_inst_t MDSMap::get_inst(int)' thread 7f565b911700 time 2013-03-07 17:32:14.584546
../../src/mds/MDSMap.h: 466: FAILED assert(up.count(m))
ceph version 0.56-1060-gf907468 (f907468bbadf129a66c9bf07b854eec2beca1a2d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x7f565e2ef101]
2: (MDSMap::get_inst(int)+0x51) [0x7f565e11f837]
3: (Client::send_cap(Inode*, int, Cap*, int, int, int, int)+0xa48) [0x7f565e0df554]
4: (Client::check_caps(Inode*, bool)+0xc3c) [0x7f565e0e02b4]
5: (Client::tick()+0x41b) [0x7f565e0ef20d]
6: (C_C_Tick::finish(int)+0x1f) [0x7f565e12724f]
7: (SafeTimer::timer_thread()+0x36b) [0x7f565e2dea61]
8: (SafeTimerThread::entry()+0x1c) [0x7f565e2dff14]
9: (Thread::_entry_func(void*)+0x23) [0x7f565e2dbef5]
10: (()+0x7e9a) [0x7f5673658e9a]
11: (clone()+0x6d) [0x7f5672a6bcbd]

The problem seems to be in the unique name enforcement code (2e112333). A beacon comes in from the new mds and zaps the old mds from the mdsmap up list, but the new mds isn't added to the up list itself until the next tick. This results in a window where the mdsmap can have no members in up, that instance of the map is sent to the client, and the client hits the above assertion.
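To illustrate the window described above, here is a minimal sketch using a toy map. ToyMDSMap, zap_same_name, and promote_on_tick are hypothetical names made up for this example and are not Ceph's MDSMap or monitor code. It only models the ordering: the old daemon is removed from up when the beacon arrives, and the new one is added on a later tick.

```cpp
#include <map>
#include <set>
#include <string>

// Toy stand-in for the mdsmap (assumption: not Ceph's real MDSMap class).
struct ToyMDSMap {
    std::set<int> up;                    // ranks currently up
    std::map<int, std::string> name_of;  // rank -> daemon name

    bool is_up(int rank) const { return up.count(rank) > 0; }
};

// Step 1 (beacon handling): a new daemon arrives with a name that is already
// in use, so the old holder of that name is dropped from `up` right away.
inline void zap_same_name(ToyMDSMap& m, const std::string& name) {
    for (auto it = m.name_of.begin(); it != m.name_of.end();) {
        if (it->second == name) {
            m.up.erase(it->first);
            it = m.name_of.erase(it);
        } else {
            ++it;
        }
    }
}

// Step 2 (next tick): the new daemon is finally placed in `up`.
inline void promote_on_tick(ToyMDSMap& m, int rank, const std::string& name) {
    m.up.insert(rank);
    m.name_of[rank] = name;
}
```

Any map published between zap_same_name and promote_on_tick has an empty up set, which is the state a client lookup equivalent to assert(up.count(m)) cannot survive.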

Actions #1

Updated by Sam Lang about 11 years ago

Proposed fix in wip-4390. Should we also clean up the client code to wait until the mdsmap contains up members? Separate bug?

Actions #2

Updated by Sam Lang about 11 years ago

  • Status changed from New to Fix Under Review
Actions #3

Updated by Sam Lang about 11 years ago

That approach was breaking the monitor. Just pushed a new approach that queues the zap for later.

Actions #4

Updated by Sage Weil about 11 years ago

pushed wip-4390-b, which solves this on the client side.

i don't really want to delay the mark-down/failing in the mon because then we have 2 mons of the same name for a brief period, which confuses the invariant(s). could push the standby into the prepare_beacon method, but there is a lot of other logic there that would need to be duplicated, so that's annoying too. maybe some other time.

in the meantime, the client should be able to handle this situation, if only because the user can type 'ceph mds fail 0'. the branch cleans up a ton of code surrounding the mds session handling and also migrates to using a Connection* at the same time. this should resolve the original symptom.
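As a rough sketch of what "the client should be able to handle this situation" could look like, the toy client below defers a cap message when the target rank is not in up, rather than asserting, and retries when a newer map arrives. ToyMap, ToyClient, and their members are hypothetical and chosen only for illustration. This is not the code in wip-4390-b, which reworks session handling and moves to Connection*.

```cpp
#include <set>
#include <vector>

// Toy map holding only the set of up ranks (assumption: not Ceph's MDSMap).
struct ToyMap {
    std::set<int> up;
};

// Toy client (assumption: not Ceph's Client class).
struct ToyClient {
    ToyMap mdsmap;
    std::vector<int> deferred;  // ranks with a cap message waiting for a newer map
    std::vector<int> sent;      // ranks a cap message was actually sent to

    // Returns true if the message was sent, false if it was deferred
    // because the rank is not currently up.
    bool send_cap(int rank) {
        if (mdsmap.up.count(rank) == 0) {
            deferred.push_back(rank);
            return false;
        }
        sent.push_back(rank);
        return true;
    }

    // Install a newer map and retry anything that was deferred.
    void handle_new_map(const ToyMap& m) {
        mdsmap = m;
        std::vector<int> pending;
        pending.swap(deferred);
        for (int rank : pending) {
            send_cap(rank);  // re-deferred automatically if still not up
        }
    }
};
```

With this shape, a map with no up members (for example after 'ceph mds fail 0') leaves the message queued until a later map lists the rank again.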

will run it through the fs suite as soon as gitbuilder catches up

Actions #5

Updated by Sage Weil about 11 years ago

  • Assignee changed from Sam Lang to Sage Weil
Actions #6

Updated by Sage Weil about 11 years ago

ran this through the fs suite and it passed. i would expect breakage in mds thrashing and multimds situations, though; no coverage for that yet.

Actions #7

Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to High
Actions #8

Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to Resolved

commit:f67596a44739e8071cc97fb0463f37203502faaa

Actions #9

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added