Bug #4390
closed
mds: zapping named mds causes client assertion
Added by Sam Lang about 11 years ago.
Updated almost 8 years ago.
Description
Hit the following assertion on the client with backtrace testing:
../../src/mds/MDSMap.h: In function 'const entity_inst_t MDSMap::get_inst(int)' thread 7f565b911700 time 2013-03-07 17:32:14.584546
../../src/mds/MDSMap.h: 466: FAILED assert(up.count(m))
ceph version 0.56-1060-gf907468 (f907468bbadf129a66c9bf07b854eec2beca1a2d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x7f565e2ef101]
2: (MDSMap::get_inst(int)+0x51) [0x7f565e11f837]
3: (Client::send_cap(Inode*, int, Cap*, int, int, int, int)+0xa48) [0x7f565e0df554]
4: (Client::check_caps(Inode*, bool)+0xc3c) [0x7f565e0e02b4]
5: (Client::tick()+0x41b) [0x7f565e0ef20d]
6: (C_C_Tick::finish(int)+0x1f) [0x7f565e12724f]
7: (SafeTimer::timer_thread()+0x36b) [0x7f565e2dea61]
8: (SafeTimerThread::entry()+0x1c) [0x7f565e2dff14]
9: (Thread::_entry_func(void*)+0x23) [0x7f565e2dbef5]
10: (()+0x7e9a) [0x7f5673658e9a]
11: (clone()+0x6d) [0x7f5672a6bcbd]
The problem seems to be in the unique name enforcement code (2e112333). A beacon comes in from the new mds and zaps the old mds from the mdsmap up list, but the new mds isn't added to the up list itself until the next tick. This leaves a window where the mdsmap can have no members in up. If that instance of the map is sent to the client, the client hits the above assertion.
Proposed fix in wip-4390. Should we also clean up the client code to wait until the mdsmap contains up members? Separate bug?
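To make the window concrete, here is a rough toy model of the sequence described above. It is only a sketch: ToyMDSMap, ToyMonitor, handle_beacon, tick and replay_zap_window are made-up names for illustration, not the real monitor code, and the assumption that a map is published after every change is a simplification.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for the mdsmap; not Ceph's real MDSMap.
struct ToyMDSMap {
  int epoch = 0;
  std::map<int, std::string> up;  // rank -> daemon name
};

// Hypothetical monitor model that publishes a map after every change.
struct ToyMonitor {
  ToyMDSMap map;
  std::vector<std::string> pending;  // daemons waiting to be given a rank
  std::vector<ToyMDSMap> published;  // every epoch a client could observe

  void publish() {
    ++map.epoch;
    published.push_back(map);
  }

  // A beacon from a new daemon reusing an existing name zaps the old
  // daemon right away, but the new one only joins 'up' on the next tick.
  void handle_beacon(const std::string& name) {
    for (auto it = map.up.begin(); it != map.up.end();) {
      if (it->second == name)
        it = map.up.erase(it);
      else
        ++it;
    }
    pending.push_back(name);
    publish();
  }

  // Assign each pending daemon the lowest free rank.
  void tick() {
    int rank = 0;
    for (const auto& name : pending) {
      while (map.up.count(rank)) ++rank;
      map.up[rank] = name;
    }
    pending.clear();
    publish();
  }
};

// Replay the sequence from the description and return the published maps.
std::vector<ToyMDSMap> replay_zap_window() {
  ToyMonitor mon;
  mon.map.up[0] = "a";
  mon.publish();           // first map: rank 0 is up
  mon.handle_beacon("a");  // second map: old "a" zapped, new "a" not up yet
  mon.tick();              // third map: new "a" takes rank 0
  return mon.published;
}
```

In this model the second published map has an empty up set, which is the instance a client would trip over if it looked up rank 0.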
- Status changed from New to Fix Under Review
That approach was breaking the monitor. Just pushed a new approach that queues the zap for later.
pushed wip-4390-b, which solves this on the client side.
i don't really want to delay the mark-down/failing in the mon because then we have 2 mons of the same name for a brief period, which confuses the invariant(s). could push the standby into the prepare_beacon method, but there is a lot of other logic there that would need to be duplicated, so that's annoying too. maybe some other time.
in the meantime, the client should be able to handle this situation, if only because the user can type 'ceph mds fail 0'. the branch cleans up a ton of code surrounding the mds session handling and also migrates to using a Connection* at the same time. this should resolve the original symptom.
will run it through the fs suite as soon as gitbuilder catches up
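for illustration only, the general shape of a client-side guard might look like the sketch below. this is not the code in wip-4390-b; GuardedMDSMap and try_get_inst are hypothetical names, and the map is reduced to a rank -> address table. the point is just that the caller checks whether the rank is up before asking for its address, instead of running into the assert.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy map, reduced to rank -> address; not Ceph's real MDSMap.
struct GuardedMDSMap {
  std::map<int, std::string> up;

  bool is_up(int m) const { return up.count(m) > 0; }

  // Mirrors the assert(up.count(m)) that fired in the backtrace.
  const std::string& get_inst(int m) const {
    assert(up.count(m));
    return up.at(m);
  }
};

// Hypothetical guard: report failure instead of asserting when the rank
// is not up, so the caller can retry once a newer map arrives.
bool try_get_inst(const GuardedMDSMap& map, int m, std::string* out) {
  if (!map.is_up(m)) return false;
  *out = map.get_inst(m);
  return true;
}
```

a caller in the position of send_cap would then skip or defer the message when try_get_inst returns false, and pick it up again when a later map shows the rank as up.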
- Assignee changed from Sam Lang to Sage Weil
ran this through the fs suite and it passed. i would expect breakage in mds thrashing and multimds situations, though; no coverage for that yet.
- Priority changed from Normal to High
- Status changed from Fix Under Review to Resolved
commit:f67596a44739e8071cc97fb0463f37203502faaa