Bug #4390
closed
mds: zapping named mds causes client assertion
Description
Hit the following assertion on the client during testing (backtrace below):
../../src/mds/MDSMap.h: In function 'const entity_inst_t MDSMap::get_inst(int)' thread 7f565b911700 time 2013-03-07 17:32:14.584546
../../src/mds/MDSMap.h: 466: FAILED assert(up.count(m))
ceph version 0.56-1060-gf907468 (f907468bbadf129a66c9bf07b854eec2beca1a2d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x7f565e2ef101]
2: (MDSMap::get_inst(int)+0x51) [0x7f565e11f837]
3: (Client::send_cap(Inode*, int, Cap*, int, int, int, int)+0xa48) [0x7f565e0df554]
4: (Client::check_caps(Inode*, bool)+0xc3c) [0x7f565e0e02b4]
5: (Client::tick()+0x41b) [0x7f565e0ef20d]
6: (C_C_Tick::finish(int)+0x1f) [0x7f565e12724f]
7: (SafeTimer::timer_thread()+0x36b) [0x7f565e2dea61]
8: (SafeTimerThread::entry()+0x1c) [0x7f565e2dff14]
9: (Thread::_entry_func(void*)+0x23) [0x7f565e2dbef5]
10: (()+0x7e9a) [0x7f5673658e9a]
11: (clone()+0x6d) [0x7f5672a6bcbd]
The problem seems to be in the unique name enforcement code (2e112333). A beacon comes in from the new mds and zaps the old mds from the mdsmap up list, but the new mds isn't added to the up list until the next tick. This leaves a window in which the mdsmap can have no members in up. If that instance of the map is sent to the client, the client hits the above assertion.
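To make the window concrete, here is a minimal toy model of the sequence described above. It is only a sketch: ToyMdsMap, ToyMonitor, zap_old, add_new, require_up and epochs_missing_rank are hypothetical names invented for illustration, not the real MDSMap or monitor code, and it assumes every change publishes a new map epoch that a client could receive.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy stand-in for the real MDSMap: it only tracks which ranks are up.
struct ToyMdsMap {
    unsigned epoch = 0;
    std::set<int> up;  // ranks currently marked up
};

// Mirrors the failing check in the backtrace: aborts if the rank is not up.
inline int require_up(const ToyMdsMap& m, int rank) {
    assert(m.up.count(rank));
    return rank;
}

// Hypothetical monitor model: every change publishes a new map epoch.
struct ToyMonitor {
    ToyMdsMap current;
    std::vector<ToyMdsMap> published;  // every epoch a client could receive

    void publish() {
        ++current.epoch;
        published.push_back(current);
    }

    // Beacon from a same-named replacement: the old daemon is removed at once.
    void zap_old(int rank) {
        current.up.erase(rank);
        publish();
    }

    // The replacement only joins "up" on a later tick.
    void add_new(int rank) {
        current.up.insert(rank);
        publish();
    }
};

// Count the published epochs in which `rank` is missing from "up".
// A client calling require_up() on any of these epochs would abort.
inline int epochs_missing_rank(const ToyMonitor& mon, int rank) {
    int n = 0;
    for (const ToyMdsMap& m : mon.published)
        if (m.up.count(rank) == 0) ++n;
    return n;
}
```

In this model, calling add_new(0), zap_old(0), add_new(0) in that order publishes three epochs, and the middle one has an empty up set. That middle epoch corresponds to the map instance that reaches the client in this report.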
Updated by Sam Lang about 11 years ago
Proposed fix in wip-4390. Should we also clean up the client code to wait until the mdsmap contains up members? Separate bug?
Updated by Sam Lang about 11 years ago
- Status changed from New to Fix Under Review
Updated by Sam Lang about 11 years ago
That approach was breaking the monitor. Just pushed a new approach that queues the zap for later.
Updated by Sage Weil about 11 years ago
pushed wip-4390-b, which solves this on the client side.
i don't really want to delay the mark-down/failing in the mon because then we have 2 mons of the same name for a brief period, which confuses the invariant(s). could push the standby into the prepare_beacon method, but there is a lot of other logic there that would need to be duplicated, so that's annoying too. maybe some other time.
in the meantime, the client should be able to handle this situation, if only because the user can type 'ceph mds fail 0'. the branch cleans up a ton of code surrounding the mds session handling and also migrates to using a Connection* at the same time. this should resolve the original symptom.
will run it through the fs suite as soon as gitbuilder catches up
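For illustration only, a client can tolerate a rank that is missing from the up set by checking before it sends, instead of asserting. This sketch is not the change in wip-4390-b (which reworks the mds session handling and moves to Connection*). ClientMapSketch, find_inst and send_cap_if_up are hypothetical names, and the address string is made up.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy map: rank -> address of the daemon holding it (hypothetical layout).
struct ClientMapSketch {
    std::map<int, std::string> up;

    // Strict lookup, like the asserting get_inst() in the backtrace.
    const std::string& get_inst(int rank) const {
        std::map<int, std::string>::const_iterator it = up.find(rank);
        assert(it != up.end());
        return it->second;
    }

    // Tolerant lookup: returns nullptr when the rank is not up.
    const std::string* find_inst(int rank) const {
        std::map<int, std::string>::const_iterator it = up.find(rank);
        return it == up.end() ? nullptr : &it->second;
    }
};

// Hypothetical sender: returns true if the message was "sent" (recorded in
// sent_to), false if the rank was down and the send was deferred.
inline bool send_cap_if_up(const ClientMapSketch& m, int rank,
                           std::string& sent_to) {
    const std::string* inst = m.find_inst(rank);
    if (!inst)
        return false;  // rank is down: skip now, retry once a newer map arrives
    sent_to = *inst;
    return true;
}
```

With an empty up set, send_cap_if_up returns false and leaves sent_to untouched. Once the rank appears in a later map, the same call succeeds. This is the same situation a user can trigger by hand with 'ceph mds fail 0'.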
Updated by Sage Weil about 11 years ago
- Assignee changed from Sam Lang to Sage Weil
Updated by Sage Weil about 11 years ago
ran this through the fs suite and it passed. i would expect breakage in mds thrashing and multimds situations, though; no coverage for that yet.
Updated by Sage Weil about 11 years ago
- Status changed from Fix Under Review to Resolved
commit:f67596a44739e8071cc97fb0463f37203502faaa