Bug #1137
closed
Added by Damien Churchill almost 13 years ago.
Updated over 7 years ago.
Description
2011-06-03 11:41:30.568740 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds8 10.10.20.9:6800/3398 6184 ==== mds_resolve(1+0 subtrees +0 slave requests) v1 ==== 28+0+0 (400933449 0 0) 0x7fed4002da40 con 0x7fed40001140
2011-06-03 11:41:30.568762 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds7 10.10.20.10:6800/1577 99 ==== mds_resolve(1+0 subtrees +0 slave requests) v1 ==== 28+0+0 (4263357877 0 0) 0x7fed3822e380 con 0x2285a10
2011-06-03 11:41:30.568776 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds4 10.10.20.7:6800/1553 4 ==== mds_resolve(4+0 subtrees +0 slave requests) v1 ==== 196+0+0 (1349260996 0 0) 0x229eaa0 con 0x220a9d0
2011-06-03 11:41:30.568819 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds2 10.10.20.11:6800/2970 1 ==== mdsmap(e 1265) v1 ==== 2531+0+0 (569664708 0 0) 0x2322d50 con 0x7fed30200fb0
2011-06-03 11:41:30.568834 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds2 10.10.20.11:6800/2970 2 ==== mds_resolve(2+0 subtrees +0 slave requests) v1 ==== 92+0+0 (1981521396 0 0) 0x22b4e90 con 0x7fed30200fb0
*** Caught signal (Segmentation fault) **
in thread 0x7fed4a157700
ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
1: /usr/bin/cmds() [0x740979]
2: (()+0xfc60) [0x7fed4c396c60]
3: (MDCache::get_subtree_root(CDir*)+0x7) [0x554ac7]
4: (MDCache::adjust_bounded_subtree_auth(CDir*, std::set<CDir*, std::less<CDir*>, std::allocator<CDir*> >&, std::pair<int, int>)+0x692) [0x575b72]
5: (MDCache::handle_resolve(MMDSResolve*)+0x6c5) [0x583015]
6: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
7: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
8: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
9: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
10: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
11: (()+0x6d8c) [0x7fed4c38dd8c]
12: (clone()+0x6d) [0x7fed4b24004d]
- Category set to 1
- Assignee set to Sage Weil
- Target version set to v0.30
Does this happen each time you try to start cmds?
If so, can you add
debug mds = 20
debug ms = 1
to the [mds] section of ceph.conf, reproduce, and attach the logs?
If there are cmds instances still running, also run
ceph mds tell \* injectargs '--debug-mds 20 --debug-ms 1'
and attach those logs too.
Thanks!
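Applied to ceph.conf, those two debug settings would sit in the [mds] section roughly like this (a minimal sketch; only the two debug lines come from the comment above, any other MDS options you already have would remain alongside them):

```ini
[mds]
    ; raise MDS and messenger log verbosity for this reproduction
    debug mds = 20
    debug ms = 1
```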
Unfortunately that was just a one-off crash. I have now set debug mds = 20 in the ceph configuration, though. I'm failing to bring up the cluster at the moment; in fact, by the looks of things I'm getting a new crash on every start-up.
mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)', in thread '0x7fa7d38b1700'
mds/MDCache.cc: 3522: FAILED assert(dnl->is_primary())
ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
1: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
2: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
3: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
4: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
5: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
6: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
8: (()+0x6d8c) [0x7fa7d5ae7d8c]
9: (clone()+0x6d) [0x7fa7d499a04d]
ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
1: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
2: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
3: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
4: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
5: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
6: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
8: (()+0x6d8c) [0x7fa7d5ae7d8c]
9: (clone()+0x6d) [0x7fa7d499a04d]
*** Caught signal (Aborted) **
in thread 0x7fa7d38b1700
ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
1: /usr/bin/cmds() [0x740979]
2: (()+0xfc60) [0x7fa7d5af0c60]
3: (gsignal()+0x35) [0x7fa7d48e7d05]
4: (abort()+0x186) [0x7fa7d48ebab6]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7d519e6dd]
6: (()+0xb9926) [0x7fa7d519c926]
7: (()+0xb9953) [0x7fa7d519c953]
8: (()+0xb9a5e) [0x7fa7d519ca5e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x362) [0x721f02]
10: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
11: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
12: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
13: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
14: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
15: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
17: (()+0x6d8c) [0x7fa7d5ae7d8c]
18: (clone()+0x6d) [0x7fa7d499a04d]
I'll attach some logs too shortly.
Unfortunately, after adding the debug settings to the config the crash stopped occurring, which is a nuisance.
- Status changed from New to Can't reproduce
If this turns up again, let us know! I suspect it may be related to the rename journaling changes; I'll be testing fsstress vs mds restart to turn up any issues there.
- Project changed from Ceph to CephFS
- Category deleted (1)
- Target version deleted (v0.30)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.