Bug #1137
MDS Crash (Closed)
Description
2011-06-03 11:41:30.568740 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds8 10.10.20.9:6800/3398 6184 ==== mds_resolve(1+0 subtrees +0 slave requests) v1 ==== 28+0+0 (400933449 0 0) 0x7fed4002da40 con 0x7fed40001140
2011-06-03 11:41:30.568762 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds7 10.10.20.10:6800/1577 99 ==== mds_resolve(1+0 subtrees +0 slave requests) v1 ==== 28+0+0 (4263357877 0 0) 0x7fed3822e380 con 0x2285a10
2011-06-03 11:41:30.568776 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds4 10.10.20.7:6800/1553 4 ==== mds_resolve(4+0 subtrees +0 slave requests) v1 ==== 196+0+0 (1349260996 0 0) 0x229eaa0 con 0x220a9d0
2011-06-03 11:41:30.568819 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds2 10.10.20.11:6800/2970 1 ==== mdsmap(e 1265) v1 ==== 2531+0+0 (569664708 0 0) 0x2322d50 con 0x7fed30200fb0
2011-06-03 11:41:30.568834 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds2 10.10.20.11:6800/2970 2 ==== mds_resolve(2+0 subtrees +0 slave requests) v1 ==== 92+0+0 (1981521396 0 0) 0x22b4e90 con 0x7fed30200fb0
*** Caught signal (Segmentation fault) **
 in thread 0x7fed4a157700
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: /usr/bin/cmds() [0x740979]
 2: (()+0xfc60) [0x7fed4c396c60]
 3: (MDCache::get_subtree_root(CDir*)+0x7) [0x554ac7]
 4: (MDCache::adjust_bounded_subtree_auth(CDir*, std::set<CDir*, std::less<CDir*>, std::allocator<CDir*> >&, std::pair<int, int>)+0x692) [0x575b72]
 5: (MDCache::handle_resolve(MMDSResolve*)+0x6c5) [0x583015]
 6: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 7: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 8: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 9: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 10: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 11: (()+0x6d8c) [0x7fed4c38dd8c]
 12: (clone()+0x6d) [0x7fed4b24004d]
Updated by Sage Weil almost 13 years ago
- Category set to 1
- Assignee set to Sage Weil
- Target version set to v0.30
Does this happen each time you try to start cmds?
If so, can you add

debug mds = 20
debug ms = 1

to the [mds] section of ceph.conf, reproduce, and attach the logs?
If there are cmds instances that are still running, also do

ceph mds tell \* injectargs '--debug-mds 20 --debug-ms 1'

and attach those logs too.
Thanks!
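Putting Sage's suggestion together, the relevant piece of ceph.conf would look something like the fragment below (a sketch only; it shows just the two debug options named above, and any other settings your [mds] section already carries would of course stay in place):

```ini
[mds]
    ; verbose MDS and messenger logging for reproducing the crash
    debug mds = 20
    debug ms = 1
```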
Updated by Damien Churchill almost 13 years ago
Unfortunately that was just a one-off crash. I have now set debug-mds = 20 in the ceph configuration, though. I'm failing to bring up the cluster at the moment; in fact, I'm getting a new crash on every startup by the looks of things.
mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)', in thread '0x7fa7d38b1700'
mds/MDCache.cc: 3522: FAILED assert(dnl->is_primary())
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 2: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 3: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 4: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 5: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 6: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 8: (()+0x6d8c) [0x7fa7d5ae7d8c]
 9: (clone()+0x6d) [0x7fa7d499a04d]
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 2: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 3: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 4: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 5: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 6: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 8: (()+0x6d8c) [0x7fa7d5ae7d8c]
 9: (clone()+0x6d) [0x7fa7d499a04d]
*** Caught signal (Aborted) **
 in thread 0x7fa7d38b1700
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: /usr/bin/cmds() [0x740979]
 2: (()+0xfc60) [0x7fa7d5af0c60]
 3: (gsignal()+0x35) [0x7fa7d48e7d05]
 4: (abort()+0x186) [0x7fa7d48ebab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7d519e6dd]
 6: (()+0xb9926) [0x7fa7d519c926]
 7: (()+0xb9953) [0x7fa7d519c953]
 8: (()+0xb9a5e) [0x7fa7d519ca5e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x362) [0x721f02]
 10: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 11: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 12: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 13: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 14: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 15: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 17: (()+0x6d8c) [0x7fa7d5ae7d8c]
 18: (clone()+0x6d) [0x7fa7d499a04d]
I'll attach some logs too shortly.
Updated by Damien Churchill almost 13 years ago
Unfortunately, after adding the debug settings to the config, the crash stopped occurring, which is a nuisance.
Updated by Sage Weil almost 13 years ago
- Status changed from New to Can't reproduce
If this turns up again, let us know! I suspect it may be related to the rename journaling changes; I'll be testing fsstress vs mds restart to turn up any issues there.
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
- Target version deleted (v0.30)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.