Bug #1137 (closed): MDS Crash

Added by Damien Churchill almost 13 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

2011-06-03 11:41:30.568740 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds8 10.10.20.9:6800/3398 6184 ==== mds_resolve(1+0 subtrees +0 slave requests) v1 ==== 28+0+0 (400933449 0 0) 0x7fed4002da40 con 0x7fed40001140
2011-06-03 11:41:30.568762 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds7 10.10.20.10:6800/1577 99 ==== mds_resolve(1+0 subtrees +0 slave requests) v1 ==== 28+0+0 (4263357877 0 0) 0x7fed3822e380 con 0x2285a10
2011-06-03 11:41:30.568776 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds4 10.10.20.7:6800/1553 4 ==== mds_resolve(4+0 subtrees +0 slave requests) v1 ==== 196+0+0 (1349260996 0 0) 0x229eaa0 con 0x220a9d0
2011-06-03 11:41:30.568819 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds2 10.10.20.11:6800/2970 1 ==== mdsmap(e 1265) v1 ==== 2531+0+0 (569664708 0 0) 0x2322d50 con 0x7fed30200fb0
2011-06-03 11:41:30.568834 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds2 10.10.20.11:6800/2970 2 ==== mds_resolve(2+0 subtrees +0 slave requests) v1 ==== 92+0+0 (1981521396 0 0) 0x22b4e90 con 0x7fed30200fb0
*** Caught signal (Segmentation fault) **
 in thread 0x7fed4a157700
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: /usr/bin/cmds() [0x740979]
 2: (()+0xfc60) [0x7fed4c396c60]
 3: (MDCache::get_subtree_root(CDir*)+0x7) [0x554ac7]
 4: (MDCache::adjust_bounded_subtree_auth(CDir*, std::set<CDir*, std::less<CDir*>, std::allocator<CDir*> >&, std::pair<int, int>)+0x692) [0x575b72]
 5: (MDCache::handle_resolve(MMDSResolve*)+0x6c5) [0x583015]
 6: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 7: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 8: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 9: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 10: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 11: (()+0x6d8c) [0x7fed4c38dd8c]
 12: (clone()+0x6d) [0x7fed4b24004d]
#1

Updated by Sage Weil almost 13 years ago

  • Category set to 1
  • Assignee set to Sage Weil
  • Target version set to v0.30

Does this happen each time you try to start cmds?

If so, can you add

debug mds = 20
debug ms = 1

to the [mds] section of ceph.conf, reproduce, and attach the logs?
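
For reference, a minimal sketch of what the relevant ceph.conf section might look like with these options in place (the exact surrounding layout is an assumption; keep any entries already in your [mds] section):

[mds]
        ; verbose MDS and messenger logging, per the request above (assumed placement)
        debug mds = 20
        debug ms = 1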

If there are cmds instances that are still running, also do

ceph mds tell \* injectargs '--debug-mds 20 --debug-ms 1'

and attach those logs too.

Thanks!

#2

Updated by Damien Churchill almost 13 years ago

Unfortunately that was just a one-off crash. I have now set debug-mds = 20 in the ceph configuration, though. I'm failing to bring up the cluster at the moment; in fact I'm getting a new crash on every startup by the looks of things.

mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)', in thread '0x7fa7d38b1700'
mds/MDCache.cc: 3522: FAILED assert(dnl->is_primary())
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 2: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 3: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 4: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 5: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 6: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 8: (()+0x6d8c) [0x7fa7d5ae7d8c]
 9: (clone()+0x6d) [0x7fa7d499a04d]
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 2: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 3: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 4: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 5: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 6: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 8: (()+0x6d8c) [0x7fa7d5ae7d8c]
 9: (clone()+0x6d) [0x7fa7d499a04d]
*** Caught signal (Aborted) **
 in thread 0x7fa7d38b1700
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: /usr/bin/cmds() [0x740979]
 2: (()+0xfc60) [0x7fa7d5af0c60]
 3: (gsignal()+0x35) [0x7fa7d48e7d05]
 4: (abort()+0x186) [0x7fa7d48ebab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7d519e6dd]
 6: (()+0xb9926) [0x7fa7d519c926]
 7: (()+0xb9953) [0x7fa7d519c953]
 8: (()+0xb9a5e) [0x7fa7d519ca5e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x362) [0x721f02]
 10: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 11: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 12: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 13: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 14: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 15: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 17: (()+0x6d8c) [0x7fa7d5ae7d8c]
 18: (clone()+0x6d) [0x7fa7d499a04d]

I'll attach some logs too shortly.
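
For context on the abort path above, a minimal illustrative sketch in C++ (not the actual Ceph source; the names only mirror the frames in the trace, such as __ceph_assert_fail and abort) of how a failed assert like the one in handle_cache_rejoin_weak() becomes the "Caught signal (Aborted)" backtrace:

// Illustrative sketch only: a simplified stand-in for the assert machinery.
#include <cstdio>
#include <cstdlib>

static void assert_fail(const char* expr, const char* file, int line,
                        const char* func) {
    // Prints a report shaped like the one above: file, function, line, expression.
    std::fprintf(stderr, "%s: In function '%s'\n%s: %d: FAILED assert(%s)\n",
                 file, func, file, line, expr);
    std::abort();  // raises SIGABRT; a signal handler then dumps the backtrace
}

#define SKETCH_ASSERT(expr) \
    ((expr) ? (void)0 : assert_fail(#expr, __FILE__, __LINE__, __func__))

int main() {
    bool is_primary = false;  // stand-in for dnl->is_primary() in MDCache.cc
    SKETCH_ASSERT(is_primary);  // fails: prints the report, then aborts
    return 0;
}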

#3

Updated by Damien Churchill almost 13 years ago

Unfortunately, after adding the debug settings to the config, the crash stopped occurring, which is a nuisance.

#4

Updated by Sage Weil almost 13 years ago

  • Status changed from New to Can't reproduce

If this turns up again, let us know! I suspect it may be related to the rename journaling changes; I'll be testing fsstress vs mds restart to turn up any issues there.

#5

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.30)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
