Bug #1137 (closed): MDS Crash

Added by Damien Churchill almost 13 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

2011-06-03 11:41:30.568740 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds8 10.10.20.9:6800/3398 6184 ==== mds_resolve(1+0 subtrees +0 slave requests) v1 ==== 28+0+0 (400933449 0 0) 0x7fed4002da40 con 0x7fed40001140
2011-06-03 11:41:30.568762 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds7 10.10.20.10:6800/1577 99 ==== mds_resolve(1+0 subtrees +0 slave requests) v1 ==== 28+0+0 (4263357877 0 0) 0x7fed3822e380 con 0x2285a10
2011-06-03 11:41:30.568776 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds4 10.10.20.7:6800/1553 4 ==== mds_resolve(4+0 subtrees +0 slave requests) v1 ==== 196+0+0 (1349260996 0 0) 0x229eaa0 con 0x220a9d0
2011-06-03 11:41:30.568819 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds2 10.10.20.11:6800/2970 1 ==== mdsmap(e 1265) v1 ==== 2531+0+0 (569664708 0 0) 0x2322d50 con 0x7fed30200fb0
2011-06-03 11:41:30.568834 7fed4a157700 -- 10.10.20.3:6800/1520 <== mds2 10.10.20.11:6800/2970 2 ==== mds_resolve(2+0 subtrees +0 slave requests) v1 ==== 92+0+0 (1981521396 0 0) 0x22b4e90 con 0x7fed30200fb0
*** Caught signal (Segmentation fault) **
 in thread 0x7fed4a157700
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: /usr/bin/cmds() [0x740979]
 2: (()+0xfc60) [0x7fed4c396c60]
 3: (MDCache::get_subtree_root(CDir*)+0x7) [0x554ac7]
 4: (MDCache::adjust_bounded_subtree_auth(CDir*, std::set<CDir*, std::less<CDir*>, std::allocator<CDir*> >&, std::pair<int, int>)+0x692) [0x575b72]
 5: (MDCache::handle_resolve(MMDSResolve*)+0x6c5) [0x583015]
 6: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 7: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 8: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 9: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 10: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 11: (()+0x6d8c) [0x7fed4c38dd8c]
 12: (clone()+0x6d) [0x7fed4b24004d]
#1

Updated by Sage Weil almost 13 years ago

  • Category set to 1
  • Assignee set to Sage Weil
  • Target version set to v0.30

Does this happen each time you try to start cmds?

If so, can you add

debug mds = 20
debug ms = 1

to the [mds] section of ceph.conf, reproduce, and attach the logs?
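
For reference, a minimal sketch of what the relevant ceph.conf section might look like with these options in place (the exact surrounding layout is an assumption; keep any entries already in your [mds] section):

[mds]
        ; verbose MDS and messenger logging, per the request above (assumed placement)
        debug mds = 20
        debug ms = 1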

If there are cmds instances that are still running, also do

ceph mds tell \* injectargs '--debug-mds 20 --debug-ms 1'

and attach those logs too.

Thanks!

#2

Updated by Damien Churchill almost 13 years ago

Unfortunately that was just a one-off crash. I have now set debug-mds = 20 in the ceph configuration, though. I'm failing to bring up the cluster at the moment; in fact I'm getting a new crash on every startup by the looks of things.

mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)', in thread '0x7fa7d38b1700'
mds/MDCache.cc: 3522: FAILED assert(dnl->is_primary())
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 2: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 3: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 4: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 5: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 6: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 8: (()+0x6d8c) [0x7fa7d5ae7d8c]
 9: (clone()+0x6d) [0x7fa7d499a04d]
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 2: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 3: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 4: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 5: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 6: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 8: (()+0x6d8c) [0x7fa7d5ae7d8c]
 9: (clone()+0x6d) [0x7fa7d499a04d]
*** Caught signal (Aborted) **
 in thread 0x7fa7d38b1700
 ceph version 0.28.2 (commit:23242045db6b0ec87400441acbe0ea14eedbe6cc)
 1: /usr/bin/cmds() [0x740979]
 2: (()+0xfc60) [0x7fa7d5af0c60]
 3: (gsignal()+0x35) [0x7fa7d48e7d05]
 4: (abort()+0x186) [0x7fa7d48ebab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7d519e6dd]
 6: (()+0xb9926) [0x7fa7d519c926]
 7: (()+0xb9953) [0x7fa7d519c953]
 8: (()+0xb9a5e) [0x7fa7d519ca5e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x362) [0x721f02]
 10: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1af9) [0x5a17d9]
 11: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x15b) [0x5a253b]
 12: (MDS::handle_deferrable_message(Message*)+0x5df) [0x4ca4bf]
 13: (MDS::_dispatch(Message*)+0x11d2) [0x4d5c52]
 14: (MDS::ms_dispatch(Message*)+0x6d) [0x4d627d]
 15: (SimpleMessenger::dispatch_entry()+0x667) [0x4ad1c7]
 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a0bbc]
 17: (()+0x6d8c) [0x7fa7d5ae7d8c]
 18: (clone()+0x6d) [0x7fa7d499a04d]

I'll attach some logs too shortly.
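
For context on the abort path above, a minimal illustrative sketch in C++ (not the actual Ceph source; the names only mirror the frames in the trace, such as __ceph_assert_fail and abort) of how a failed assert like the one in handle_cache_rejoin_weak() becomes the "Caught signal (Aborted)" backtrace:

// Illustrative sketch only: a simplified stand-in for the assert machinery.
#include <cstdio>
#include <cstdlib>

static void assert_fail(const char* expr, const char* file, int line,
                        const char* func) {
    // Prints a report shaped like the one above: file, function, line, expression.
    std::fprintf(stderr, "%s: In function '%s'\n%s: %d: FAILED assert(%s)\n",
                 file, func, file, line, expr);
    std::abort();  // raises SIGABRT; a signal handler then dumps the backtrace
}

#define SKETCH_ASSERT(expr) \
    ((expr) ? (void)0 : assert_fail(#expr, __FILE__, __LINE__, __func__))

int main() {
    bool is_primary = false;  // stand-in for dnl->is_primary() in MDCache.cc
    SKETCH_ASSERT(is_primary);  // fails: prints the report, then aborts
    return 0;
}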

#3

Updated by Damien Churchill almost 13 years ago

Unfortunately, after adding the debug settings to the config, the crash stopped occurring, which is a nuisance.

#4

Updated by Sage Weil almost 13 years ago

  • Status changed from New to Can't reproduce

If this turns up again, let us know! I suspect it may be related to the rename journaling changes; I'll be testing fsstress vs mds restart to turn up any issues there.

#5

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.30)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
