Bug #1535 (closed): concurrent creating and removing directories crashes cmds

Added by John Leach over 12 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Low
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

I set up two clients with a mounted ceph filesystem and had one creating a hierarchy of empty directories in a loop while the other deleted them in a loop.
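
The test scripts boil down to roughly the following sketch (the mount point, directory names and depth shown here are illustrative guesses, not the exact values used):

import os
import shutil
import sys

ROOT = "/mnt/ceph/dirs"   # assumed CephFS mount point, shared by both clients

def create_loop():
    # client A: keep creating a small hierarchy of empty directories
    i = 0
    while True:
        try:
            os.makedirs(os.path.join(ROOT, "a%d" % (i % 100), "b", "c"))
        except OSError:
            pass  # path already exists, or the remover raced us
        i += 1

def remove_loop():
    # client B: keep deleting whatever hierarchy currently exists
    while True:
        shutil.rmtree(ROOT, ignore_errors=True)

if __name__ == "__main__":
    if sys.argv[1:] == ["remove"]:
        remove_loop()
    else:
        create_loop()

One client runs it with no arguments (the create loop), the other runs it with "remove" (the delete loop), both against the same mount.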

There are two mds servers, and one of them crashed:

2011-09-12 18:02:35.414589 7f1c719b4700 mds0.migrator nicely exporting to mds1 [dir 10000010809 /dirs/ [2,head] auth{1=1} pv=3421 v=3417 cv=0/0 ap=2+2+3 state=1610612738|complete f(v6 m2011-09-12 17:40:09.473169 4=0+4) n(v214 rc2011-09-12 18:02:34.405528 1260=0+1260)/n(v214 rc2011-09-12 18:02:34.347155 1261=0+1261) hs=4+1,ss=0+0 dirty=4 | child replicated dirty authpin 0x122395b8]
2011-09-12 18:02:45.183404 7f1c719b4700 mds0.bal   mds0 mdsload<[79.2128,0 79.2128]/[7.27142,0 7.27142], req 0, hr 0, qlen 0, cpu 0.25> = 64.8246 ~ 79.2128
2011-09-12 18:02:45.183454 7f1c719b4700 mds0.bal   mds1 mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.33> = 532.38 ~ 650.545
mds/MDCache.cc: In function 'void MDCache::handle_dentry_link(MDentryLink*)', in thread '0x7f1c719b4700'
mds/MDCache.cc: 9213: FAILED assert(dn)
 ceph version  (commit:)
 1: (MDCache::handle_dentry_link(MDentryLink*)+0x361) [0x56a1c1]
 2: (MDCache::dispatch(Message*)+0x175) [0x5aced5]
 3: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
 4: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 5: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 6: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 8: (()+0x69ca) [0x7f1c752339ca]
 9: (clone()+0x6d) [0x7f1c73cb870d]
 ceph version  (commit:)
 1: (MDCache::handle_dentry_link(MDentryLink*)+0x361) [0x56a1c1]
 2: (MDCache::dispatch(Message*)+0x175) [0x5aced5]
 3: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
 4: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 5: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 6: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 8: (()+0x69ca) [0x7f1c752339ca]
 9: (clone()+0x6d) [0x7f1c73cb870d]
*** Caught signal (Aborted) **
 in thread 0x7f1c719b4700
 ceph version  (commit:)
 1: /usr/bin/cmds() [0x794e74]
 2: (()+0xf8f0) [0x7f1c7523c8f0]
 3: (gsignal()+0x35) [0x7f1c73c05a75]
 4: (abort()+0x180) [0x7f1c73c095c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f1c744bb8e5]
 6: (()+0xcad16) [0x7f1c744b9d16]
 7: (()+0xcad43) [0x7f1c744b9d43]
 8: (()+0xcae3e) [0x7f1c744b9e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x39f) [0x716cdf]
 10: (MDCache::handle_dentry_link(MDentryLink*)+0x361) [0x56a1c1]
 11: (MDCache::dispatch(Message*)+0x175) [0x5aced5]
 12: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
 13: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 14: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 15: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 17: (()+0x69ca) [0x7f1c752339ca]
 18: (clone()+0x6d) [0x7f1c73cb870d]

The filesystem hung on both servers.

I restarted cmds on the crashed node and it crashed again during startup:

2011-09-12 18:15:36.018540 7ffb8e9c3700 mds0.3 reconnect_done
2011-09-12 18:15:36.027241 7ffb8e9c3700 mds0.3 handle_mds_map i am now mds0.3
2011-09-12 18:15:36.027259 7ffb8e9c3700 mds0.3 handle_mds_map state change up:reconnect --> up:rejoin
2011-09-12 18:15:36.027266 7ffb8e9c3700 mds0.3 rejoin_joint_start
2011-09-12 18:15:36.018540 7ffb8e9c3700 mds0.3 reconnect_done
2011-09-12 18:15:36.027241 7ffb8e9c3700 mds0.3 handle_mds_map i am now mds0.3
2011-09-12 18:15:36.027259 7ffb8e9c3700 mds0.3 handle_mds_map state change up:reconnect --> up:rejoin
2011-09-12 18:15:36.027266 7ffb8e9c3700 mds0.3 rejoin_joint_start
mds/MDCache.cc: In function 'CDir* MDCache::rejoin_invent_dirfrag(dirfrag_t)', in thread '0x7ffb8e9c3700'
mds/MDCache.cc: 3937: FAILED assert(in->is_dir())
 ceph version  (commit:)
 1: (MDCache::rejoin_invent_dirfrag(dirfrag_t)+0x219) [0x570a89]
 2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x418e) [0x590b0e]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1b3) [0x5a55f3]
 4: (MDCache::dispatch(Message*)+0x105) [0x5ace65]
 5: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
 6: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 7: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 8: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 10: (()+0x69ca) [0x7ffb922429ca]
 11: (clone()+0x6d) [0x7ffb90cc770d]
 ceph version  (commit:)
 1: (MDCache::rejoin_invent_dirfrag(dirfrag_t)+0x219) [0x570a89]
 2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x418e) [0x590b0e]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1b3) [0x5a55f3]
 4: (MDCache::dispatch(Message*)+0x105) [0x5ace65]
 5: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
 6: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 7: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 8: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 10: (()+0x69ca) [0x7ffb922429ca]
 11: (clone()+0x6d) [0x7ffb90cc770d]
*** Caught signal (Aborted) **
 in thread 0x7ffb8e9c3700
 ceph version  (commit:)
 1: /usr/bin/cmds() [0x794e74]
 2: (()+0xf8f0) [0x7ffb9224b8f0]
 3: (gsignal()+0x35) [0x7ffb90c14a75]
 4: (abort()+0x180) [0x7ffb90c185c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7ffb914ca8e5]
 6: (()+0xcad16) [0x7ffb914c8d16]
 7: (()+0xcad43) [0x7ffb914c8d43]
 8: (()+0xcae3e) [0x7ffb914c8e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x39f) [0x716cdf]
 10: (MDCache::rejoin_invent_dirfrag(dirfrag_t)+0x219) [0x570a89]
 11: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x418e) [0x590b0e]
 12: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1b3) [0x5a55f3]
 13: (MDCache::dispatch(Message*)+0x105) [0x5ace65]
 14: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
 15: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 16: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 17: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 18: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 19: (()+0x69ca) [0x7ffb922429ca]
 20: (clone()+0x6d) [0x7ffb90cc770d]

The filesystem stayed stuck on the clients.

I stopped cmds on the other node and then started it on both nodes, and they started successfully this time. Within a minute the filesystem on the clients started working again.

I started the test scripts again and cmds crashed almost immediately (on the other node this time) with a different error:

mds/CInode.cc: In function 'virtual void CInode::auth_unpin(void*)', in thread '0x7f084ddab700'
mds/CInode.cc: 1946: FAILED assert(auth_pins >= 0)
 ceph version  (commit:)
 1: (CInode::auth_unpin(void*)+0x49e) [0x66dd7e]
 2: (Locker::eval_gather(SimpleLock*, bool, bool*, std::list<Context*, std::allocator<Context*> >*)+0x4ef) [0x5e4c1f]
 3: (Locker::handle_file_lock(ScatterLock*, MLock*)+0xf22) [0x5f6582]
 4: (Locker::handle_lock(MLock*)+0x1e6) [0x5f6be6]
 5: (MDS::handle_deferrable_message(Message*)+0x62f) [0x4a335f]
 6: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 7: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 8: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 10: (()+0x69ca) [0x7f085162a9ca]
 11: (clone()+0x6d) [0x7f08500af70d]
 ceph version  (commit:)
 1: (CInode::auth_unpin(void*)+0x49e) [0x66dd7e]
 2: (Locker::eval_gather(SimpleLock*, bool, bool*, std::list<Context*, std::allocator<Context*> >*)+0x4ef) [0x5e4c1f]
 3: (Locker::handle_file_lock(ScatterLock*, MLock*)+0xf22) [0x5f6582]
 4: (Locker::handle_lock(MLock*)+0x1e6) [0x5f6be6]
 5: (MDS::handle_deferrable_message(Message*)+0x62f) [0x4a335f]
 6: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 7: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 8: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 10: (()+0x69ca) [0x7f085162a9ca]
 11: (clone()+0x6d) [0x7f08500af70d]
*** Caught signal (Aborted) **
 in thread 0x7f084ddab700
 ceph version  (commit:)
 1: /usr/bin/cmds() [0x794e74]
 2: (()+0xf8f0) [0x7f08516338f0]
 3: (gsignal()+0x35) [0x7f084fffca75]
 4: (abort()+0x180) [0x7f08500005c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f08508b28e5]
 6: (()+0xcad16) [0x7f08508b0d16]
 7: (()+0xcad43) [0x7f08508b0d43]
 8: (()+0xcae3e) [0x7f08508b0e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x39f) [0x716cdf]
 10: (CInode::auth_unpin(void*)+0x49e) [0x66dd7e]
 11: (Locker::eval_gather(SimpleLock*, bool, bool*, std::list<Context*, std::allocator<Context*> >*)+0x4ef) [0x5e4c1f]
 12: (Locker::handle_file_lock(ScatterLock*, MLock*)+0xf22) [0x5f6582]
 13: (Locker::handle_lock(MLock*)+0x1e6) [0x5f6be6]
 14: (MDS::handle_deferrable_message(Message*)+0x62f) [0x4a335f]
 15: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
 16: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
 17: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
 18: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
 19: (()+0x69ca) [0x7f085162a9ca]
 20: (clone()+0x6d) [0x7f08500af70d]

I repeated the test another time and hit the same auth_unpin error.

This is with the latest code from master.


Files

mds.0.log.gz (15.6 KB), John Leach, 09/12/2011 02:58 PM
mds.1.log.gz (60.6 KB), John Leach, 09/12/2011 02:58 PM

Actions #1

Updated by John Leach over 12 years ago

Logs from both mds servers, from startup through to the crash of one node (and then the shutdown of the other).

debug ms = 5

Actions #2

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
Actions #3

Updated by Greg Farnum about 11 years ago

  • Priority changed from Normal to Low

De-prioritizing multi-MDS bugs at this time.

Actions #4

Updated by Zheng Yan about 11 years ago

  • Status changed from New to Resolved

I think this has been fixed by commit 00025462.
