Bug #1535
Concurrent creating and removing directories crashes cmds
Status: Resolved
Priority: Low
Assignee: -
Category: -
Target version: -
% Done: 0%
Description
I set up two clients with a mounted Ceph filesystem. One client created a hierarchy of empty directories in a loop while the other deleted them in a loop.
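The original test scripts are not attached, so the following is only a rough sketch of that kind of workload, not the scripts actually used. The function names, the tree depth and width, and the use of two threads on one machine (instead of two separate clients) are all assumptions made for illustration.

```python
import os
import shutil
import tempfile
import threading


def create_tree(root, depth=3, width=4):
    """Create a hierarchy of empty directories under root.

    Returns the number of directories successfully requested.
    depth and width are hypothetical parameters, not from the report.
    """
    count = 0
    level = [root]
    for _ in range(depth):
        next_level = []
        for parent in level:
            for i in range(width):
                path = os.path.join(parent, "d%d" % i)
                try:
                    os.makedirs(path, exist_ok=True)
                except OSError:
                    # A parent may have been removed by the other loop.
                    continue
                next_level.append(path)
                count += 1
        level = next_level
    return count


def remove_tree(root):
    """Remove everything below root, ignoring entries that vanish mid-walk."""
    try:
        names = os.listdir(root)
    except OSError:
        return
    for name in names:
        shutil.rmtree(os.path.join(root, name), ignore_errors=True)


def run_loop(fn, stop, *args):
    """Call fn(*args) repeatedly until the stop event is set."""
    while not stop.is_set():
        fn(*args)


def stress(root, seconds=1.0):
    """Run the create loop and the remove loop concurrently for a while."""
    stop = threading.Event()
    workers = [
        threading.Thread(target=run_loop, args=(create_tree, stop, root)),
        threading.Thread(target=run_loop, args=(remove_tree, stop, root)),
    ]
    for w in workers:
        w.start()
    stop.wait(seconds)  # nothing sets the event early, so this is a timed sleep
    stop.set()
    for w in workers:
        w.join()
```

To approximate the setup described here, one client would run only the `create_tree` loop and the other only the `remove_tree` loop, both pointed at the same directory on the mounted filesystem (for example `/dirs` under the mount point; the mount path itself is not given in this report).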
The cluster had two MDS servers, and one of them crashed:
2011-09-12 18:02:35.414589 7f1c719b4700 mds0.migrator nicely exporting to mds1 [dir 10000010809 /dirs/ [2,head] auth{1=1} pv=3421 v=3417 cv=0/0 ap=2+2+3 state=1610612738|complete f(v6 m2011-09-12 17:40:09.473169 4=0+4) n(v214 rc2011-09-12 18:02:34.405528 1260=0+1260)/n(v214 rc2011-09-12 18:02:34.347155 1261=0+1261) hs=4+1,ss=0+0 dirty=4 | child replicated dirty authpin 0x122395b8]
2011-09-12 18:02:45.183404 7f1c719b4700 mds0.bal mds0 mdsload<[79.2128,0 79.2128]/[7.27142,0 7.27142], req 0, hr 0, qlen 0, cpu 0.25> = 64.8246 ~ 79.2128
2011-09-12 18:02:45.183454 7f1c719b4700 mds0.bal mds1 mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.33> = 532.38 ~ 650.545
mds/MDCache.cc: In function 'void MDCache::handle_dentry_link(MDentryLink*)', in thread '0x7f1c719b4700'
mds/MDCache.cc: 9213: FAILED assert(dn)
ceph version (commit:)
1: (MDCache::handle_dentry_link(MDentryLink*)+0x361) [0x56a1c1]
2: (MDCache::dispatch(Message*)+0x175) [0x5aced5]
3: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
4: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
5: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
6: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
8: (()+0x69ca) [0x7f1c752339ca]
9: (clone()+0x6d) [0x7f1c73cb870d]
ceph version (commit:)
1: (MDCache::handle_dentry_link(MDentryLink*)+0x361) [0x56a1c1]
2: (MDCache::dispatch(Message*)+0x175) [0x5aced5]
3: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
4: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
5: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
6: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
8: (()+0x69ca) [0x7f1c752339ca]
9: (clone()+0x6d) [0x7f1c73cb870d]
*** Caught signal (Aborted) **
in thread 0x7f1c719b4700
ceph version (commit:)
1: /usr/bin/cmds() [0x794e74]
2: (()+0xf8f0) [0x7f1c7523c8f0]
3: (gsignal()+0x35) [0x7f1c73c05a75]
4: (abort()+0x180) [0x7f1c73c095c0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f1c744bb8e5]
6: (()+0xcad16) [0x7f1c744b9d16]
7: (()+0xcad43) [0x7f1c744b9d43]
8: (()+0xcae3e) [0x7f1c744b9e3e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x39f) [0x716cdf]
10: (MDCache::handle_dentry_link(MDentryLink*)+0x361) [0x56a1c1]
11: (MDCache::dispatch(Message*)+0x175) [0x5aced5]
12: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
13: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
14: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
15: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
17: (()+0x69ca) [0x7f1c752339ca]
18: (clone()+0x6d) [0x7f1c73cb870d]
The filesystem then hung on both clients.
I restarted cmds on the crashed node and it crashed again during startup:
2011-09-12 18:15:36.018540 7ffb8e9c3700 mds0.3 reconnect_done
2011-09-12 18:15:36.027241 7ffb8e9c3700 mds0.3 handle_mds_map i am now mds0.3
2011-09-12 18:15:36.027259 7ffb8e9c3700 mds0.3 handle_mds_map state change up:reconnect --> up:rejoin
2011-09-12 18:15:36.027266 7ffb8e9c3700 mds0.3 rejoin_joint_start
2011-09-12 18:15:36.018540 7ffb8e9c3700 mds0.3 reconnect_done
2011-09-12 18:15:36.027241 7ffb8e9c3700 mds0.3 handle_mds_map i am now mds0.3
2011-09-12 18:15:36.027259 7ffb8e9c3700 mds0.3 handle_mds_map state change up:reconnect --> up:rejoin
2011-09-12 18:15:36.027266 7ffb8e9c3700 mds0.3 rejoin_joint_start
mds/MDCache.cc: In function 'CDir* MDCache::rejoin_invent_dirfrag(dirfrag_t)', in thread '0x7ffb8e9c3700'
mds/MDCache.cc: 3937: FAILED assert(in->is_dir())
ceph version (commit:)
1: (MDCache::rejoin_invent_dirfrag(dirfrag_t)+0x219) [0x570a89]
2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x418e) [0x590b0e]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1b3) [0x5a55f3]
4: (MDCache::dispatch(Message*)+0x105) [0x5ace65]
5: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
6: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
7: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
8: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
10: (()+0x69ca) [0x7ffb922429ca]
11: (clone()+0x6d) [0x7ffb90cc770d]
ceph version (commit:)
1: (MDCache::rejoin_invent_dirfrag(dirfrag_t)+0x219) [0x570a89]
2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x418e) [0x590b0e]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1b3) [0x5a55f3]
4: (MDCache::dispatch(Message*)+0x105) [0x5ace65]
5: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
6: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
7: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
8: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
10: (()+0x69ca) [0x7ffb922429ca]
11: (clone()+0x6d) [0x7ffb90cc770d]
*** Caught signal (Aborted) **
in thread 0x7ffb8e9c3700
ceph version (commit:)
1: /usr/bin/cmds() [0x794e74]
2: (()+0xf8f0) [0x7ffb9224b8f0]
3: (gsignal()+0x35) [0x7ffb90c14a75]
4: (abort()+0x180) [0x7ffb90c185c0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7ffb914ca8e5]
6: (()+0xcad16) [0x7ffb914c8d16]
7: (()+0xcad43) [0x7ffb914c8d43]
8: (()+0xcae3e) [0x7ffb914c8e3e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x39f) [0x716cdf]
10: (MDCache::rejoin_invent_dirfrag(dirfrag_t)+0x219) [0x570a89]
11: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x418e) [0x590b0e]
12: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1b3) [0x5a55f3]
13: (MDCache::dispatch(Message*)+0x105) [0x5ace65]
14: (MDS::handle_deferrable_message(Message*)+0x60f) [0x4a333f]
15: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
16: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
17: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
18: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
19: (()+0x69ca) [0x7ffb922429ca]
20: (clone()+0x6d) [0x7ffb90cc770d]
The filesystem stayed stuck on the clients.
I stopped cmds on the other node and then started it on both nodes; this time both started successfully. Within a minute the filesystem on the clients started working again.
I started the test scripts again and cmds crashed almost immediately (on the other node this time) with a different error:
mds/CInode.cc: In function 'virtual void CInode::auth_unpin(void*)', in thread '0x7f084ddab700'
mds/CInode.cc: 1946: FAILED assert(auth_pins >= 0)
ceph version (commit:)
1: (CInode::auth_unpin(void*)+0x49e) [0x66dd7e]
2: (Locker::eval_gather(SimpleLock*, bool, bool*, std::list<Context*, std::allocator<Context*> >*)+0x4ef) [0x5e4c1f]
3: (Locker::handle_file_lock(ScatterLock*, MLock*)+0xf22) [0x5f6582]
4: (Locker::handle_lock(MLock*)+0x1e6) [0x5f6be6]
5: (MDS::handle_deferrable_message(Message*)+0x62f) [0x4a335f]
6: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
7: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
8: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
10: (()+0x69ca) [0x7f085162a9ca]
11: (clone()+0x6d) [0x7f08500af70d]
ceph version (commit:)
1: (CInode::auth_unpin(void*)+0x49e) [0x66dd7e]
2: (Locker::eval_gather(SimpleLock*, bool, bool*, std::list<Context*, std::allocator<Context*> >*)+0x4ef) [0x5e4c1f]
3: (Locker::handle_file_lock(ScatterLock*, MLock*)+0xf22) [0x5f6582]
4: (Locker::handle_lock(MLock*)+0x1e6) [0x5f6be6]
5: (MDS::handle_deferrable_message(Message*)+0x62f) [0x4a335f]
6: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
7: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
8: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
10: (()+0x69ca) [0x7f085162a9ca]
11: (clone()+0x6d) [0x7f08500af70d]
*** Caught signal (Aborted) **
in thread 0x7f084ddab700
ceph version (commit:)
1: /usr/bin/cmds() [0x794e74]
2: (()+0xf8f0) [0x7f08516338f0]
3: (gsignal()+0x35) [0x7f084fffca75]
4: (abort()+0x180) [0x7f08500005c0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f08508b28e5]
6: (()+0xcad16) [0x7f08508b0d16]
7: (()+0xcad43) [0x7f08508b0d43]
8: (()+0xcae3e) [0x7f08508b0e3e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x39f) [0x716cdf]
10: (CInode::auth_unpin(void*)+0x49e) [0x66dd7e]
11: (Locker::eval_gather(SimpleLock*, bool, bool*, std::list<Context*, std::allocator<Context*> >*)+0x4ef) [0x5e4c1f]
12: (Locker::handle_file_lock(ScatterLock*, MLock*)+0xf22) [0x5f6582]
13: (Locker::handle_lock(MLock*)+0x1e6) [0x5f6be6]
14: (MDS::handle_deferrable_message(Message*)+0x62f) [0x4a335f]
15: (MDS::_dispatch(Message*)+0x5e5) [0x4b92f5]
16: (MDS::ms_dispatch(Message*)+0x71) [0x4ba5c1]
17: (SimpleMessenger::dispatch_entry()+0x879) [0x722769]
18: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49439c]
19: (()+0x69ca) [0x7f085162a9ca]
20: (clone()+0x6d) [0x7f08500af70d]
Repeating the test once more produced the same auth_unpin error.
This is with the latest code from master.