Project

General

Profile

Actions

Bug #1527

closed

mds: assert in handle_cache_expire

Added by Brandon Seibel over 12 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Running a multi 4 mds cluster: {0=mds01=up:active,1=mds02=up:active}, 2 up:standby-replay (mds01b and mds02b)

While rsyncing --inplace (to avoid the rename overhead) to a cfuse mounted cluster, seemingly randomly one of the mds's will die, in this case mds02 died:

mds/MDCache.cc: In function 'void MDCache::handle_cache_expire(MCacheExpire*)', in thread '0x7ffdbc380700'
mds/MDCache.cc: 5895: FAILED assert(!(parent_dir->is_auth() && parent_dir->is_exporting()) || (migrator->get_export_state(parent_dir) == Migrator::EXPORT_WARNING && !migrator->export_has_warned(parent_dir, from)))
ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
1: (MDCache::handle_cache_expire(MCacheExpire*)+0x15bb) [0x59407b]
2: (MDCache::dispatch(Message*)+0x145) [0x599d75]
3: (MDS::handle_deferrable_message(Message*)+0x67f) [0x4b749f]
4: (MDS::_dispatch(Message*)+0x5fa) [0x4c778a]
5: (MDS::ms_dispatch(Message*)+0x6d) [0x4c894d]
6: (SimpleMessenger::dispatch_entry()+0x7bb) [0x6e9e6b]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6e5c]
8: (()+0x6b40) [0x7ffdbfc3db40]
9: (clone()+0x6d) [0x7ffdbe67336d]
ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
1: (MDCache::handle_cache_expire(MCacheExpire*)+0x15bb) [0x59407b]
2: (MDCache::dispatch(Message*)+0x145) [0x599d75]
3: (MDS::handle_deferrable_message(Message*)+0x67f) [0x4b749f]
4: (MDS::_dispatch(Message*)+0x5fa) [0x4c778a]
5: (MDS::ms_dispatch(Message*)+0x6d) [0x4c894d]
6: (SimpleMessenger::dispatch_entry()+0x7bb) [0x6e9e6b]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6e5c]
8: (()+0x6b40) [0x7ffdbfc3db40]
9: (clone()+0x6d) [0x7ffdbe67336d]
  • Caught signal (Aborted) **

I'm assuming by the assert we're just bailing if we try to expire an entry that is mid migration?

Just prior in the logs:

2011-09-09 13:00:12.525389 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e32 /test.com/www/ [2,head] auth{0=1} v=2304 cv=0/0 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:27:40.579874 2=0+2) n(v149 rc2011-09-09 13:00
:10.531403 b95127759 6163=5630+533) hs=2+0,ss=0+0 dirty=2 | child replicated dirty 0x6ae57f8]
2011-09-09 13:01:22.526901 7ffdbc380700 mds1.bal reexporting [dir 10000008e31 /hybridcars.com/ [2,head] auth{0=1} v=1823 cv=0/0 dir_auth=1 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v210 rc2011-09-09 13:01:20.575941
b111783894 6398=5864+534)/n(v210 rc2011-09-09 13:01:20.014747 b111688242 6395=5861+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608] pop 13.8947 back to mds0
2011-09-09 13:01:22.526954 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1823 cv=0/0 dir_auth=1 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v210 rc2011-09-0
9 13:01:20.575941 b111783894 6398=5864+534)/n(v210 rc2011-09-09 13:01:20.014747 b111688242 6395=5861+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608]
2011-09-09 13:02:02.528112 7ffdbc380700 mds1.bal reexporting [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1883 cv=0/0 dir_auth=1 ap=0+3+4 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v216 rc2011-09-09 13:01:59.790317
b118892781 6563=6029+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608] pop 79.2118 back to mds0
2011-09-09 13:02:02.528168 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1883 cv=0/0 dir_auth=1 ap=0+3+4 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v216 rc2011-09-0
9 13:01:59.790317 b118892781 6563=6029+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608]

Shortly after mds02 died, its slave(mds02b) died with the same assert.


Files

mds.mds02.log.1.gz (130 KB) mds.mds02.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds02b.log.1.gz (930 KB) mds.mds02b.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds01.log.1.gz (113 KB) mds.mds01.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds01b.log.1.gz (794 KB) mds.mds01b.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.b.log.gz (32.3 MB) mds.b.log.gz Sage Weil, 10/21/2012 02:25 PM
Actions

Also available in: Atom PDF