Bug #1527

mds: assert in handle_cache_expire

Added by Brandon Seibel over 12 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Running a 4-MDS cluster: 2 active ({0=mds01=up:active,1=mds02=up:active}) and 2 up:standby-replay (mds01b and mds02b).

While rsyncing with --inplace (to avoid the rename overhead) to a cfuse-mounted cluster, one of the MDSs dies seemingly at random; in this case mds02 died:

mds/MDCache.cc: In function 'void MDCache::handle_cache_expire(MCacheExpire*)', in thread '0x7ffdbc380700'
mds/MDCache.cc: 5895: FAILED assert(!(parent_dir->is_auth() && parent_dir->is_exporting()) || (migrator->get_export_state(parent_dir) == Migrator::EXPORT_WARNING && !migrator->export_has_warned(parent_dir, from)))
ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
1: (MDCache::handle_cache_expire(MCacheExpire*)+0x15bb) [0x59407b]
2: (MDCache::dispatch(Message*)+0x145) [0x599d75]
3: (MDS::handle_deferrable_message(Message*)+0x67f) [0x4b749f]
4: (MDS::_dispatch(Message*)+0x5fa) [0x4c778a]
5: (MDS::ms_dispatch(Message*)+0x6d) [0x4c894d]
6: (SimpleMessenger::dispatch_entry()+0x7bb) [0x6e9e6b]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6e5c]
8: (()+0x6b40) [0x7ffdbfc3db40]
9: (clone()+0x6d) [0x7ffdbe67336d]
ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
1: (MDCache::handle_cache_expire(MCacheExpire*)+0x15bb) [0x59407b]
2: (MDCache::dispatch(Message*)+0x145) [0x599d75]
3: (MDS::handle_deferrable_message(Message*)+0x67f) [0x4b749f]
4: (MDS::_dispatch(Message*)+0x5fa) [0x4c778a]
5: (MDS::ms_dispatch(Message*)+0x6d) [0x4c894d]
6: (SimpleMessenger::dispatch_entry()+0x7bb) [0x6e9e6b]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6e5c]
8: (()+0x6b40) [0x7ffdbfc3db40]
9: (clone()+0x6d) [0x7ffdbe67336d]
*** Caught signal (Aborted) **

Judging by the assert, I'm assuming we're just bailing if we try to expire an entry that is mid-migration?
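
Restating the assertion text above as a standalone condition (my paraphrase of the assert at mds/MDCache.cc:5895, not the actual Ceph source; the function and parameter names below are mine):

// Paraphrase of the asserted condition: if we are the auth for parent_dir
// and are exporting it, an incoming cache expire from 'from' is only
// acceptable while the export is still in the WARNING stage and 'from'
// has not yet been sent the warning.
bool expire_allowed(bool dir_is_auth, bool dir_is_exporting,
                    bool export_in_warning_stage, bool peer_already_warned)
{
  if (!(dir_is_auth && dir_is_exporting))
    return true;  // dir not being exported: any expire is fine
  return export_in_warning_stage && !peer_already_warned;
}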

Just prior in the logs:

2011-09-09 13:00:12.525389 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e32 /test.com/www/ [2,head] auth{0=1} v=2304 cv=0/0 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:27:40.579874 2=0+2) n(v149 rc2011-09-09 13:00:10.531403 b95127759 6163=5630+533) hs=2+0,ss=0+0 dirty=2 | child replicated dirty 0x6ae57f8]
2011-09-09 13:01:22.526901 7ffdbc380700 mds1.bal reexporting [dir 10000008e31 /hybridcars.com/ [2,head] auth{0=1} v=1823 cv=0/0 dir_auth=1 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v210 rc2011-09-09 13:01:20.575941 b111783894 6398=5864+534)/n(v210 rc2011-09-09 13:01:20.014747 b111688242 6395=5861+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608] pop 13.8947 back to mds0
2011-09-09 13:01:22.526954 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1823 cv=0/0 dir_auth=1 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v210 rc2011-09-09 13:01:20.575941 b111783894 6398=5864+534)/n(v210 rc2011-09-09 13:01:20.014747 b111688242 6395=5861+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608]
2011-09-09 13:02:02.528112 7ffdbc380700 mds1.bal reexporting [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1883 cv=0/0 dir_auth=1 ap=0+3+4 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v216 rc2011-09-09 13:01:59.790317 b118892781 6563=6029+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608] pop 79.2118 back to mds0
2011-09-09 13:02:02.528168 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1883 cv=0/0 dir_auth=1 ap=0+3+4 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v216 rc2011-09-09 13:01:59.790317 b118892781 6563=6029+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608]

Shortly after mds02 died, its slave (mds02b) died with the same assert.


Files

mds.mds02.log.1.gz (130 KB) mds.mds02.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds02b.log.1.gz (930 KB) mds.mds02b.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds01.log.1.gz (113 KB) mds.mds01.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds01b.log.1.gz (794 KB) mds.mds01b.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.b.log.gz (32.3 MB) mds.b.log.gz Sage Weil, 10/21/2012 02:25 PM
Actions #1

Updated by Brandon Seibel over 12 years ago

Sorry, on second look I'm hitting what is probably a very rare race (and one amplified by my VM test boxes): the test to see if we're exporting finds that we're not, but we still land in the assert. Nuts.

Actions #2

Updated by Sage Weil over 12 years ago

  • Category set to 1

It looks like you had logging cranked up. Any chance you can attach all the MDS logs leading up to this crash? The migration vs cache expire interaction is tricky, and we'll probably need all that context to sort out what went wrong.

Actions #3

Updated by Brandon Seibel over 12 years ago

Unfortunately it's not turned up very much, but you can see the exporting.

Actions #4

Updated by Sage Weil over 12 years ago

  • Assignee set to Sage Weil

Hmm, I was wrong, the logging isn't all the way up... there are just a few stray douts that are making noise. If you have the disk space, can leave debug mds = 20 and debug ms = 1 on for the MDS, and can reproduce, that would be awesome. I'll see if I can't reproduce it here.
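
Those settings would go in ceph.conf along these lines (putting them under the [mds] section is an assumption here; [global] would also apply them to all daemons):

[mds]
        debug mds = 20
        debug ms = 1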

Actions #5

Updated by Brandon Seibel over 12 years ago

I'll try to as well; I'm currently hitting bug #1538 far more often, which is making this one difficult to reproduce.

Actions #6

Updated by Sage Weil about 12 years ago

  • Assignee deleted (Sage Weil)
Actions #7

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
Actions #8

Updated by Sage Weil over 11 years ago

  • File mds.b.log.gz mds.b.log.gz added
  • Category set to 47
  • Status changed from New to Fix Under Review
  • Assignee set to Sage Weil

Reproduced this with a log. The problem is the preceding conditional not including the EXPORTING stage; testing the fix.
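
In other words (a sketch of the shape of the change only, not the actual patch; the enum and function below are illustrative stand-ins for the Migrator internals): the guard that defers an incoming expire while the directory is being exported also needs to cover the EXPORTING stage, not just WARNING, so the expire gets held until the export settles instead of falling through to the assert.

// Illustrative sketch with stand-in types, not Ceph source.
enum ExportStage { WARNING, EXPORTING, OTHER };

bool should_defer_expire(bool dir_is_auth, bool dir_is_exporting,
                         ExportStage stage)
{
  // Previously only the WARNING stage deferred the expire; an expire
  // arriving during EXPORTING fell through and tripped the assert.
  return dir_is_auth && dir_is_exporting &&
         (stage == WARNING || stage == EXPORTING);
}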

Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions #10

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added