Bug #1527

mds: assert in handle_cache_expire

Added by Brandon Seibel over 12 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Running a 4-MDS cluster: 2 active ({0=mds01=up:active,1=mds02=up:active}) and 2 up:standby-replay (mds01b and mds02b).

While rsyncing with --inplace (to avoid the rename overhead) to a cfuse-mounted cluster, one of the MDSs dies seemingly at random; in this case mds02 died:

mds/MDCache.cc: In function 'void MDCache::handle_cache_expire(MCacheExpire*)', in thread '0x7ffdbc380700'
mds/MDCache.cc: 5895: FAILED assert(!(parent_dir->is_auth() && parent_dir->is_exporting()) || (migrator->get_export_state(parent_dir) == Migrator::EXPORT_WARNING && !migrator->export_has_warned(parent_dir, from)))
ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
1: (MDCache::handle_cache_expire(MCacheExpire*)+0x15bb) [0x59407b]
2: (MDCache::dispatch(Message*)+0x145) [0x599d75]
3: (MDS::handle_deferrable_message(Message*)+0x67f) [0x4b749f]
4: (MDS::_dispatch(Message*)+0x5fa) [0x4c778a]
5: (MDS::ms_dispatch(Message*)+0x6d) [0x4c894d]
6: (SimpleMessenger::dispatch_entry()+0x7bb) [0x6e9e6b]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6e5c]
8: (()+0x6b40) [0x7ffdbfc3db40]
9: (clone()+0x6d) [0x7ffdbe67336d]
ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
1: (MDCache::handle_cache_expire(MCacheExpire*)+0x15bb) [0x59407b]
2: (MDCache::dispatch(Message*)+0x145) [0x599d75]
3: (MDS::handle_deferrable_message(Message*)+0x67f) [0x4b749f]
4: (MDS::_dispatch(Message*)+0x5fa) [0x4c778a]
5: (MDS::ms_dispatch(Message*)+0x6d) [0x4c894d]
6: (SimpleMessenger::dispatch_entry()+0x7bb) [0x6e9e6b]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6e5c]
8: (()+0x6b40) [0x7ffdbfc3db40]
9: (clone()+0x6d) [0x7ffdbe67336d]
*** Caught signal (Aborted) **

Judging by the assert, I'm assuming we're just bailing if we try to expire an entry that is mid-migration?
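
Restating the assertion text above as a standalone condition (my paraphrase of the assert at mds/MDCache.cc:5895, not the actual Ceph source; the function and parameter names below are mine):

// Paraphrase of the asserted condition: if we are the auth for parent_dir
// and are exporting it, an incoming cache expire from 'from' is only
// acceptable while the export is still in the WARNING stage and 'from'
// has not yet been sent the warning.
bool expire_allowed(bool dir_is_auth, bool dir_is_exporting,
                    bool export_in_warning_stage, bool peer_already_warned)
{
  if (!(dir_is_auth && dir_is_exporting))
    return true;  // dir not being exported: any expire is fine
  return export_in_warning_stage && !peer_already_warned;
}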

Just prior in the logs:

2011-09-09 13:00:12.525389 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e32 /test.com/www/ [2,head] auth{0=1} v=2304 cv=0/0 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:27:40.579874 2=0+2) n(v149 rc2011-09-09 13:00:10.531403 b95127759 6163=5630+533) hs=2+0,ss=0+0 dirty=2 | child replicated dirty 0x6ae57f8]
2011-09-09 13:01:22.526901 7ffdbc380700 mds1.bal reexporting [dir 10000008e31 /hybridcars.com/ [2,head] auth{0=1} v=1823 cv=0/0 dir_auth=1 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v210 rc2011-09-09 13:01:20.575941 b111783894 6398=5864+534)/n(v210 rc2011-09-09 13:01:20.014747 b111688242 6395=5861+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608] pop 13.8947 back to mds0
2011-09-09 13:01:22.526954 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1823 cv=0/0 dir_auth=1 ap=0+4+5 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v210 rc2011-09-09 13:01:20.575941 b111783894 6398=5864+534)/n(v210 rc2011-09-09 13:01:20.014747 b111688242 6395=5861+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608]
2011-09-09 13:02:02.528112 7ffdbc380700 mds1.bal reexporting [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1883 cv=0/0 dir_auth=1 ap=0+3+4 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v216 rc2011-09-09 13:01:59.790317 b118892781 6563=6029+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608] pop 79.2118 back to mds0
2011-09-09 13:02:02.528168 7ffdbc380700 mds1.migrator nicely exporting to mds0 [dir 10000008e31 /test.com/ [2,head] auth{0=1} v=1883 cv=0/0 dir_auth=1 ap=0+3+4 state=1610612738|complete f(v0 m2011-09-09 12:25:51.259099 1=0+1) n(v216 rc2011-09-09 13:01:59.790317 b118892781 6563=6029+534) hs=1+0,ss=0+0 dirty=1 | child subtree replicated dirty 0xbe3f608]

Shortly after mds02 died, its slave (mds02b) died with the same assert.


Files

mds.mds02.log.1.gz (130 KB) mds.mds02.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds02b.log.1.gz (930 KB) mds.mds02b.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds01.log.1.gz (113 KB) mds.mds01.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.mds01b.log.1.gz (794 KB) mds.mds01b.log.1.gz Brandon Seibel, 09/10/2011 06:57 AM
mds.b.log.gz (32.3 MB) mds.b.log.gz Sage Weil, 10/21/2012 02:25 PM
Actions #1

Updated by Brandon Seibel over 12 years ago

Sorry, on second look I'm hitting what is probably a very rare race (and one amplified by my VM test boxes): the test to see if we're exporting finds that we're not, but we still land in the assert. Nuts.

Actions #2

Updated by Sage Weil over 12 years ago

  • Category set to 1

It looks like you had logging cranked up. Any chance you can attach all the MDS logs leading up to this crash? The migration vs cache expire interaction is tricky, and we'll probably need all that context to sort out what went wrong.

Actions #3

Updated by Brandon Seibel over 12 years ago

Unfortunately it's not turned up very much, but you can see the exporting.

Actions #4

Updated by Sage Weil over 12 years ago

  • Assignee set to Sage Weil

Hmm, I was wrong, the logging isn't all the way up... there are just a few stray douts that are making noise. If you have the disk space, can leave debug mds = 20 and debug ms = 1 on for the MDS, and can reproduce, that would be awesome. I'll see if I can't reproduce it here.
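
Those settings would go in ceph.conf along these lines (putting them under the [mds] section is an assumption here; [global] would also apply them to all daemons):

[mds]
        debug mds = 20
        debug ms = 1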

Actions #5

Updated by Brandon Seibel over 12 years ago

I'll try to as well; I'm currently hitting bug #1538 far more often, which is making this one difficult to reproduce.

Actions #6

Updated by Sage Weil about 12 years ago

  • Assignee deleted (Sage Weil)
Actions #7

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
Actions #8

Updated by Sage Weil over 11 years ago

  • File mds.b.log.gz mds.b.log.gz added
  • Category set to 47
  • Status changed from New to Fix Under Review
  • Assignee set to Sage Weil

Reproduced this with a log. The problem is the preceding conditional not including the EXPORTING stage; testing the fix.
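
In other words (a sketch of the shape of the change only, not the actual patch; the enum and function below are illustrative stand-ins for the Migrator internals): the guard that defers an incoming expire while the directory is being exported also needs to cover the EXPORTING stage, not just WARNING, so the expire gets held until the export settles instead of falling through to the assert.

// Illustrative sketch with stand-in types, not Ceph source.
enum ExportStage { WARNING, EXPORTING, OTHER };

bool should_defer_expire(bool dir_is_auth, bool dir_is_exporting,
                         ExportStage stage)
{
  // Previously only the WARNING stage deferred the expire; an expire
  // arriving during EXPORTING fell through and tripped the assert.
  return dir_is_auth && dir_is_exporting &&
         (stage == WARNING || stage == EXPORTING);
}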

Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions #10

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added