Bug #689

cmds crash: mds/CInode.h:1088: FAILED assert(ref_set.count(by) == 0)

Added by Ravi Pinjala over 13 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High

Description

I just upgraded to Ceph 0.24, and when I started up the cluster I got this crash.

2011-01-06 21:23:48.744864 b6855b70 mds0.cache creating system inode with ino:601
2011-01-06 21:23:48.745488 b6855b70 log [ERR] : unmatched fragstat size on single dirfrag 100, inode has f(v0 m2011-01-06 21:23:48.745063 1=0+1), dirfrag has f(v0 m2011-01-06 21:23:48.745063 3=1+2)
2011-01-06 21:23:49.293433 b6855b70 bad get [inode 600 [...2,head] ~mds0/stray/ auth v26003 f(v8833 m2011-01-04 21:50:03.135177 8846=8435+411) n(v11792 rc2011-01-04 21:50:03.135177 b25052311635 a-18 8846=8435+411) (inest lock dirty) (ifile lock dirty) (iversion lock) | dirtyscattered dirfrag stray dirty 0x11102a68] by 19 stray was 5 (-1005,-1005,-1,19,1001)
mds/CInode.h: In function 'virtual void CInode::bad_get(int)':
mds/CInode.h:1088: FAILED assert(ref_set.count(by) == 0)
ceph version 0.24 (commit:180a4176035521940390f4ce24ee3eb7aa290632)
1: (CInode::bad_put(int)+0) [0x827b090]
2: (MDSCacheObject::get(int)+0x153) [0x813e463]
3: (MDCache::populate_mydir()+0x8a) [0x81a7e5a]
4: (MDCache::_create_system_file_finish(Mutation*, CDentry*, Context*)+0x181) [0x819f501]
5: (C_MDC_CreateSystemFile::finish(int)+0x29) [0x81d6c29]
6: (finish_contexts(std::list<Context*, std::allocator<Context*> >&, int)+0x6b) [0x81d663b]
7: (Journaler::_finish_flush(int, long long, utime_t, bool)+0x983) [0x82f2f53]
8: (Journaler::C_Flush::finish(int)+0x3f) [0x82fb24f]
9: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x801) [0x82d8e31]
10: (MDS::_dispatch(Message*)+0x2ae5) [0x80eaa15]
11: (MDS::ms_dispatch(Message*)+0x62) [0x80eb142]
12: (SimpleMessenger::dispatch_entry()+0x899) [0x80b8649]
13: (SimpleMessenger::DispatchThread::entry()+0x22) [0x80b30f2]
14: (Thread::_entry_func(void*)+0x11) [0x80c9101]
15: (()+0x5cc9) [0x6ffcc9]
16: (clone()+0x5e) [0x7e669e]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

(The objdump it mentions is attached.)
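For readers unfamiliar with the MDS pin accounting: the failed assertion guards against taking the same named pin twice on a cache object. Below is a minimal sketch of that debug-refcounting pattern with simplified, hypothetical names; only `ref_set`, the `assert`, and the pin-id value 19 ("by 19" in the log) are taken from the report, the rest is illustrative.

```cpp
#include <cassert>
#include <iostream>
#include <set>

// Hypothetical, simplified stand-in for an MDS cache object.
// Each pin id may be held at most once; taking the same pin a
// second time is what trips "FAILED assert(ref_set.count(by) == 0)".
struct CacheObject {
    static constexpr int PIN_STRAY = 19;  // pin id seen in the log ("by 19")

    int ref = 0;                 // total reference count
    std::multiset<int> ref_set;  // which pin ids currently hold a reference

    void get(int by) {
        if (ref_set.count(by) != 0) {
            // Mirrors CInode::bad_get(): the pin is already held.
            std::cerr << "bad get by " << by << " ref was " << ref << "\n";
            assert(ref_set.count(by) == 0);
        }
        ref++;
        ref_set.insert(by);
    }

    void put(int by) {
        assert(ref_set.count(by) == 1);  // must release a pin we actually hold
        ref--;
        ref_set.erase(ref_set.find(by));
    }
};
```

In the crash above, `populate_mydir()` took the stray pin on inode 600 when that pin was already held, so the guard fired.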

Actions #1

Updated by Ravi Pinjala over 13 years ago

Apparently the objdump is a bit big to post as an attachment. I've uploaded it at http://p-static.net/cmds-objdump.gz

Actions #2

Updated by Ravi Pinjala over 13 years ago

Here's (a portion of) the mds log with debug mds = 20:

2011-01-08 14:16:35.724241 b68bdb70 mds0.cache.ino(100) project_inode 0x9008160
2011-01-08 14:16:35.724278 b68bdb70 mds0.cache.ino(100) finish_scatter_gather_update 64 on [inode 100 [...2,head] ~mds0/ auth v2441 pv2442 ap=2 snaprealm=0x9010140 f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8845=8435+410)/n() (inest lock w=1 dirty) (ifile lock->sync w=1 dirty) (iversion lock) | dirtyscattered lock dirfrag dirty authpin 0x9012554]
2011-01-08 14:16:35.724368 b68bdb70 mds0.cache.ino(100) orig dirstat f(v2 m2011-01-08 14:16:34.498053 5=1+4)
2011-01-08 14:16:35.724409 b68bdb70 mds0.cache.ino(100) 0/0 [dir 100 ~mds0/ [2,head] auth v=26012 cv=0/0 dir_auth=0 ap=1+0+0 state=1610612738|complete f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2011-01-08 14:16:34.498053 b25052311635 a-18 8851=8436+415)/n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8847=8436+411) hs=5+0,ss=0+0 dirty=4 | child subtree dirty authpin 0x90504fc]
2011-01-08 14:16:35.724507 b68bdb70 mds0.cache.dir(100) project_fnode 0x8ffb300
2011-01-08 14:16:35.724539 b68bdb70 mds0.cache.ino(100) 0/0 fragstat f(v2 m2011-01-08 14:16:34.498053 5=1+4)
2011-01-08 14:16:35.724580 b68bdb70 mds0.cache.ino(100) 0/0 accounted_fragstat f(v2 m2011-01-08 14:16:34.498053 5=1+4)
2011-01-08 14:16:35.724622 b68bdb70 mds0.cache.ino(100) 0/0 updated accounted_fragstat f(v3 m2011-01-08 14:16:34.498053 5=1+4) on [dir 100 ~mds0/ [2,head] auth pv=26012 v=26012 cv=0/0 dir_auth=0 ap=1+0+0 state=1610612738|complete f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2011-01-08 14:16:34.498053 b25052311635 a-18 8851=8436+415)/n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8847=8436+411) hs=5+0,ss=0+0 dirty=4 | child subtree dirty authpin 0x90504fc]
2011-01-08 14:16:35.724726 b68bdb70 mds0.cache.ino(100) final dirstat f(v3 m2011-01-08 14:16:34.498053 5=1+4)
2011-01-08 14:16:35.724780 b68bdb70 mds0.cache.ino(100) encode_snap_blob snaprealm(100 seq 1 lc 0 cr 0 cps 1 snaps={} 0x9010140)
2011-01-08 14:16:35.724830 b68bdb70 mds0.cache.ino(100) finish_scatter_gather_update_accounted 64 on [inode 100 [...2,head] ~mds0/ auth v2441 pv2442 ap=2 snaprealm=0x9010140 f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8845=8435+410)/n() (inest lock w=1 dirty) (ifile lock->sync w=1 flushing) (iversion lock) | dirtyscattered lock dirfrag dirty authpin 0x9012554]
2011-01-08 14:16:35.724934 b68bdb70 mds0.cache.ino(100) journaling updated frag accounted_ on [dir 100 ~mds0/ [2,head] auth pv=26012 v=26012 cv=0/0 dir_auth=0 ap=1+0+0 state=1610612738|complete f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2011-01-08 14:16:34.498053 b25052311635 a-18 8851=8436+415)/n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8847=8436+411) hs=5+0,ss=0+0 dirty=4 | child subtree dirty authpin 0x90504fc]
2011-01-08 14:16:35.725027 b68bdb70 mds0.cache.dir(100) pre_dirty 26013
2011-01-08 14:16:35.725064 b68bdb70 mds0.cache.dir(100) auth_pin by 0x8ff7900 on [dir 100 ~mds0/ [2,head] auth pv=26013 v=26012 cv=0/0 dir_auth=0 ap=2+0+0 state=1610612738|complete f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2011-01-08 14:16:34.498053 b25052311635 a-18 8851=8436+415)/n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8847=8436+411) hs=5+0,ss=0+0 dirty=4 | child subtree dirty authpin 0x90504fc] count now 2 + 0
2011-01-08 14:16:35.725189 b68bdb70 mds0.log submit_entry 1296191220~796 : EUpdate scatter_writebehind [metablob 100, 1 dirs]
2011-01-08 14:16:35.725327 b68bdb70 mds0.locker wrlock_finish on (inest lock w=1 dirty) on [inode 100 [...2,head] ~mds0/ auth v2441 pv2442 ap=2 snaprealm=0x9010140 f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8845=8435+410)/n() (inest lock w=1 dirty) (ifile lock->sync w=1 flushing) (iversion lock) | dirtyscattered lock dirfrag dirty authpin 0x9012554]
2011-01-08 14:16:35.725489 b68bdb70 mds0.locker scatter_eval (inest lock dirty) on [inode 100 [...2,head] ~mds0/ auth v2441 pv2442 ap=2 snaprealm=0x9010140 f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8845=8435+410)/n() (inest lock dirty) (ifile lock->sync w=1 flushing) (iversion lock) | dirtyscattered lock dirfrag dirty authpin 0x9012554]
2011-01-08 14:16:35.725630 b68bdb70 mds0.cache.ino(100) auth_unpin by 0xc620000 on [inode 100 [...2,head] ~mds0/ auth v2441 pv2442 ap=1 snaprealm=0x9010140 f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8845=8435+410)/n() (inest lock dirty) (ifile lock->sync w=1 flushing) (iversion lock) | dirtyscattered lock dirfrag dirty authpin 0x9012554] now 1+0
2011-01-08 14:16:35.725840 b68bdb70 mds0.cache.dir(100) auth_unpin by 0xc620000 on [dir 100 ~mds0/ [2,head] auth pv=26013 v=26012 cv=0/0 dir_auth=0 ap=1+0+0 state=1610612738|complete f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2011-01-08 14:16:34.498053 b25052311635 a-18 8851=8436+415)/n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8847=8436+411) hs=5+0,ss=0+0 dirty=4 | child subtree dirty authpin 0x90504fc] count now 1 + 0
2011-01-08 14:16:35.725999 b68bdb70 mds0.cache open_root
2011-01-08 14:16:35.726053 b68bdb70 mds0.cache populate_mydir [dir 100 ~mds0/ [2,head] auth pv=26013 v=26012 cv=0/0 dir_auth=0 ap=1+0+0 state=1610612738|complete f(v2 m2011-01-08 14:16:34.498053 5=1+4) n(v2325 rc2011-01-08 14:16:34.498053 b25052311635 a-18 8851=8436+415)/n(v2325 rc2010-12-18 10:58:31.158056 b25052311635 a-18 8847=8436+411) hs=5+0,ss=0+0 dirty=4 | child subtree dirty authpin 0x90504fc]
2011-01-08 14:16:35.726218 b68bdb70 mds0.cache.dir(100) lookup (head, 'stray0')
2011-01-08 14:16:35.726267 b68bdb70 mds0.cache.dir(100) miss -> (stray1,head)
2011-01-08 14:16:35.726314 b68bdb70 mds0.cache.dir(100) lookup (head, 'stray')
2011-01-08 14:16:35.726360 b68bdb70 mds0.cache.dir(100) hit -> (stray,head)
2011-01-08 14:16:35.726406 b68bdb70 bad get [inode 600 [...2,head] ~mds0/stray/ auth v26003 f(v8833 m2011-01-04 21:50:03.135177 8846=8435+411) n(v11792 rc2011-01-04 21:50:03.135177 b25052311635 a-18 8846=8435+411) (inest lock dirty) (ifile lock dirty) (iversion lock) | dirtyscattered dirfrag stray dirty 0x101b5a68] by 19 stray was 5 (-1005,-1005,-1,19,1001)
mds/CInode.h: In function 'virtual void CInode::bad_get(int)':
mds/CInode.h:1088: FAILED assert(ref_set.count(by) == 0)
ceph version 0.24 (commit:180a4176035521940390f4ce24ee3eb7aa290632)
1: (CInode::bad_put(int)+0) [0x827b090]
2: (MDSCacheObject::get(int)+0x153) [0x813e463]
3: (MDCache::populate_mydir()+0x8a) [0x81a7e5a]
4: (MDCache::_create_system_file_finish(Mutation*, CDentry*, Context*)+0x181) [0x819f501]
5: (C_MDC_CreateSystemFile::finish(int)+0x29) [0x81d6c29]
6: (finish_contexts(std::list<Context*, std::allocator<Context*> >&, int)+0x6b) [0x81d663b]
7: (Journaler::_finish_flush(int, long long, utime_t, bool)+0x983) [0x82f2f53]
8: (Journaler::C_Flush::finish(int)+0x3f) [0x82fb24f]
9: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x801) [0x82d8e31]
10: (MDS::_dispatch(Message*)+0x2ae5) [0x80eaa15]
11: (MDS::ms_dispatch(Message*)+0x62) [0x80eb142]
12: (SimpleMessenger::dispatch_entry()+0x899) [0x80b8649]
13: (SimpleMessenger::DispatchThread::entry()+0x22) [0x80b30f2]
14: (Thread::_entry_func(void*)+0x11) [0x80c9101]
15: (()+0x5cc9) [0xbddcc9]
16: (clone()+0x5e) [0x43369e]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I don't know how much of the log is necessary to debug this, but I can post more if necessary.

Actions #3

Updated by Sage Weil over 13 years ago

Can you attach the whole log?

Actions #4

Updated by Ravi Pinjala over 13 years ago

Log sent in email due to paranoia about security ;)

Also, I've verified that I get the same crash with 0.24.1. The MDS went through the replay and reconnect stages, and then reached the active state, but when I tried to mount it crashed immediately with the same error. I've noticed that it always crashes on inode 600; could this be caused by filesystem corruption?

Actions #5

Updated by Sage Weil over 13 years ago

  • Category set to 1
  • Status changed from New to Resolved
  • Assignee set to Sage Weil
  • Target version set to v0.24.2

Thanks, Ravi. There's no corruption. Fixed by commit:6e6c22ea23918e72211eb9cbdac64aec4abf7e7c.

Actions #6

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.24.2)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
