Bug #1390

MDS crash in function 'bool Locker::issue_caps(CInode*, Capability*)', in thread '0x7f64cb2f5700'

Added by Damien Churchill over 12 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

Yesterday I was playing about with snapshots, which seemed to expose a bug in btrfs (delayed_inode.c); I have since patched it and deployed a new kernel (3.0.1 + Ubuntu patches + https://patchwork.kernel.org/patch/1019032/) across the cluster. When trying to get Ceph back up and running, I'm encountering a crash in the metadata server. This is completely reproducible and is preventing the cluster from starting up again. I'm running Ceph 0.32, installed via the packages.

$ ceph -s
2011-08-11 11:29:47.270271    pg v114261: 2780 pgs: 2780 active+clean; 849 GB data, 1714 GB used, 4129 GB / 5962 GB avail
2011-08-11 11:29:47.275199   mds e531: 1/1/1 up {0=dev1=up:active(laggy or crashed)}
2011-08-11 11:29:47.275235   osd e642: 13 osds: 13 up, 13 in
2011-08-11 11:29:47.275311   log 2011-08-11 11:20:28.092853 mon0 10.10.20.1:6789/0 174 : [INF] mds0 10.10.20.1:6800/14818 up:active
2011-08-11 11:29:47.275398   mon e1: 5 mons at {0=10.10.20.1:6789/0,1=10.10.20.2:6789/0,2=10.10.20.3:6789/0,3=10.10.20.4:6789/0,4=10.10.20.5:6789/0}

Here is the end of the log. I'm attaching the whole log, from mds start to when it crashes.

2011-08-11 11:20:28.062883 7f64cb2f5700 -- 10.10.20.1:6800/14818 --> 10.10.20.5:6789/0 -- mdsbeacon(259239/dev1 up:active seq 19 v502) v1 -- ?+0 0x2b7bc80 con 0x14c6a00
2011-08-11 11:20:28.102248 7f64cb2f5700 -- 10.10.20.1:6800/14818 <== mon4 10.10.20.5:6789/0 32 ==== mdsmap(e 503) v1 ==== 526+0+0 (1827618904 0 0) 0x14d6200 con 0x14c6a00
2011-08-11 11:20:28.102279 7f64cb2f5700 mds0.35 handle_mds_map epoch 503 from mon4
2011-08-11 11:20:28.102317 7f64cb2f5700 mds0.35      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
2011-08-11 11:20:28.102326 7f64cb2f5700 mds0.35  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
2011-08-11 11:20:28.102337 7f64cb2f5700 mds0.35 map says i am 10.10.20.1:6800/14818 mds0.35 state up:active
2011-08-11 11:20:28.102354 7f64cb2f5700 mds0.35 handle_mds_map i am now mds0.35
2011-08-11 11:20:28.102362 7f64cb2f5700 mds0.35 handle_mds_map state change up:rejoin --> up:active
2011-08-11 11:20:28.102368 7f64cb2f5700 mds0.35 recovery_done -- successful recovery!
2011-08-11 11:20:28.104159 7f64cb2f5700 mds0.tableserver(anchortable) finish_recovery
2011-08-11 11:20:28.104211 7f64cb2f5700 mds0.tableserver(snaptable) finish_recovery
2011-08-11 11:20:28.104221 7f64cb2f5700 mds0.tableclient(anchortable) finish_recovery
2011-08-11 11:20:28.104228 7f64cb2f5700 mds0.tableclient(snaptable) finish_recovery
2011-08-11 11:20:28.104235 7f64cb2f5700 mds0.cache start_recovered_truncates
2011-08-11 11:20:28.104251 7f64cb2f5700 mds0.cache do_file_recover 3 queued, 0 recovering
2011-08-11 11:20:28.104274 7f64cb2f5700 log [WRN] : bad client_range {256955=0-0@0} on ino 100000030a4
2011-08-11 11:20:28.104301 7f64cb2f5700 mds0.cache do_file_recover skipping 0 [inode 100000030a4 [2,5] /fileserver/backups/kvm1/scripts/disk.img auth v16 ap=1+0 recovering s=0 n(v0 1=1+0) (ifile scan) (iversion lock) cr={256955=0-0@0} | dirty authpin 0x54bfa40]
2011-08-11 11:20:28.104319 7f64cb2f5700 mds0.cache.ino(100000030a4) auth_unpin by 0x14cc000 on [inode 100000030a4 [2,5] /fileserver/backups/kvm1/scripts/disk.img auth v16 s=0 n(v0 1=1+0) (ifile scan) (iversion lock) cr={256955=0-0@0} | dirty 0x54bfa40] now 0+0
2011-08-11 11:20:28.104341 7f64cb2f5700 mds0.cache.dir(100000030a2) adjust_nested_auth_pins -1/-1 on [dir 100000030a2 /fileserver/backups/kvm1/scripts/ [2,head] auth v=22 cv=0/0 ap=0+2+2 state=1610612738|complete f(v1 m2011-08-10 13:56:07.999987 3=3+0) n(v0 rc2011-08-10 13:56:07.999987 3=3+0) hs=3+0,ss=3+0 dirty=6 | child dirty 0x367c050] by 0x14cc000 count now 0 + 2
2011-08-11 11:20:28.104362 7f64cb2f5700 mds0.locker file_eval wanted= loner_wanted= other_wanted=  filelock=(ifile scan) on [inode 100000030a4 [2,5] /fileserver/backups/kvm1/scripts/disk.img auth v16 s=0 n(v0 1=1+0) (ifile scan) (iversion lock) cr={256955=0-0@0} | dirty 0x54bfa40]
2011-08-11 11:20:28.104377 7f64cb2f5700 mds0.locker file_eval stable, bump to sync (ifile scan) on [inode 100000030a4 [2,5] /fileserver/backups/kvm1/scripts/disk.img auth v16 s=0 n(v0 1=1+0) (ifile scan) (iversion lock) cr={256955=0-0@0} | dirty 0x54bfa40]
2011-08-11 11:20:28.104391 7f64cb2f5700 mds0.locker simple_sync on (ifile scan) on [inode 100000030a4 [2,5] /fileserver/backups/kvm1/scripts/disk.img auth v16 s=0 n(v0 1=1+0) (ifile scan) (iversion lock) cr={256955=0-0@0} | dirty 0x54bfa40]
2011-08-11 11:20:28.104418 7f64cb2f5700 mds0.locker issue_caps allowed=pAsLsXsFscr, xlocker allowed=pAsLsXsFscr on [inode 100000030a4 [2,5] /fileserver/backups/kvm1/scripts/disk.img auth v16 s=0 n(v0 1=1+0) (iversion lock) cr={256955=0-0@0} | dirty 0x54bfa40]
mds/Locker.cc: In function 'bool Locker::issue_caps(CInode*, Capability*)', in thread '0x7f64cb2f5700'
mds/Locker.cc: 1537: FAILED assert(in->is_head())
 ceph version 0.32 (commit:c08d08baa6a945d989427563e46c992f757ad5eb)
 1: (Locker::issue_caps(CInode*, Capability*)+0xc1b) [0x5e043b]
 2: (MDCache::do_file_recover()+0x214) [0x566194]
 3: (MDS::recovery_done()+0x13a) [0x4bd65a]
 4: (MDS::handle_mds_map(MMDSMap*)+0x27f5) [0x4cbbd5]
 5: (MDS::handle_core_message(Message*)+0x99b) [0x4cc9fb]
 6: (MDS::_dispatch(Message*)+0x157) [0x4ccc97]
 7: (MDS::ms_dispatch(Message*)+0x6d) [0x4ce19d]
 8: (SimpleMessenger::dispatch_entry()+0x7db) [0x74429b]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4aa42c]
 10: (()+0x6d8c) [0x7f64ce9acd8c]
 11: (clone()+0x6d) [0x7f64cd5f904d]
 ceph version 0.32 (commit:c08d08baa6a945d989427563e46c992f757ad5eb)
 1: (Locker::issue_caps(CInode*, Capability*)+0xc1b) [0x5e043b]
 2: (MDCache::do_file_recover()+0x214) [0x566194]
 3: (MDS::recovery_done()+0x13a) [0x4bd65a]
 4: (MDS::handle_mds_map(MMDSMap*)+0x27f5) [0x4cbbd5]
 5: (MDS::handle_core_message(Message*)+0x99b) [0x4cc9fb]
 6: (MDS::_dispatch(Message*)+0x157) [0x4ccc97]
 7: (MDS::ms_dispatch(Message*)+0x6d) [0x4ce19d]
 8: (SimpleMessenger::dispatch_entry()+0x7db) [0x74429b]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4aa42c]
 10: (()+0x6d8c) [0x7f64ce9acd8c]
 11: (clone()+0x6d) [0x7f64cd5f904d]
*** Caught signal (Aborted) **
 in thread 0x7f64cb2f5700
 ceph version 0.32 (commit:c08d08baa6a945d989427563e46c992f757ad5eb)
 1: /usr/bin/cmds() [0x7b19f2]
 2: (()+0xfc60) [0x7f64ce9b5c60]
 3: (gsignal()+0x35) [0x7f64cd546d05]
 4: (abort()+0x186) [0x7f64cd54aab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f64cddfdd6d]
 6: (()+0xb9f16) [0x7f64cddfbf16]
 7: (()+0xb9f43) [0x7f64cddfbf43]
 8: (()+0xba03e) [0x7f64cddfc03e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x729d86]
 10: (Locker::issue_caps(CInode*, Capability*)+0xc1b) [0x5e043b]
 11: (MDCache::do_file_recover()+0x214) [0x566194]
 12: (MDS::recovery_done()+0x13a) [0x4bd65a]
 13: (MDS::handle_mds_map(MMDSMap*)+0x27f5) [0x4cbbd5]
 14: (MDS::handle_core_message(Message*)+0x99b) [0x4cc9fb]
 15: (MDS::_dispatch(Message*)+0x157) [0x4ccc97]
 16: (MDS::ms_dispatch(Message*)+0x6d) [0x4ce19d]
 17: (SimpleMessenger::dispatch_entry()+0x7db) [0x74429b]
 18: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4aa42c]
 19: (()+0x6d8c) [0x7f64ce9acd8c]
 20: (clone()+0x6d) [0x7f64cd5f904d]

Files

mds.dev1.log.gz (14.5 MB) — Damien Churchill, 08/11/2011 03:36 AM
Actions #1

Updated by Sage Weil over 12 years ago

  • Category set to 1
  • Assignee set to Sage Weil
  • Target version set to v0.34
Actions #2

Updated by Sage Weil over 12 years ago

  • Status changed from New to 7

Pushed commit:26871eff1740d6ec5b9b287bf47e098db913fb27 (branch wip-needissue) that should fix this. Can you let me know if that does the trick?

Thanks! The bug report had exactly the information I needed. :)

Actions #3

Updated by Damien Churchill over 12 years ago

Will it be safe to cherry-pick this onto 0.32? Otherwise I can try packaging up that branch and deploying it!

Actions #4

Updated by Damien Churchill over 12 years ago

It's looking good: the cluster has started up okay, with no metadata server crashes :-)

Actions #5

Updated by Sage Weil over 12 years ago

  • Status changed from 7 to Resolved

Pushed to next/master; this will be in v0.33.

Actions #6

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.34)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
