Project

General

Profile

Actions

Bug #663

closed

cmds segfaults

Added by Alexander Rødseth over 13 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello.

One of my cmds daemons segfaults.

After asking for advice on #ceph/irc.oftc.net, I used cdebugpack -c /etc/ceph/ceph.conf mds_crash.tar.gz to produce the following file:
http://68.178.169.4:81/mds_crash.tar.gz
(approx 124M)

Here's some commandline output:
http://aur.pastebin.com/eL7g2Uz6
http://aur.pastebin.com/911FaQNs

This is all running on Arch Linux, as a small test-cluster.
I have used the very latest git-version of both ceph, the linux kernel and btrfs-progs.

kernel: 2.6.37-rc6-00009-gb3444d1-dirty
ceph version 0.23.2 (commit:5bdae2af8c53adb2e059022c58813e97e7a7ba5d)
Btrfs v0.19-35-g1b444cd

To reproduce, I start cmds on that particular machine and just wait a few seconds.

The other nodes seems to be running just fine, except "1 crashed+peering":
pg v23517: 792 pgs: 791 active+clean, 1 crashed+peering; 270 GB data, 681 GB used, 4202 GB / 4890 GB avail

Hope there's at least some useful info in there somewhere.

Best regards,
Alexander Rødseth

Actions #1

Updated by Alexander Rødseth over 13 years ago

Uploaded another crash .tar.gz:
http://68.178.169.4:81/mds_crash2.tar.gz
(approx 125M)

Actions #2

Updated by Alexander Rødseth over 13 years ago

After even more advice on IRC, here's what happens on the commandline:

cmds -n mds.dango -D --debug_mds 20 > output.txt
mds/events/../CDir.h: In function 'void CDir::set_version(version_t)':
mds/events/../CDir.h:189: FAILED assert(projected_fnode.empty())
ceph version 0.23 (commit:5d1d8d0c4602be9819cc9f7aea562fccbb005a56)
1: (EMetaBlob::replay(MDS*, LogSegment*)+0x3b89) [0x667669]
2: (EExport::replay(MDS*)+0x41) [0x667c61]
3: (MDLog::_replay_thread()+0x10dd) [0x65b20d]
4: (MDLog::ReplayThread::entry()+0xd) [0x4c516d]
5: (Thread::_entry_func(void*)+0x7) [0x4a6c57]
6: (()+0x6cb0) [0x7f3906898cb0]
7: (clone()+0x6d) [0x7f3905aa59dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
[1] 521 abort (core dumped) cmds -n mds.dango -D --debug_mds 20 > output.txt

And the last lines of output.txt (a lot of repetetive text first):

[...]

2010-12-21 11:11:49.620740 7f39016ac710 mds2.cache.ino(100000011a8) finish_scatter_gather_update_accounted 1024 on [inode 100000011a8 [...2,head] /laptop_Diderich/paul - Copy (3)/ auth v5513 pv5514 ap=1 f(v2 m2010-12-14 16:43:25.321999 2=0+2) n(v149 rc2010-12-14 16:44:19.154780 b364585239 334=317+17) (inest sync->lock w=1 flushing) (ifile sync dirty) (iversion lock) | dirtyscattered lock dirfrag dirtyrstat authpin 0x7f3900836b88]
2010-12-21 11:11:49.620760 7f39016ac710 mds2.cache.ino(100000011a8) journaling updated frag accounted_ on [dir 100000011a8 /laptop_Diderich/paul - Copy (3)/ [2,head] auth pv=1117 v=1117 cv=0/0 state=1610743808 f(v2 m2010-12-14 16:43:25.321999 2=0+2) n(v149 rc2010-12-14 16:44:19.154780 b364585239 333=317+16) hs=0+0,ss=0+0 | dirty 0x1dcfd10]
2010-12-21 11:11:49.620776 7f39016ac710 mds2.cache.dir(100000011a8) pre_dirty 1118
2010-12-21 11:11:49.620785 7f39016ac710 mds2.cache.dir(100000011a8) auth_pin by 0x1ddda80 on [dir 100000011a8 /laptop_Diderich/paul - Copy (3)/ [2,head] auth pv=1118 v=1117 cv=0/0 ap=1+0+0 state=1610743808 f(v2 m2010-12-14 16:43:25.321999 2=0+2) n(v149 rc2010-12-14 16:44:19.154780 b364585239 333=317+16) hs=0+0,ss=0+0 | dirty authpin 0x1dcfd10] count now 1 + 0
2010-12-21 11:11:49.620803 7f39016ac710 mds2.cache.dir(100000004e9) adjust_nested_auth_pins 1/0 on [dir 100000004e9 /laptop_Diderich/ [2,head] auth pv=5515 v=0 cv=0/0 ap=1+1+2 state=1073741824 f(v5 m2010-12-14 16:43:24.754642 8=0+8) n(v533 rc2010-12-14 16:44:19.154780 b3698880281 1594=1519+75) hs=4+0,ss=0+0 | child authpin 0x1dcf700] count now 1 + 2
2010-12-21 11:11:49.620821 7f39016ac710 mds2.cache.dir(100000011a8) assimilate_dirty_rstat_inodes_finish
2010-12-21 11:11:49.620877 7f39016ac710 mds2.log submit_entry 9548453~1608 : EUpdate scatter_writebehind [metablob 1, 3 dirs]
2010-12-21 11:11:49.620935 7f39016ac710 -- 10.10.10.4:6800/521 --> osd0 10.10.10.45:6800/2723 -- osd_op(mds2.20:11 202.00000002 [write 1159845~1612] 1.49be) v1 -- ?+0 0x1e7bf50
2010-12-21 11:11:49.621058 7f39016ac710 mds2.locker eval done
2010-12-21 11:11:49.621083 7f39016ac710 mds2.cache show_subtrees
2010-12-21 11:11:49.621098 7f39016ac710 mds2.cache |_.__-2 auth [dir 1 / [2,head] auth v=0 cv=0/0 dir_auth=-2 ap=0+0+2 state=1073741824 f(v7 m2010-12-16 11:52:50.836681 14=1+13) n(v1905 rc2010-12-20 13:34:29.614973 b286014461157 3444=3322+122) hs=2+0,ss=0+0 | child subtree 0x1dc4730]
2010-12-21 11:11:49.621121 7f39016ac710 mds2.cache | |__ 2 auth [dir 10000000d2f /laptop_Diderich/paul/ [2,head] auth v=1832 cv=0/0 dir_auth=2 state=1610612738|complete f(v2 m2010-12-14 16:40:29.582273 2=0+2) n(v198 rc2010-12-14 16:41:53.197865 b435773446 633=606+27) hs=2+3,ss=0+0 dirty=2 | child subtree dirty 0x1eb5ee0]
2010-12-21 11:11:49.621141 7f39016ac710 mds2.cache | |__ 2 auth [dir 10000000fa8 /laptop_Diderich/paul - Copy/ [2,head] auth v=694 cv=0/0 dir_auth=2 state=1610612738|complete f(v2 m2010-12-14 16:41:53.793457 2=0+2) n(v94 rc2010-12-14 16:42:30.609457 b364419477 177=169+8) hs=2+3,ss=0+0 dirty=2 | child subtree dirty 0x1eb2850]
2010-12-21 11:11:49.621187 7f39016ac710 mds2.cache | |__ 2 auth [dir 1000000105b /laptop_Diderich/paul - Copy (2)/ [2,head] auth v=1379 cv=0/0 dir_auth=2 state=1610612738|complete f(v5 m2010-12-20 13:36:54.546394 3=0+3) n(v190 rc2010-12-20 13:37:17.717986 b1641792470 351=334+17)/n(v190 rc2010-12-20 13:37:02.082881 b703925435 347=330+17) hs=3+5,ss=0+0 dirty=4 | child subtree dirty 0x1eac140]
2010-12-21 11:11:49.621211 7f39016ac710 mds2.cache |____ 2 auth [dir 102 ~mds2/ [2,head] auth v=27 cv=0/0 dir_auth=2 state=1610612736 f(v0 2=1+1) n(v3 rc2010-12-20 13:36:40.419762 b103004862 11=8+3) hs=1+0,ss=0+0 dirty=1 | child subtree dirty 0x1dc4120]
2010-12-21 11:11:49.621254 7f39016ac710 mds2.log _replay 9437267~1579 / 9550065 : EExport 1000000105b [metablob 1, 3 dirs]
2010-12-21 11:11:49.621264 7f39016ac710 mds2.journal EExport.replay 1000000105b
2010-12-21 11:11:49.621281 7f39016ac710 mds2.journal EMetaBlob.replay 3 dirlumps
2010-12-21 11:11:49.621288 7f39016ac710 mds2.journal EMetaBlob.replay dir 1
2010-12-21 11:11:49.621297 7f39016ac710 mds2.journal EMetaBlob.replay updated dir [dir 1 / [2,head] auth v=0 cv=0/0 dir_auth=-2 ap=0+0+2 state=1073741824 f(v7 m2010-12-16 11:52:50.836681 14=1+13) n(v1905 rc2010-12-20 13:34:29.614973 b286014461157 3444=3322+122) hs=2+0,ss=0+0 | child subtree 0x1dc4730]
2010-12-21 11:11:49.621324 7f39016ac710 mds2.journal EMetaBlob.replay for [2,head] had [dentry #1/laptop_Diderich [2,head] auth (dversion lock) v=0 ap=0+1 inode=0x7f3900836300 | inodepin 0x7f3906b271c0]
2010-12-21 11:11:49.621340 7f39016ac710 mds2.journal EMetaBlob.replay for [2,head] had [inode 100000004e9 [...2,head] /laptop_Diderich/ auth v18639 f(v7 m2010-12-20 13:36:18.534761 8=0+8) n(v538 rc2010-12-20 13:36:40.419762 b3698880281 1595=1519+76)/n(v533 rc2010-12-14 16:44:19.154780 b3698880281 1595=1519+76) (inest sync dirty) (iversion lock) | dirtyscattered dirfrag 0x7f3900836300]
2010-12-21 11:11:49.621361 7f39016ac710 mds2.journal EMetaBlob.replay dir 100000004e9
mds/events/../CDir.h: In function 'void CDir::set_version(version_t)':
mds/events/../CDir.h:189: FAILED assert(projected_fnode.empty())
ceph version 0.23 (commit:5d1d8d0c4602be9819cc9f7aea562fccbb005a56)
1: (EMetaBlob::replay(MDS*, LogSegment*)+0x3b89) [0x667669]
2: (EExport::replay(MDS*)+0x41) [0x667c61]
3: (MDLog::_replay_thread()+0x10dd) [0x65b20d]
4: (MDLog::ReplayThread::entry()+0xd) [0x4c516d]
5: (Thread::_entry_func(void*)+0x7) [0x4a6c57]
6: (()+0x6cb0) [0x7f3906898cb0]
7: (clone()+0x6d) [0x7f3905aa59dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
  • Caught signal (ABRT) **
    ceph version 0.23 (commit:5d1d8d0c4602be9819cc9f7aea562fccbb005a56)
    1: (sigabrt_handler(int)+0x7d) [0x7134ad]
    2: (()+0x32610) [0x7f3905a09610]
    3: (gsignal()+0x35) [0x7f3905a09595]
    4: (abort()+0x186) [0x7f3905a0aa16]
    5: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f390628aded]
    6: (()+0xbd006) [0x7f3906289006]
    7: (()+0xbd033) [0x7f3906289033]
    8: (()+0xbd13e) [0x7f390628913e]
    9: (ceph::__ceph_assert_fail(char const
    , char const*, int, char const*)+0x448) [0x6ffba8]
    10: (EMetaBlob::replay(MDS*, LogSegment*)+0x3b89) [0x667669]
    11: (EExport::replay(MDS*)+0x41) [0x667c61]
    12: (MDLog::_replay_thread()+0x10dd) [0x65b20d]
    13: (MDLog::ReplayThread::entry()+0xd) [0x4c516d]
    14: (Thread::_entry_func(void*)+0x7) [0x4a6c57]
    15: (()+0x6cb0) [0x7f3906898cb0]
    16: (clone()+0x6d) [0x7f3905aa59dd]
Actions #3

Updated by Sage Weil over 13 years ago

Can you attach that output.txt please? We need to see what path it being taken that submits the journal item in teh first place.

Also, a patch like this should catch it (and similar bugs) when they happen):

 diff --git a/src/mds/MDLog.cc b/src/mds/MDLog.cc
index 75ddc1d..67c4c6e 100644
--- a/src/mds/MDLog.cc
+++ b/src/mds/MDLog.cc
@@ -153,6 +153,7 @@ void MDLog::append()

 void MDLog::submit_entry( LogEvent *le, Context *c, bool wait_safe ) 
 {
+  assert(!mds->is_replay());
   assert(le == cur_event);
   cur_event = NULL;

Actions #4

Updated by Sage Weil over 13 years ago

  • Target version set to v0.24.1
Actions #5

Updated by Sage Weil over 13 years ago

  • Priority changed from Normal to Urgent
Actions #6

Updated by Sage Weil over 13 years ago

  • Category set to 1
  • Assignee set to Sage Weil
Actions #7

Updated by Sage Weil over 13 years ago

added assert in commit:6c73da0a996421b108f19d7f1b56157cec281642 that will catch this sooner.

Actions #8

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.24.1 to v0.24.2
Actions #9

Updated by Sage Weil over 13 years ago

  • Status changed from New to 4
Actions #10

Updated by Alexander Rødseth over 13 years ago

The ceph test-cluster has since been replaced by MooseFS. I'm sorry that I can't provide any further useful feedback for now.

Actions #11

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.24.2 to v0.24.3
Actions #12

Updated by Sage Weil about 13 years ago

  • Status changed from 4 to Can't reproduce

Hopefully this will turn up later. The new asserts will let us catch it sooner and get a useful stack dump.

Actions #13

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.24.3)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.

Actions

Also available in: Atom PDF