Bug #1104

closed

Segmentation fault when deleting a folder

Added by Bernard Grymonpon almost 13 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%


Description

Got this after removing a just-created folder:

2011-05-20 18:19:09.679553 7f8254c89700 mds0.18 handle_mds_map i am now mds0.18
2011-05-20 18:19:09.679572 7f8254c89700 mds0.18 handle_mds_map state change up:rejoin --> up:active
2011-05-20 18:19:09.679577 7f8254c89700 mds0.18 recovery_done -- successful recovery!
2011-05-20 18:19:09.679907 7f8254c89700 mds0.18 active_start
2011-05-20 18:19:09.682956 7f8254c89700 mds0.18 cluster recovered.
*** Caught signal (Segmentation fault) **
in thread 0x7f8254c89700
ceph version 0.28-112-g6f8708b (commit:6f8708baec1999b1bc0bad3ad5c6130d7e0d3e1d)
1: /usr/bin/cmds() [0x6f8792]
2: (()+0xef60) [0x7f82572e6f60]
3: (MDCache::get_or_create_stray_dentry(CInode*)+0x25) [0x5273e5]
4: (Server::handle_client_unlink(MDRequest*)+0x997) [0x4ff3e7]
5: (Server::handle_client_request(MClientRequest*)+0x543) [0x5090b3]
6: (MDS::handle_deferrable_message(Message*)+0x99f) [0x49448f]
7: (MDS::_dispatch(Message*)+0x144a) [0x4a581a]
8: (MDS::ms_dispatch(Message*)+0x57) [0x4a5fd7]
9: (SimpleMessenger::dispatch_entry()+0x7da) [0x6d00ba]
10: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x484f6c]
11: (()+0x68ba) [0x7f82572de8ba]
12: (clone()+0x6d) [0x7f8255f7302d]

More info:

Setup: a 3-node Ceph test cluster that was running 0.26-something and was upgraded today to the latest master branch (git pull got me up to 6f8708baec1999b1bc0bad3ad5c6130d7e0d3e1d; I then made Debian packages, replaced all packages on all cluster nodes, and restarted Ceph everywhere after changing the config file to include the "." in the names).

All nodes run 3 OSDs and one mon, and the first two nodes each run an MDS.

On a client I mounted the ceph filesystem, made a folder and then removed it:

root@dhcp114:~# mount -t ceph ceph-001.om:/ /mnt/
root@dhcp114:~# cd /mnt
root@dhcp114:/mnt# ls
bonnie f1 f2 f3 f4 f5 foo
root@dhcp114:/mnt# mkdir test
root@dhcp114:/mnt# rmdir test

This hung. On the Ceph cluster, I got the segmentation fault on all MDSs.


Files

mds.0.log.bz2 (14.3 MB), Bernard Grymonpon, 05/20/2011 09:50 AM
cmds.bz2 (14 MB), Sage Weil, 05/24/2011 10:15 AM
Actions #1

Updated by Bernard Grymonpon almost 13 years ago

Logfile from the first mds, as asked:

18:25 < sage> great. add
18:25 < sage> debug mds = 20
18:25 < sage> debug ms = 1
18:25 < sage> to your [mds] section and then reproduce!
18:25 < sage> and attach the log to the bug

Actions #2

Updated by Sage Weil almost 13 years ago

  • Category set to 1
  • Target version set to v0.29
Actions #3

Updated by Fyodor Ustinov almost 13 years ago

I cannot attach files to this issue.

http://blog.ufm.su/core.zip - core file
http://blog.ufm.su/mds.zip - log file

core backtrace:

Core was generated by `/usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f66ab9a8b3b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
42      ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
        in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
(gdb) bt
#0  0x00007f66ab9a8b3b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000000711dd3 in ?? ()
#2  0x0000000000712e7b in ?? ()
#3  <signal handler called>
#4  0x00000000005356f5 in MDCache::get_or_create_stray_dentry(CInode*) ()
#5  0x0000000000508857 in Server::handle_client_unlink(MDRequest*) ()
#6  0x0000000000520852 in Server::handle_client_request(MClientRequest*) ()
#7  0x00000000004a266f in MDS::handle_deferrable_message(Message*) ()
#8  0x00000000004b617e in MDS::_dispatch(Message*) ()
#9  0x00000000004b66c9 in MDS::ms_dispatch(Message*) ()
#10 0x00000000004838aa in SimpleMessenger::dispatch_entry() ()
#11 0x000000000047b26c in SimpleMessenger::DispatchThread::entry() ()
#12 0x00007f66ab99fd8c in start_thread (arg=0x7f66a9769700) at pthread_create.c:304
#13 0x00007f66aa85204d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()
(gdb) 

Actions #4

Updated by Sage Weil almost 13 years ago

  • Assignee set to Sage Weil
Actions #5

Updated by Sage Weil almost 13 years ago

  • Status changed from New to 4

Can you try with this patch applied?


diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 521be12..9a51530 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -127,6 +127,7 @@ MDCache::MDCache(MDS *m)
   root = NULL;
   myin = NULL;

+  stray_index = 0;
   for (int i = 0; i < NUM_STRAY; ++i) {
     strays[i] = NULL;
   }
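
The one-line patch suggests that stray_index was being read before it had ever been assigned. The following is only a minimal sketch of that class of bug, not the actual Ceph code: it assumes the crashing function uses stray_index to select an entry of strays (the thread does not show that code), and Cache, Inode, current_stray, set_stray and the value of NUM_STRAY are hypothetical stand-ins.

```cpp
#include <cassert>

// Hypothetical stand-in; the real CInode is far larger.
struct Inode { int ino; };

// Assumed value, for illustration only.
constexpr int NUM_STRAY = 10;

// Hypothetical, heavily reduced stand-in for MDCache.
class Cache {
public:
  Cache() {
    // The fix from the patch: without this assignment stray_index holds an
    // indeterminate value, so strays[stray_index] may read far outside the
    // array and yield a garbage pointer.
    stray_index = 0;
    for (int i = 0; i < NUM_STRAY; ++i) {
      strays[i] = nullptr;
    }
  }

  // Returns the currently selected stray slot. The assert documents the
  // precondition that the uninitialized index violated.
  Inode *current_stray() const {
    assert(stray_index >= 0 && stray_index < NUM_STRAY);
    return strays[stray_index];
  }

  void set_stray(int i, Inode *in) {
    assert(i >= 0 && i < NUM_STRAY);
    strays[i] = in;
  }

  int index() const { return stray_index; }

private:
  Inode *strays[NUM_STRAY];
  int stray_index;
};
```

With the assignment in place, a freshly constructed Cache always reports index() == 0, and current_stray() returns whatever was stored in slot 0. Dereferencing the result of an out-of-range lookup would be consistent with a segmentation fault inside get_or_create_stray_dentry, although the thread does not confirm the exact mechanism.
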
Actions #6

Updated by Fyodor Ustinov almost 13 years ago

Sage Weil wrote:

Can you try with this patch applied?

Is that in 0.28.1, or should I compile the master branch?

Actions #7

Updated by Sage Weil almost 13 years ago

the 'stable' branch has that fix, or you can apply it manually...

Actions #8

Updated by Fyodor Ustinov almost 13 years ago

Sage Weil wrote:

the 'stable' branch has that fix, or you can apply it manually...

The 0.28.1 published in your repository does not fix this issue.

Sage, I realize this is a bold request, but I do not have an environment for compiling Ceph from source. As I understand it, this patch only requires recompiling cmds? Can you send me a compiled cmds with this patch?

Actions #9

Updated by Sage Weil almost 13 years ago

Attached! You may have problems if your libraries don't match mine. There are also the autobuilt debian packages that should be finished building shortly.

Actions #10

Updated by Fyodor Ustinov almost 13 years ago

A build compiled from the latest master sources (sorry, I forgot to switch to the stable branch) does not have this problem. Hooray? Maybe it makes sense to release 0.28.2?

Actions #11

Updated by Sage Weil almost 13 years ago

  • Status changed from 4 to Resolved

Yay! Thanks for your help testing. We'll do 0.28.2 in a few days.

Actions #12

Updated by Bernard Grymonpon almost 13 years ago

Tried the stable branch (I'm at ce04e3dbaf2383a521b267585a860f772c4cc786), made Debian packages, installed everything, and it still crashes (although somewhere else).

2011-05-24 20:23:18.315343 7f11ae9f5700 mds0.cache creating system inode with ino:100
2011-05-24 20:23:18.315470 7f11ae9f5700 mds0.cache creating system inode with ino:1
2011-05-24 20:23:18.316507 7f11ae9f5700 mds0.27 ms_handle_connect on 10.1.10.180:6807/25945
2011-05-24 20:23:18.719931 7f11ae9f5700 mds0.27 ms_handle_connect on 10.1.10.182:6806/2430
2011-05-24 20:23:19.147062 7f11ae9f5700 mds0.27 ms_handle_connect on 10.1.10.182:6800/2240
2011-05-24 20:23:19.881867 7f11abae2700 mds0.27 replay_done
2011-05-24 20:23:19.881909 7f11abae2700 mds0.27 making mds journal writeable
osdc/Journaler.cc: In function 'void Journaler::_prezeroed(int, uint64_t, uint64_t)', in thread '0x7f11ae9f5700'
osdc/Journaler.cc: 649: FAILED assert(r == 0)
ceph version 0.28.1-3-gce04e3d (commit:ce04e3dbaf2383a521b267585a860f772c4cc786)
1: (Journaler::_prezeroed(int, unsigned long, unsigned long)+0x6bb) [0x6be4ab]
2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x981) [0x69cf11]
3: (MDS::handle_core_message(Message*)+0x7cf) [0x4bc5ef]
4: (MDS::_dispatch(Message*)+0x282) [0x4bc8f2]
5: (MDS::ms_dispatch(Message*)+0x57) [0x4be2f7]
6: (SimpleMessenger::dispatch_entry()+0x7da) [0x48d3aa]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x484f3c]
8: (()+0x68ba) [0x7f11b104a8ba]
9: (clone()+0x6d) [0x7f11afcdf02d]
ceph version 0.28.1-3-gce04e3d (commit:ce04e3dbaf2383a521b267585a860f772c4cc786)
1: (Journaler::_prezeroed(int, unsigned long, unsigned long)+0x6bb) [0x6be4ab]
2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x981) [0x69cf11]
3: (MDS::handle_core_message(Message*)+0x7cf) [0x4bc5ef]
4: (MDS::_dispatch(Message*)+0x282) [0x4bc8f2]
5: (MDS::ms_dispatch(Message*)+0x57) [0x4be2f7]
6: (SimpleMessenger::dispatch_entry()+0x7da) [0x48d3aa]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x484f3c]
8: (()+0x68ba) [0x7f11b104a8ba]
9: (clone()+0x6d) [0x7f11afcdf02d]
*** Caught signal (Aborted) **
in thread 0x7f11ae9f5700
ceph version 0.28.1-3-gce04e3d (commit:ce04e3dbaf2383a521b267585a860f772c4cc786)
1: /usr/bin/cmds() [0x71cb82]
2: (()+0xef60) [0x7f11b1052f60]
3: (gsignal()+0x35) [0x7f11afc42165]
4: (abort()+0x180) [0x7f11afc44f70]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f11b04d5dc5]
6: (()+0xcb166) [0x7f11b04d4166]
7: (()+0xcb193) [0x7f11b04d4193]
8: (()+0xcb28e) [0x7f11b04d428e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x700e53]
10: (Journaler::_prezeroed(int, unsigned long, unsigned long)+0x6bb) [0x6be4ab]
11: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x981) [0x69cf11]
12: (MDS::handle_core_message(Message*)+0x7cf) [0x4bc5ef]
13: (MDS::_dispatch(Message*)+0x282) [0x4bc8f2]
14: (MDS::ms_dispatch(Message*)+0x57) [0x4be2f7]
15: (SimpleMessenger::dispatch_entry()+0x7da) [0x48d3aa]
16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x484f3c]
17: (()+0x68ba) [0x7f11b104a8ba]
18: (clone()+0x6d) [0x7f11afcdf02d]

If you need more logs (with higher debug levels), let me know!

Actions #13

Updated by Sage Weil almost 13 years ago

Can you check with gdb to see what the value of 'r' actually is?

Actions #14

Updated by Bernard Grymonpon almost 13 years ago

I'll have to rebuild everything; "r" is optimized out in my build. This will take a little longer...

#6 0x0000000000700e53 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>,
line=<value optimized out>, func=0x768f60 "void Journaler::_prezeroed(int, uint64_t, uint64_t)") at common/assert.cc:86
#7 0x00000000006be4ab in Journaler::_prezeroed (this=0xca2340, r=<value optimized out>, start=176160768, len=<value optimized out>)
at osdc/Journaler.cc:649
#8 0x000000000069cf11 in Objecter::handle_osd_op_reply (this=0xc8b240, m=0x405a000) at osdc/Objecter.cc:799
#9 0x00000000004bc5ef in MDS::handle_core_message (this=0xc97a00, m=0x405a000) at mds/MDS.cc:1677

Actions #15

Updated by Bernard Grymonpon almost 13 years ago

There we go:

[Switching to Thread 0x7ffff5574700 (LWP 27162)]
0x00007ffff67c1165 in raise () from /lib/libc.so.6
(gdb) bt
#0 0x00007ffff67c1165 in raise () from /lib/libc.so.6
#1 0x00007ffff67c3f70 in abort () from /lib/libc.so.6
#2 0x00007ffff7054dc5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007ffff7053166 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007ffff7053193 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007ffff705328e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x0000000000a5844a in ceph::__ceph_assert_fail (assertion=0xac9edf "r == 0", file=0xaca09b "osdc/Journaler.cc", line=649,
func=0xacb3a0 "void Journaler::_prezeroed(int, uint64_t, uint64_t)") at common/assert.cc:86
#7 0x0000000000a0113d in Journaler::_prezeroed (this=0x10ec340, r=-2, start=180355072, len=4194304) at osdc/Journaler.cc:649
#8 0x0000000000a05037 in C_Journaler_Prezero::finish (this=0x695bce0, r=-2) at osdc/Journaler.cc:602
#9 0x00000000009c2d77 in Objecter::handle_osd_op_reply (this=0x10d5240, m=0x1116a80) at osdc/Objecter.cc:799
#10 0x000000000077a4f3 in MDS::handle_core_message (this=0x10e1a00, m=0x1116a80) at mds/MDS.cc:1677
#11 0x000000000077bc41 in MDS::_dispatch (this=0x10e1a00, m=0x1116a80) at mds/MDS.cc:1792
#12 0x0000000000779a7e in MDS::ms_dispatch (this=0x10e1a00, m=0x1116a80) at mds/MDS.cc:1613
#13 0x000000000074dbfd in Messenger::ms_deliver_dispatch (this=0x10e1000, m=0x1116a80) at msg/Messenger.h:98
#14 0x000000000073abe9 in SimpleMessenger::dispatch_entry (this=0x10e1000) at msg/SimpleMessenger.cc:353
#15 0x0000000000730c92 in SimpleMessenger::DispatchThread::entry (this=0x10e1488) at ./msg/SimpleMessenger.h:544
#16 0x000000000074cadf in Thread::_entry_func (arg=0x10e1488) at ./common/Thread.h:41
#17 0x00007ffff7bc98ba in start_thread () from /lib/libpthread.so.0
#18 0x00007ffff685e02d in clone () from /lib/libc.so.6
#19 0x0000000000000000 in ?? ()
(gdb)

r seems to be -2. Let me know if you need anything else.
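
For readers decoding the value: assuming the reply code follows the common "0 on success, negative errno on failure" convention (an assumption, the thread does not state it), r = -2 would correspond to -ENOENT on Linux, i.e. the object being operated on did not exist. A small sketch of that decoding, where describe_rc is a hypothetical helper and not part of Ceph:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turn a "0 or -errno" style return code into text.
// Assumes negative values are negated errno numbers.
std::string describe_rc(int r) {
  if (r == 0) {
    return "success";
  }
  if (r < 0) {
    // std::strerror expects the positive errno value.
    return std::strerror(-r);
  }
  return "unexpected positive value";
}
```

Calling describe_rc(-2) on Linux returns the system's message for ENOENT, which would explain why assert(r == 0) fired in Journaler::_prezeroed.
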

Actions #16

Updated by Sage Weil almost 13 years ago

Cherry-picked commit:7330c3c473aa128b1e3ecb8752278f655bc79620 to stable. I'm a bit surprised you're seeing this on the stable branch (it shouldn't have come up without commit:d2243e822142b319d7d99865b0ea9733dfa73cdd). Maybe you're running the master branch on the OSDs?

Actions #17

Updated by Bernard Grymonpon almost 13 years ago

I'll try it first thing tomorrow, no more access to the machines now - everything is always updated completely on all machines (osds, mds, cos...). I'm pretty sure I took the stable branch; I'll double check if 7330c3c473aa128b1e3ecb8752278f655bc79620 is in. Feel free to give me some homework I could try :-).

Actions #18

Updated by Bernard Grymonpon almost 13 years ago

Fixed!

Pulled in the latest changes, recompiled, and works like a charm now.

Actions #19

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.29)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
