Bug #1846: Mds crash immediately after start (segmentation fault) - Ceph - Ceph

Bug #1846

closed

Mds crash immediately after start (segmentation fault)

Added by Maciej Galkiewicz over 12 years ago. Updated over 12 years ago.

Status: Resolved

Priority: Normal

Target version: v0.40

Description

I have two MDS daemons in my configuration. One of them works fine and the other crashes immediately after a reboot:

2011-12-20 12:21:09.035502 7f0bce69e700 mds.-1.0 ms_handle_connect on 1.1.1.1:6789/0

*** Caught signal (Segmentation fault) **
 in thread 7f0bce69e700
 ceph version 0.39 (321ecdaba2ceeddb0789d8f4b7180a8ea5785d83)
 1: /usr/bin/ceph-mds() [0x7b80b9]
 2: (()+0xef60) [0x7f0bd2109f60]
 3: (std::string::assign(std::string const&)+0x6d) [0x7f0bd115cfbd]
 4: (MonMap::calc_ranks()+0x26a) [0x713e9a]
 5: (MonMap::decode(ceph::buffer::list::iterator&)+0x36b) [0x7143ab]
 6: (MonClient::handle_monmap(MMonMap*)+0x13b) [0x70c92b]
 7: (MonClient::ms_dispatch(Message*)+0x257) [0x70d197]
 8: (SimpleMessenger::dispatch_entry()+0x869) [0x74e4b9]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x498f9c]
 10: (()+0x68ba) [0x7f0bd21018ba]
 11: (clone()+0x6d) [0x7f0bd098a02d]

Files

monmap (474 Bytes)

Maciej Galkiewicz, 12/20/2011 08:38 AM

Updated by Maciej Galkiewicz over 12 years ago

The OSD on this machine crashes in the same way.

Updated by Sage Weil over 12 years ago

Target version set to v0.40

Can you run 'ceph mon getmap -o /tmp/monmap' and attach that file to this bug?

Updated by Maciej Galkiewicz over 12 years ago

File monmap added

I got the monmap from the machine with the working MDS because the other one does not have the admin key. I hope this is not a problem.

Updated by Maciej Galkiewicz over 12 years ago

Do you have any suggestions for a temporary workaround for this problem?

Updated by Sage Weil over 12 years ago

I looked at the attached monmap and didn't see anything odd. This is fully reproducible, I take it? That's good news.

Can you try running it through valgrind with some extra logging ('valgrind ceph-mds -i <id> -f --debug-ms 1 --debug-mds 20' or similar) and see what warnings come up? Also, if a core file is generated, a backtrace from gdb ('gdb /usr/bin/ceph-mds core', then 'bt') will tell us a bit more.

Thanks!
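
The valgrind and gdb steps requested in the comment above could be scripted roughly as follows. This is only a sketch: the helper names (run, collect_valgrind, collect_backtrace) and the DRY_RUN switch are made up for illustration, and the core file name depends on the system's core pattern. Only the valgrind, ceph-mds and gdb command lines come from the request; the -batch and -ex options are standard gdb options used here to print the backtrace non-interactively.

```shell
#!/bin/sh
# Sketch of the requested debugging steps. Helper names and DRY_RUN are
# hypothetical; by default the commands are only printed, not executed.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run the MDS in the foreground under valgrind with extra logging.
# $1 is the MDS id (the <id> placeholder in the request).
collect_valgrind() {
    run valgrind ceph-mds -i "$1" -f --debug-ms 1 --debug-mds 20
}

# Print a backtrace from a core file without an interactive gdb session.
# $1 is the path to the core file (assumed to exist in a real run).
collect_backtrace() {
    run gdb -batch -ex bt /usr/bin/ceph-mds "$1"
}

# Example (in a real run, 'ulimit -c unlimited' beforehand allows a core
# file to be written):
#   DRY_RUN=0 collect_valgrind <id>
#   DRY_RUN=0 collect_backtrace core
```

With DRY_RUN left at its default, each helper only echoes the command it would run, which makes it easy to check the arguments before starting the daemon.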

Updated by Maciej Galkiewicz over 12 years ago

# valgrind ceph-mds -i n3c1 -f --debug-ms 1 --debug-mds 20
==10225== Memcheck, a memory error detector
==10225== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==10225== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==10225== Command: ceph-mds -i n3c1 -f --debug-ms 1 --debug-mds 20
==10225== 
==10225== Invalid read of size 1
==10225==    at 0x5B3ECA0: base::VDSOSupport::ElfMemImage::Init(void const*) (in /usr/lib/libtcmalloc.so.0.0.0)
==10225==    by 0x5B3F2F2: base::VDSOSupport::Init() (in /usr/lib/libtcmalloc.so.0.0.0)
==10225==    by 0x5B40CB5: ??? (in /usr/lib/libtcmalloc.so.0.0.0)
==10225==    by 0x5B24012: ??? (in /usr/lib/libtcmalloc.so.0.0.0)
==10225==    by 0x7FF000837: ???
==10225==    by 0x3163336E00692CFF: ???
==10225==    by 0x65642D2D00662CFF: ???
==10225==    by 0x3100736D2D677561: ???
==10225==    by 0x67756265642D2CFF: ???
==10225==    by 0x30320073646D2C: ???
==10225==    by 0x4449475F4F445552: ???
==10225==    by 0x524553550030353C: ???
==10225==  Address 0x7fff9706a000 is not stack'd, malloc'd or (recently) free'd
==10225== 
==10225== 
==10225== Process terminating with default action of signal 11 (SIGSEGV)
==10225==  Access not within mapped region at address 0x7FFF9706A000
==10225==    at 0x5B3ECA0: base::VDSOSupport::ElfMemImage::Init(void const*) (in /usr/lib/libtcmalloc.so.0.0.0)
==10225==    by 0x5B3F2F2: base::VDSOSupport::Init() (in /usr/lib/libtcmalloc.so.0.0.0)
==10225==    by 0x5B40CB5: ??? (in /usr/lib/libtcmalloc.so.0.0.0)
==10225==    by 0x5B24012: ??? (in /usr/lib/libtcmalloc.so.0.0.0)
==10225==    by 0x7FF000837: ???
==10225==    by 0x3163336E00692CFF: ???
==10225==    by 0x65642D2D00662CFF: ???
==10225==    by 0x3100736D2D677561: ???
==10225==    by 0x67756265642D2CFF: ???
==10225==    by 0x30320073646D2C: ???
==10225==    by 0x4449475F4F445552: ???
==10225==    by 0x524553550030353C: ???
==10225==  If you believe this happened as a result of a stack
==10225==  overflow in your program's main thread (unlikely but
==10225==  possible), you can try to increase the size of the
==10225==  main thread stack using the --main-stacksize= flag.
==10225==  The main thread stack size used in this run was 8388608.
==10225== 
==10225== HEAP SUMMARY:
==10225==     in use at exit: 0 bytes in 0 blocks
==10225==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==10225== 
==10225== All heap blocks were freed -- no leaks are possible
==10225== 
==10225== For counts of detected and suppressed errors, rerun with: -v
==10225== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
Segmentation fault

Not sure if this is what you wanted:

# gdb --args ceph-mds -i n3c1 -f --debug-ms 1 --debug-mds 20
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" 
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/ceph-mds...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/ceph-mds -i n3c1 -f --debug-ms 1 --debug-mds 20
[Thread debugging using libthread_db enabled]
 ** WARNING: Ceph is still under development.  Any feedback can be directed  **
 **          at ceph-devel@vger.kernel.org or http://ceph.newdream.net/.     **
starting mds.n3c1 at 0.0.0.0:6800/11209
[New Thread 0x7ffff616a700 (LWP 11212)]
[New Thread 0x7ffff5969700 (LWP 11213)]
[New Thread 0x7ffff5168700 (LWP 11214)]
[New Thread 0x7ffff4967700 (LWP 11215)]
[New Thread 0x7ffff4166700 (LWP 11216)]
[New Thread 0x7ffff3965700 (LWP 11217)]
[New Thread 0x7ffff3164700 (LWP 11218)]
[New Thread 0x7ffff7fec700 (LWP 11219)]
[New Thread 0x7ffff2963700 (LWP 11220)]
[Thread 0x7ffff7fec700 (LWP 11219) exited]
[New Thread 0x7ffff7fec700 (LWP 11221)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff4166700 (LWP 11216)]
0x00007ffff6c24fbd in std::string::assign(std::string const&) () from /usr/lib/libstdc++.so.6
(gdb) core
No core file now.
(gdb) generate-core-file 
Saved corefile core.11209
(gdb) bt
#0  0x00007ffff6c24fbd in std::string::assign(std::string const&) () from /usr/lib/libstdc++.so.6
#1  0x0000000000713e9a in MonMap::calc_ranks() ()
#2  0x00000000007143ab in MonMap::decode(ceph::buffer::list::iterator&) ()
#3  0x000000000070c92b in MonClient::handle_monmap(MMonMap*) ()
#4  0x000000000070d197 in MonClient::ms_dispatch(Message*) ()
#5  0x000000000074e4b9 in SimpleMessenger::dispatch_entry() ()
#6  0x0000000000498f9c in SimpleMessenger::DispatchThread::entry() ()
#7  0x00007ffff7bc98ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#8  0x00007ffff645202d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9  0x0000000000000000 in ?? ()
