Bug #230

OSD crash when injecting new CRUSH map

Added by Wido den Hollander almost 14 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I wanted to take a few OSDs out because their performance was lacking; injecting the new CRUSH map to do so took 50% of the OSDs down.

The stack trace was the same on all of the crashed OSDs:

root@ceph04:~# gdb /usr/bin/cosd /core.ceph04.877 
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" 
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/cosd...Reading symbols from /usr/lib/debug/usr/bin/cosd...done.
done.
[New Thread 895]
[New Thread 897]
[New Thread 909]
[New Thread 900]
[New Thread 905]
[New Thread 901]
[New Thread 929]
[New Thread 925]
[New Thread 906]
[New Thread 903]
[New Thread 877]
[New Thread 943]
[New Thread 920]
[New Thread 907]
[New Thread 938]
[New Thread 891]
[New Thread 928]
[New Thread 924]
[New Thread 917]
[New Thread 926]
[New Thread 933]
[New Thread 932]
[New Thread 936]
[New Thread 878]
[New Thread 893]
[New Thread 921]
[New Thread 888]
[New Thread 887]
[New Thread 927]
[New Thread 908]
[New Thread 918]
[New Thread 886]
[New Thread 879]
[New Thread 942]
[New Thread 941]
[New Thread 923]
[New Thread 937]
[New Thread 930]
[New Thread 934]
[New Thread 940]
[New Thread 902]
[New Thread 944]
[New Thread 892]
[New Thread 894]
[New Thread 899]
[New Thread 898]
[New Thread 904]
[New Thread 889]
[New Thread 890]
[New Thread 896]

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libz.so.1
Core was generated by `/usr/bin/cosd -i 0 -c /tmp/ceph.conf.12074'.
Program terminated with signal 6, Aborted.
#0  0x00007fb0185c2a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fb0185c2a75 in raise () from /lib/libc.so.6
#1  0x00007fb0185c65c0 in abort () from /lib/libc.so.6
#2  0x00007fb018e778e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fb018e75d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fb018e75d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fb018e75e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007fb018e11737 in std::__throw_length_error(char const*) () from /usr/lib/libstdc++.so.6
#7  0x00000000004a43df in std::vector<int, std::allocator<int> >::_M_fill_insert(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, unsigned long, int const&) ()
#8  0x0000000000506fdf in std::vector<int, std::allocator<int> >::insert (this=<value optimized out>, 
    rule=<value optimized out>, x=<value optimized out>, out=..., maxout=<value optimized out>, 
    forcefeed=<value optimized out>, weight=...) at /usr/include/c++/4.4/bits/stl_vector.h:851
#9  std::vector<int, std::allocator<int> >::resize (this=<value optimized out>, rule=<value optimized out>, 
    x=<value optimized out>, out=..., maxout=<value optimized out>, forcefeed=<value optimized out>, weight=...)
    at /usr/include/c++/4.4/bits/stl_vector.h:557
#10 CrushWrapper::do_rule (this=<value optimized out>, rule=<value optimized out>, x=<value optimized out>, out=..., 
    maxout=<value optimized out>, forcefeed=<value optimized out>, weight=...) at ./crush/CrushWrapper.h:339
#11 0x000000000050714b in OSDMap::pg_to_osds(pg_t, std::vector<int, std::allocator<int> >&) ()
#12 0x00000000004d5d81 in OSDMap::pg_to_up_acting_osds (this=0xf925c0, t=<value optimized out>) at osd/OSDMap.h:856
#13 OSD::advance_map (this=0xf925c0, t=<value optimized out>) at osd/OSD.cc:2397
#14 0x00000000004e57cc in OSD::handle_osd_map (this=0xf925c0, m=<value optimized out>) at osd/OSD.cc:2263
#15 0x00000000004e6b50 in OSD::_dispatch (this=0xf925c0, m=0x34e9310) at osd/OSD.cc:1837
#16 0x00000000004e7539 in OSD::ms_dispatch (this=0xf925c0, m=0x34e9310) at osd/OSD.cc:1728
#17 0x0000000000460769 in Messenger::ms_deliver_dispatch (this=<value optimized out>) at msg/Messenger.h:97
#18 SimpleMessenger::dispatch_entry (this=<value optimized out>) at msg/SimpleMessenger.cc:332
#19 0x00000000004567cc in SimpleMessenger::DispatchThread::entry (this=0xf86830) at msg/SimpleMessenger.h:497
#20 0x0000000000469a4a in Thread::_entry_func (arg=0x36d) at ./common/Thread.h:39
#21 0x00007fb0194559ca in start_thread () from /lib/libpthread.so.0
#22 0x00007fb0186756cd in clone () from /lib/libc.so.6
#23 0x0000000000000000 in ?? ()
(gdb) 
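Frame #9 shows `std::vector<int>::resize()` being called from `CrushWrapper::do_rule`. Frames #8 to #6 show it throwing `std::length_error`, which is what `resize()` does when the requested size exceeds `max_size()`. Nothing caught the exception, so `std::terminate()` called `abort()` (signal 6). The following is a minimal sketch of one way to hit that path. It assumes the size came from a negative `int` return value. `resize_to_count` and `throws_length_error` are hypothetical helpers, not Ceph code:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: sizes `out` from an int count, the way a wrapper
// might size its result from a C function's return value.
void resize_to_count(std::vector<int>& out, int count) {
  // A negative int converts to a huge std::size_t (-1 becomes SIZE_MAX),
  // which exceeds max_size(), so resize() throws std::length_error.
  out.resize(static_cast<std::size_t>(count));
}

// Returns true if resizing an empty vector to `count` throws std::length_error.
bool throws_length_error(int count) {
  std::vector<int> out;
  try {
    resize_to_count(out, count);
  } catch (const std::length_error&) {
    return true;
  }
  return false;
}
```

In the OSD, no `catch` existed anywhere on the dispatch thread. The exception therefore went straight to `std::terminate()`, which is why the core shows `abort()` rather than a handled error.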

The new CRUSH map:

# begin crush map

# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7

# types
type 0 device
type 1 domain
type 2 pool

# buckets
domain root {
    id -1        # do not change unnecessarily
    alg straw
    hash 0    # rjenkins1
    item device0 weight 1.000
    item device1 weight 1.000
    item device4 weight 1.000
    item device7 weight 1.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take root
    step choose firstn 0 type device
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take root
    step choose firstn 0 type device
    step emit
}
rule casdata {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take root
    step choose firstn 0 type device
    step emit
}
rule rbd {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take root
    step choose firstn 0 type device
    step emit
}

# end crush map
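In this map, all eight devices are still declared under `# devices`. However, only device0, device1, device4 and device7 are items of the `root` bucket, so no bucket leads to devices 2, 3, 5 and 6. The following is a minimal sketch of that check. The device ids and bucket items are copied by hand from the map above rather than parsed, and `devices_without_parent` is a hypothetical name:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical check: returns the declared device ids that are not an item
// of any bucket, i.e. devices that no rule can reach through the hierarchy.
std::vector<int> devices_without_parent(
    const std::vector<int>& declared,
    const std::vector<std::vector<int>>& bucket_items) {
  std::set<int> placed;
  for (const auto& items : bucket_items) {
    placed.insert(items.begin(), items.end());
  }
  std::vector<int> orphans;
  for (int dev : declared) {
    if (placed.count(dev) == 0) {
      orphans.push_back(dev);  // declared, but no bucket contains it
    }
  }
  return orphans;
}
```

For this map, calling it with declared ids 0 to 7 and the single bucket `{0, 1, 4, 7}` returns `{2, 3, 5, 6}`.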

The logs of two OSDs:

History

#1 Updated by Sage Weil almost 14 years ago

  • Status changed from New to Resolved

This was a problem with the CrushWrapper error handling: the error was caused by a forcefed device that no longer exists in your CRUSH map. Fixed by commit 8f2731bc02ef39bf533ddf17fc514d3cf9193dad.
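The resolution suggests that the mapper reports a missing forcefed device as an error, and that the wrapper has to check for that error before sizing its output vector. The following is a minimal sketch of such a guard. It assumes a negative return value means error. The `-1` value and the names `fake_do_rule` and `do_rule_checked` are hypothetical. Clamping to an empty result is one plausible fix and is not necessarily what the referenced commit does:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for the C mapper. Returns -1 if `forcefeed` names a
// device that no bucket contains; otherwise writes up to `maxout` placed
// devices into `rawout` and returns how many it wrote. The mapping itself
// is deliberately trivial: it takes devices in ascending id order.
int fake_do_rule(const std::set<int>& placed, int forcefeed,
                 int* rawout, int maxout) {
  if (forcefeed >= 0 && placed.count(forcefeed) == 0) {
    return -1;  // assumed error convention: forcefed device does not exist
  }
  int n = 0;
  for (int dev : placed) {
    if (n == maxout) break;
    rawout[n++] = dev;
  }
  return n;
}

// Wrapper that validates the mapper's return value before resizing `out`.
void do_rule_checked(const std::set<int>& placed, int forcefeed,
                     std::vector<int>& out, int maxout) {
  std::vector<int> raw(static_cast<std::size_t>(maxout));
  int numrep = fake_do_rule(placed, forcefeed, raw.data(), maxout);
  if (numrep < 0) {
    numrep = 0;  // on error, return an empty mapping instead of a huge resize
  }
  assert(numrep <= maxout);
  out.resize(static_cast<std::size_t>(numrep));
  for (int i = 0; i < numrep; ++i) {
    out[static_cast<std::size_t>(i)] = raw[static_cast<std::size_t>(i)];
  }
}
```

Consider the devices placed in this bug's map (`{0, 1, 4, 7}`). With the guard in place, forcefeeding device 2 yields an empty result rather than an uncaught `std::length_error`.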