Bug #230
OSD crash when injecting new CRUSH map
Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor
Description
I wanted to take a few OSDs out because their performance was poor, but doing so took 50% of the OSDs down.
The stack trace was the same on all of the crashed OSDs:
root@ceph04:~# gdb /usr/bin/cosd /core.ceph04.877
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/cosd...Reading symbols from /usr/lib/debug/usr/bin/cosd...done.
done.
[New Thread 895]
[New Thread 897]
[New Thread 909]
[New Thread 900]
[New Thread 905]
[New Thread 901]
[New Thread 929]
[New Thread 925]
[New Thread 906]
[New Thread 903]
[New Thread 877]
[New Thread 943]
[New Thread 920]
[New Thread 907]
[New Thread 938]
[New Thread 891]
[New Thread 928]
[New Thread 924]
[New Thread 917]
[New Thread 926]
[New Thread 933]
[New Thread 932]
[New Thread 936]
[New Thread 878]
[New Thread 893]
[New Thread 921]
[New Thread 888]
[New Thread 887]
[New Thread 927]
[New Thread 908]
[New Thread 918]
[New Thread 886]
[New Thread 879]
[New Thread 942]
[New Thread 941]
[New Thread 923]
[New Thread 937]
[New Thread 930]
[New Thread 934]
[New Thread 940]
[New Thread 902]
[New Thread 944]
[New Thread 892]
[New Thread 894]
[New Thread 899]
[New Thread 898]
[New Thread 904]
[New Thread 889]
[New Thread 890]
[New Thread 896]
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libz.so.1
Core was generated by `/usr/bin/cosd -i 0 -c /tmp/ceph.conf.12074'.
Program terminated with signal 6, Aborted.
#0 0x00007fb0185c2a75 in raise () from /lib/libc.so.6
(gdb) bt
#0 0x00007fb0185c2a75 in raise () from /lib/libc.so.6
#1 0x00007fb0185c65c0 in abort () from /lib/libc.so.6
#2 0x00007fb018e778e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007fb018e75d16 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007fb018e75d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007fb018e75e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00007fb018e11737 in std::__throw_length_error(char const*) () from /usr/lib/libstdc++.so.6
#7 0x00000000004a43df in std::vector<int, std::allocator<int> >::_M_fill_insert(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, unsigned long, int const&) ()
#8 0x0000000000506fdf in std::vector<int, std::allocator<int> >::insert (this=<value optimized out>, rule=<value optimized out>, x=<value optimized out>, out=..., maxout=<value optimized out>, forcefeed=<value optimized out>, weight=...) at /usr/include/c++/4.4/bits/stl_vector.h:851
#9 std::vector<int, std::allocator<int> >::resize (this=<value optimized out>, rule=<value optimized out>, x=<value optimized out>, out=..., maxout=<value optimized out>, forcefeed=<value optimized out>, weight=...) at /usr/include/c++/4.4/bits/stl_vector.h:557
#10 CrushWrapper::do_rule (this=<value optimized out>, rule=<value optimized out>, x=<value optimized out>, out=..., maxout=<value optimized out>, forcefeed=<value optimized out>, weight=...) at ./crush/CrushWrapper.h:339
#11 0x000000000050714b in OSDMap::pg_to_osds(pg_t, std::vector<int, std::allocator<int> >&) ()
#12 0x00000000004d5d81 in OSDMap::pg_to_up_acting_osds (this=0xf925c0, t=<value optimized out>) at osd/OSDMap.h:856
#13 OSD::advance_map (this=0xf925c0, t=<value optimized out>) at osd/OSD.cc:2397
#14 0x00000000004e57cc in OSD::handle_osd_map (this=0xf925c0, m=<value optimized out>) at osd/OSD.cc:2263
#15 0x00000000004e6b50 in OSD::_dispatch (this=0xf925c0, m=0x34e9310) at osd/OSD.cc:1837
#16 0x00000000004e7539 in OSD::ms_dispatch (this=0xf925c0, m=0x34e9310) at osd/OSD.cc:1728
#17 0x0000000000460769 in Messenger::ms_deliver_dispatch (this=<value optimized out>) at msg/Messenger.h:97
#18 SimpleMessenger::dispatch_entry (this=<value optimized out>) at msg/SimpleMessenger.cc:332
#19 0x00000000004567cc in SimpleMessenger::DispatchThread::entry (this=0xf86830) at msg/SimpleMessenger.h:497
#20 0x0000000000469a4a in Thread::_entry_func (arg=0x36d) at ./common/Thread.h:39
#21 0x00007fb0194559ca in start_thread () from /lib/libpthread.so.0
#22 0x00007fb0186756cd in clone () from /lib/libc.so.6
#23 0x0000000000000000 in ?? ()
(gdb)
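Frames #6 to #10 show CrushWrapper::do_rule calling std::vector<int>::resize, which ends in std::__throw_length_error; nothing catches the exception, so std::terminate calls abort (signal 6). One plausible way to reach that path, assumed here rather than confirmed by the trace, is a negative element count: converted to std::size_t it is far larger than max_size(). The minimal C++ sketch below (the function name resize_throws_length_error is hypothetical) shows that resize rejects such a count with std::length_error before allocating anything:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if resizing an empty std::vector<int> to `count` elements throws
// std::length_error. A negative int converts to a huge std::size_t (-1 becomes
// SIZE_MAX), which exceeds max_size(), so resize() refuses it up front.
bool resize_throws_length_error(int count) {
    std::vector<int> out;
    try {
        out.resize(static_cast<std::size_t>(count));
    } catch (const std::length_error&) {
        return true;  // the same exception type as frame #6 in the backtrace
    }
    // A resize that did not throw must have produced the requested size.
    assert(out.size() == static_cast<std::size_t>(count));
    return false;
}
```

For example, resize_throws_length_error(-1) returns true with libstdc++ (the library in the trace), while resize_throws_length_error(4) returns false.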
The new CRUSH map:
# begin crush map

# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7

# types
type 0 device
type 1 domain
type 2 pool

# buckets
domain root {
	id -1		# do not change unnecessarily
	alg straw
	hash 0	# rjenkins1
	item device0 weight 1.000
	item device1 weight 1.000
	item device4 weight 1.000
	item device7 weight 1.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take root
	step choose firstn 0 type device
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take root
	step choose firstn 0 type device
	step emit
}
rule casdata {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take root
	step choose firstn 0 type device
	step emit
}
rule rbd {
	ruleset 3
	type replicated
	min_size 1
	max_size 10
	step take root
	step choose firstn 0 type device
	step emit
}

# end crush map

The logs of two OSDs:
History
#1 Updated by Sage Weil almost 14 years ago
- Status changed from New to Resolved
This was a problem with the CrushWrapper error handling (the error was caused by a force-fed device that no longer exists in your CRUSH map). Fixed by 8f2731bc02ef39bf533ddf17fc514d3cf9193dad.
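The explanation above suggests the wrapper passed an error result from the mapping code straight into its output vector. The sketch below is a hypothetical illustration of the defensive pattern, not the code from commit 8f2731bc: hypothetical_map stands in for the underlying mapping routine and is assumed to return a negative value when the force-fed device is missing, and do_rule_checked is an invented wrapper that checks that value before building the output vector. Placement of the force-fed device and the real CRUSH selection logic are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the mapping routine: copies up to `maxout` device
// ids into `rawout` and returns how many were written, or -1 if `forcefeed`
// (when >= 0) is not one of `existing_devices`. Assumption for illustration.
int hypothetical_map(const std::vector<int>& existing_devices, int forcefeed,
                     int* rawout, int maxout) {
    bool found = false;
    for (int d : existing_devices) found = found || (d == forcefeed);
    if (forcefeed >= 0 && !found) return -1;  // force-fed device does not exist
    int n = 0;
    for (int d : existing_devices) {
        if (n == maxout) break;
        rawout[n++] = d;
    }
    return n;
}

// Hypothetical guarded wrapper: a negative result is returned to the caller
// and `out` is left empty, so an error code never reaches the vector as a
// huge size_t.
int do_rule_checked(const std::vector<int>& existing_devices, int forcefeed,
                    std::vector<int>& out, int maxout) {
    assert(maxout >= 0);
    std::vector<int> rawout(static_cast<std::size_t>(maxout));
    int numrep = hypothetical_map(existing_devices, forcefeed, rawout.data(), maxout);
    out.clear();
    if (numrep < 0) return numrep;  // propagate the error instead of resizing
    out.assign(rawout.begin(), rawout.begin() + numrep);
    return numrep;
}
```

With the devices left in the reporter's root bucket (0, 1, 4 and 7), do_rule_checked({0, 1, 4, 7}, 3, out, 2) returns -1 and leaves out empty instead of aborting, while force-feeding device 4 returns 2.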