Bug #367 (closed)

OSD crash: CrushWrapper::do_rule

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%

Description

I tried to create a crush rule where a pool would run on just one OSD, for some performance testing.

Loading the crushmap went fine, but creating a pool with that crush rule resulted in 5 of the 12 OSDs crashing.

The crushmap:

domain kvmimg {
        id -2           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item device0 weight 1.000
}
..
..
..
rule kvm {
        ruleset 4
        type replicated
        min_size 1
        max_size 10
        step take kvmimg
        step choose firstn 0 type device
        step emit
}
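
For reference, the same single-OSD intent could presumably also be expressed by asking CRUSH for exactly one item instead of "pool size" many (firstn 0), along these lines. This is only a sketch reusing the kvmimg bucket and device type from the map above; I haven't verified this form against this version:

rule kvm {
        ruleset 4
        type replicated
        min_size 1
        max_size 1
        step take kvmimg
        step choose firstn 1 type device
        step emit
}

With firstn 0 the choose step tries to select as many items as the pool's replica count, while firstn 1 asks for exactly one device.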

On 4 OSDs the backtrace was almost the same:

Core was generated by `/usr/bin/cosd -i 11 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f5edf7e1a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007f5edf7e1a75 in raise () from /lib/libc.so.6
#1  0x00007f5edf7e55c0 in abort () from /lib/libc.so.6
#2  0x00007f5edf7da941 in __assert_fail () from /lib/libc.so.6
#3  0x0000000000573b5c in crush_do_rule ()
#4  0x000000000050fbf5 in CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, int, std::vector<unsigned int, std::allocator<unsigned int> >&) ()
#5  0x0000000000512eab in OSDMap::pg_to_osds(pg_t, std::vector<int, std::allocator<int> >&) ()
#6  0x00000000004df562 in OSD::advance_map(ObjectStore::Transaction&) ()
#7  0x00000000004e3836 in OSD::handle_osd_map(MOSDMap*) ()
#8  0x00000000004ef978 in OSD::_dispatch(Message*) ()
#9  0x00000000004f03f9 in OSD::ms_dispatch(Message*) ()
#10 0x0000000000462449 in SimpleMessenger::dispatch_entry() ()
#11 0x000000000045930c in SimpleMessenger::DispatchThread::entry() ()
#12 0x000000000046d30a in Thread::_entry_func(void*) ()
#13 0x00007f5ee08dc9ca in start_thread () from /lib/libpthread.so.0
#14 0x00007f5edf8946fd in clone () from /lib/libc.so.6
#15 0x0000000000000000 in ?? ()
(gdb)

One OSD, however, had the following backtrace:

Core was generated by `/usr/bin/cosd -i 7 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fbcd7411a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fbcd7411a75 in raise () from /lib/libc.so.6
#1  0x00007fbcd74155c0 in abort () from /lib/libc.so.6
#2  0x00007fbcd7cc68e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fbcd7cc4d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fbcd7cc4d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fbcd7cc4e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x0000000000544a79 in decode(std::string&, ceph::buffer::list::iterator&) ()
#7  0x00000000005311f8 in PG::read_log(ObjectStore*) ()
#8  0x00000000005331d6 in PG::read_state(ObjectStore*) ()
#9  0x00000000004e6675 in OSD::load_pgs() ()
#10 0x00000000004e7058 in OSD::init() ()
#11 0x00000000004580e2 in main ()
(gdb) 

When going through the current/meta data I found that the OSD maps all contain this wrong crush rule, which keeps crashing the OSDs. I'm going to try to inject all the OSD maps (on all the OSDs, since my current OSDs are behind on the epoch) and then start them again.

But as said before (#228), a new crushmap should not crash the cluster.

I've uploaded all the core files and logfiles to logger.ceph.widodh.nl in the directory /srv/ceph/issues/osd_crash_19_aug
