Bug #272
closedMonitor crashes during replication level change
0%
Description
Today i tried to change my replication level from 2 to 3 on my "data" and "metadata" pools.
This caused various crashes on my cluster (MDS, MON and OSD), but in this issue i'm reporting to monitor crashes i saw during the proces.
I tried to start the MON's again, but they would crash again after a few minutes. During this proces there were no I/O's to the filesystem, but it was in a degraded state (40%) since the level had changed.
The backtraces from mon0
root@node13:~# ls -al /core* -rw------- 1 root root 252190720 2010-07-12 16:53 /core.node13.1636 -rw------- 1 root root 249643008 2010-07-12 16:56 /core.node13.3359 root@node13:~#
Core was generated by `/usr/bin/cmon -i 0 -c /etc/ceph/ceph.conf'. Program terminated with signal 6, Aborted. #0 0x00007ff320b82a75 in raise () from /lib/libc.so.6 (gdb) bt #0 0x00007ff320b82a75 in raise () from /lib/libc.so.6 #1 0x00007ff320b865c0 in abort () from /lib/libc.so.6 #2 0x00007ff3214378e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6 #3 0x00007ff321435d16 in ?? () from /usr/lib/libstdc++.so.6 #4 0x00007ff321435d43 in std::terminate() () from /usr/lib/libstdc++.so.6 #5 0x00007ff321435e3e in __cxa_throw () from /usr/lib/libstdc++.so.6 #6 0x00007ff3214362cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6 #7 0x00007ff321436389 in operator new[](unsigned long) () from /usr/lib/libstdc++.so.6 #8 0x00000000004e42c2 in raw_char (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>) at ./include/buffer.h:154 #9 ceph::buffer::create (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>) at ./include/buffer.h:293 #10 ptr (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>) at ./include/buffer.h:331 #11 MonitorStore::get_bl_ss (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>) at mon/MonitorStore.cc:217 #12 0x000000000048c4dd in MonitorStore::get_bl_sn (this=<value optimized out>, from=<value optimized out>) at mon/MonitorStore.h:60 #13 OSDMonitor::build_incremental (this=<value optimized out>, from=<value optimized out>) at mon/OSDMonitor.cc:761 #14 0x000000000048cbfb in OSDMonitor::check_sub (this=0x664, sub=0x7ff31beb2210) at mon/OSDMonitor.cc:808 #15 0x000000000046affc in Monitor::handle_subscribe (this=<value optimized out>, m=<value optimized out>) at mon/Monitor.cc:787 #16 0x000000000046c78f in Monitor::_ms_dispatch (this=<value optimized out>, m=0x59ca3e0) at mon/Monitor.cc:651 #17 0x0000000000478287 in Monitor::ms_dispatch(Message*) () #18 0x0000000000452b89 in Messenger::ms_deliver_dispatch (this=0x1f58e50) at msg/Messenger.h:97 #19 SimpleMessenger::dispatch_entry (this=0x1f58e50) at msg/SimpleMessenger.cc:342 #20 0x000000000044877c in SimpleMessenger::DispatchThread::entry (this=0x1f592d8) at msg/SimpleMessenger.h:534 #21 0x000000000045bcba in Thread::_entry_func (arg=0x664) at ./common/Thread.h:39 #22 0x00007ff321a159ca in start_thread () from /lib/libpthread.so.0 #23 0x00007ff320c356cd in clone () from /lib/libc.so.6 #24 0x0000000000000000 in ?? () (gdb)
Then, the second backtrace from mon0
Core was generated by `/usr/bin/cmon -i 0 -c /etc/ceph/ceph.conf'. Program terminated with signal 6, Aborted. #0 0x00007fca3ba71a75 in raise () from /lib/libc.so.6 (gdb) bt #0 0x00007fca3ba71a75 in raise () from /lib/libc.so.6 #1 0x00007fca3ba755c0 in abort () from /lib/libc.so.6 #2 0x00007fca3c3268e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6 #3 0x00007fca3c324d16 in ?? () from /usr/lib/libstdc++.so.6 #4 0x00007fca3c324d43 in std::terminate() () from /usr/lib/libstdc++.so.6 #5 0x00007fca3c324e3e in __cxa_throw () from /usr/lib/libstdc++.so.6 #6 0x00007fca3c3252cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6 #7 0x00000000004a0563 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, ceph::buffer::list>, std::_Select1st<std::pair<unsigned int const, ceph::buffer::list> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::_M_insert_(std::_Rb_tree_node_base const*, std::_Rb_tree_node_base const*, std::pair<unsigned int const, ceph::buffer::list> const&) () #8 0x00000000004a088b in std::_Rb_tree<unsigned int, std::pair<unsigned int const, ceph::buffer::list>, std::_Select1st<std::pair<unsigned int const, ceph::buffer::list> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::_M_insert_unique_(std::_Rb_tree_const_iterator<std::pair<unsigned int const, ceph::buffer::list> >, std::pair<unsigned int const, ceph::buffer::list> const&) () #9 0x000000000048c6d0 in std::map<unsigned int, ceph::buffer::list, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::insert (this=<value optimized out>, from=<value optimized out>) at /usr/include/c++/4.4/bits/stl_map.h:540 #10 std::map<unsigned int, ceph::buffer::list, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::operator[] (this=<value optimized out>, from=<value optimized out>) at /usr/include/c++/4.4/bits/stl_map.h:450 #11 OSDMonitor::build_incremental (this=<value optimized out>, from=<value optimized out>) at mon/OSDMonitor.cc:763 #12 0x000000000048cbfb in OSDMonitor::check_sub (this=0xd1f, sub=0x7fca2bb571f0) at mon/OSDMonitor.cc:808 #13 0x000000000048cefe in OSDMonitor::check_subs (this=0x16a9240) at mon/OSDMonitor.cc:799 #14 0x000000000048d6dc in OSDMonitor::update_from_paxos (this=0x16a9240) at mon/OSDMonitor.cc:125 #15 0x0000000000482ba8 in PaxosService::_commit (this=0x16a9240) at mon/PaxosService.cc:105 #16 0x000000000047f489 in finish_contexts(std::list<Context*, std::allocator<Context*> >&, int) () #17 0x000000000047bdad in Paxos::handle_accept (this=0x16a9020, accept=0x7fca3292c360) at mon/Paxos.cc:454 #18 0x000000000047e793 in Paxos::dispatch (this=0x16a9020, m=0x7fca3292c360) at mon/Paxos.cc:822 #19 0x000000000046c604 in Monitor::_ms_dispatch (this=<value optimized out>, m=0x7fca3292c360) at mon/Monitor.cc:718 #20 0x0000000000478287 in Monitor::ms_dispatch(Message*) () #21 0x0000000000452b89 in Messenger::ms_deliver_dispatch (this=0x1694e50) at msg/Messenger.h:97 #22 SimpleMessenger::dispatch_entry (this=0x1694e50) at msg/SimpleMessenger.cc:342 #23 0x000000000044877c in SimpleMessenger::DispatchThread::entry (this=0x16952d8) at msg/SimpleMessenger.h:534 #24 0x000000000045bcba in Thread::_entry_func (arg=0xd1f) at ./common/Thread.h:39 ---Type <return> to continue, or q <return> to quit--- #25 0x00007fca3c9049ca in start_thread () from /lib/libpthread.so.0 #26 0x00007fca3bb246cd in clone () from /lib/libc.so.6 #27 0x0000000000000000 in ?? () (gdb)
Then, the backtraces of mon1:
root@node14:~# ls -al /core* -rw------- 1 root root 261062656 2010-07-12 16:58 /core.node14.13946 -rw------- 1 root root 271114240 2010-07-12 16:50 /core.node14.8617 root@node14:~#
Core was generated by `/usr/bin/cmon -i 1 -c /etc/ceph/ceph.conf'. Program terminated with signal 6, Aborted. #0 0x00007f976a292a75 in raise () from /lib/libc.so.6 (gdb) bt #0 0x00007f976a292a75 in raise () from /lib/libc.so.6 #1 0x00007f976a2965c0 in abort () from /lib/libc.so.6 #2 0x00007f976ab488e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6 #3 0x00007f976ab46d16 in ?? () from /usr/lib/libstdc++.so.6 #4 0x00007f976ab46d43 in std::terminate() () from /usr/lib/libstdc++.so.6 #5 0x00007f976ab46e3e in __cxa_throw () from /usr/lib/libstdc++.so.6 #6 0x00007f976ab472cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6 #7 0x00007f976ab47389 in operator new[](unsigned long) () from /usr/lib/libstdc++.so.6 #8 0x0000000000452138 in raw_char (this=0x384a6d0, pm=<value optimized out>) at ./include/buffer.h:154 #9 ceph::buffer::create (this=0x384a6d0, pm=<value optimized out>) at ./include/buffer.h:293 #10 SimpleMessenger::Pipe::read_message (this=0x384a6d0, pm=<value optimized out>) at msg/SimpleMessenger.cc:1722 #11 0x0000000000456ac3 in SimpleMessenger::Pipe::reader (this=0x384a6d0) at msg/SimpleMessenger.cc:1450 #12 0x00000000004487cd in SimpleMessenger::Pipe::Reader::entry (this=<value optimized out>) at msg/SimpleMessenger.h:171 #13 0x000000000045bcba in Thread::_entry_func (arg=0x21a9) at ./common/Thread.h:39 #14 0x00007f976b1269ca in start_thread () from /lib/libpthread.so.0 #15 0x00007f976a3456fd in clone () from /lib/libc.so.6 #16 0x0000000000000000 in ?? () (gdb)
The second backtrace:
Core was generated by `/usr/bin/cmon -i 1 -c /etc/ceph/ceph.conf'. Program terminated with signal 6, Aborted. #0 0x00007eff86c09a75 in raise () from /lib/libc.so.6 (gdb) bt #0 0x00007eff86c09a75 in raise () from /lib/libc.so.6 #1 0x00007eff86c0d5c0 in abort () from /lib/libc.so.6 #2 0x00007eff874bf8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6 #3 0x00007eff874bdd16 in ?? () from /usr/lib/libstdc++.so.6 #4 0x00007eff874bdd43 in std::terminate() () from /usr/lib/libstdc++.so.6 #5 0x00007eff874bde3e in __cxa_throw () from /usr/lib/libstdc++.so.6 #6 0x00007eff874be2cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6 #7 0x00007eff874be389 in operator new[](unsigned long) () from /usr/lib/libstdc++.so.6 #8 0x00000000004e42c2 in raw_char (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>) at ./include/buffer.h:154 #9 ceph::buffer::create (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>) at ./include/buffer.h:293 #10 ptr (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>) at ./include/buffer.h:331 #11 MonitorStore::get_bl_ss (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>) at mon/MonitorStore.cc:217 #12 0x000000000048c4dd in MonitorStore::get_bl_sn (this=<value optimized out>, from=<value optimized out>) at mon/MonitorStore.h:60 #13 OSDMonitor::build_incremental (this=<value optimized out>, from=<value optimized out>) at mon/OSDMonitor.cc:761 #14 0x000000000048cbfb in OSDMonitor::check_sub (this=0x367a, sub=0x7eff7a7af320) at mon/OSDMonitor.cc:808 #15 0x000000000046affc in Monitor::handle_subscribe (this=<value optimized out>, m=<value optimized out>) at mon/Monitor.cc:787 #16 0x000000000046c78f in Monitor::_ms_dispatch (this=<value optimized out>, m=0x7eff838dd660) at mon/Monitor.cc:651 #17 0x0000000000478287 in Monitor::ms_dispatch(Message*) () #18 0x0000000000452b89 in Messenger::ms_deliver_dispatch (this=0x1df5e50) at msg/Messenger.h:97 #19 SimpleMessenger::dispatch_entry (this=0x1df5e50) at msg/SimpleMessenger.cc:342 #20 0x000000000044877c in SimpleMessenger::DispatchThread::entry (this=0x1df62d8) at msg/SimpleMessenger.h:534 #21 0x000000000045bcba in Thread::_entry_func (arg=0x367a) at ./common/Thread.h:39 #22 0x00007eff87a9d9ca in start_thread () from /lib/libpthread.so.0 #23 0x00007eff86cbc6fd in clone () from /lib/libc.so.6 #24 0x0000000000000000 in ?? () (gdb)
As you can see, the traces are different after a restart of the monitor.
These tests were all run with the unstable branch ( 4884f7429cef0a16b7dae263ae24ca3ba6c555eb )
The core files and logfiles from the monitors (which seem to be rotated?) are uploaded at: http://zooi.widodh.nl/ceph/monitor_crash_during_recover.tar.gz