Project

General

Profile

Bug #272

Monitor crashes during replication level change

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Today i tried to change my replication level from 2 to 3 on my "data" and "metadata" pools.

This caused various crashes on my cluster (MDS, MON and OSD), but in this issue i'm reporting to monitor crashes i saw during the proces.

I tried to start the MON's again, but they would crash again after a few minutes. During this proces there were no I/O's to the filesystem, but it was in a degraded state (40%) since the level had changed.

The backtraces from mon0

root@node13:~# ls -al /core*
-rw------- 1 root root 252190720 2010-07-12 16:53 /core.node13.1636
-rw------- 1 root root 249643008 2010-07-12 16:56 /core.node13.3359
root@node13:~#
Core was generated by `/usr/bin/cmon -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007ff320b82a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007ff320b82a75 in raise () from /lib/libc.so.6
#1  0x00007ff320b865c0 in abort () from /lib/libc.so.6
#2  0x00007ff3214378e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007ff321435d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007ff321435d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007ff321435e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007ff3214362cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6
#7  0x00007ff321436389 in operator new[](unsigned long) () from /usr/lib/libstdc++.so.6
#8  0x00000000004e42c2 in raw_char (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, 
    b=<value optimized out>) at ./include/buffer.h:154
#9  ceph::buffer::create (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, 
    b=<value optimized out>) at ./include/buffer.h:293
#10 ptr (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>)
    at ./include/buffer.h:331
#11 MonitorStore::get_bl_ss (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, 
    b=<value optimized out>) at mon/MonitorStore.cc:217
#12 0x000000000048c4dd in MonitorStore::get_bl_sn (this=<value optimized out>, from=<value optimized out>)
    at mon/MonitorStore.h:60
#13 OSDMonitor::build_incremental (this=<value optimized out>, from=<value optimized out>) at mon/OSDMonitor.cc:761
#14 0x000000000048cbfb in OSDMonitor::check_sub (this=0x664, sub=0x7ff31beb2210) at mon/OSDMonitor.cc:808
#15 0x000000000046affc in Monitor::handle_subscribe (this=<value optimized out>, m=<value optimized out>)
    at mon/Monitor.cc:787
#16 0x000000000046c78f in Monitor::_ms_dispatch (this=<value optimized out>, m=0x59ca3e0) at mon/Monitor.cc:651
#17 0x0000000000478287 in Monitor::ms_dispatch(Message*) ()
#18 0x0000000000452b89 in Messenger::ms_deliver_dispatch (this=0x1f58e50) at msg/Messenger.h:97
#19 SimpleMessenger::dispatch_entry (this=0x1f58e50) at msg/SimpleMessenger.cc:342
#20 0x000000000044877c in SimpleMessenger::DispatchThread::entry (this=0x1f592d8) at msg/SimpleMessenger.h:534
#21 0x000000000045bcba in Thread::_entry_func (arg=0x664) at ./common/Thread.h:39
#22 0x00007ff321a159ca in start_thread () from /lib/libpthread.so.0
#23 0x00007ff320c356cd in clone () from /lib/libc.so.6
#24 0x0000000000000000 in ?? ()
(gdb) 

Then, the second backtrace from mon0

Core was generated by `/usr/bin/cmon -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fca3ba71a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fca3ba71a75 in raise () from /lib/libc.so.6
#1  0x00007fca3ba755c0 in abort () from /lib/libc.so.6
#2  0x00007fca3c3268e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fca3c324d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fca3c324d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fca3c324e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007fca3c3252cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6
#7  0x00000000004a0563 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, ceph::buffer::list>, std::_Select1st<std::pair<unsigned int const, ceph::buffer::list> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::_M_insert_(std::_Rb_tree_node_base const*, std::_Rb_tree_node_base const*, std::pair<unsigned int const, ceph::buffer::list> const&) ()
#8  0x00000000004a088b in std::_Rb_tree<unsigned int, std::pair<unsigned int const, ceph::buffer::list>, std::_Select1st<std::pair<unsigned int const, ceph::buffer::list> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::_M_insert_unique_(std::_Rb_tree_const_iterator<std::pair<unsigned int const, ceph::buffer::list> >, std::pair<unsigned int const, ceph::buffer::list> const&) ()
#9  0x000000000048c6d0 in std::map<unsigned int, ceph::buffer::list, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::insert (this=<value optimized out>, from=<value optimized out>)
    at /usr/include/c++/4.4/bits/stl_map.h:540
#10 std::map<unsigned int, ceph::buffer::list, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::operator[] (this=<value optimized out>, from=<value optimized out>)
    at /usr/include/c++/4.4/bits/stl_map.h:450
#11 OSDMonitor::build_incremental (this=<value optimized out>, from=<value optimized out>) at mon/OSDMonitor.cc:763
#12 0x000000000048cbfb in OSDMonitor::check_sub (this=0xd1f, sub=0x7fca2bb571f0) at mon/OSDMonitor.cc:808
#13 0x000000000048cefe in OSDMonitor::check_subs (this=0x16a9240) at mon/OSDMonitor.cc:799
#14 0x000000000048d6dc in OSDMonitor::update_from_paxos (this=0x16a9240) at mon/OSDMonitor.cc:125
#15 0x0000000000482ba8 in PaxosService::_commit (this=0x16a9240) at mon/PaxosService.cc:105
#16 0x000000000047f489 in finish_contexts(std::list<Context*, std::allocator<Context*> >&, int) ()
#17 0x000000000047bdad in Paxos::handle_accept (this=0x16a9020, accept=0x7fca3292c360) at mon/Paxos.cc:454
#18 0x000000000047e793 in Paxos::dispatch (this=0x16a9020, m=0x7fca3292c360) at mon/Paxos.cc:822
#19 0x000000000046c604 in Monitor::_ms_dispatch (this=<value optimized out>, m=0x7fca3292c360) at mon/Monitor.cc:718
#20 0x0000000000478287 in Monitor::ms_dispatch(Message*) ()
#21 0x0000000000452b89 in Messenger::ms_deliver_dispatch (this=0x1694e50) at msg/Messenger.h:97
#22 SimpleMessenger::dispatch_entry (this=0x1694e50) at msg/SimpleMessenger.cc:342
#23 0x000000000044877c in SimpleMessenger::DispatchThread::entry (this=0x16952d8) at msg/SimpleMessenger.h:534
#24 0x000000000045bcba in Thread::_entry_func (arg=0xd1f) at ./common/Thread.h:39
---Type <return> to continue, or q <return> to quit---
#25 0x00007fca3c9049ca in start_thread () from /lib/libpthread.so.0
#26 0x00007fca3bb246cd in clone () from /lib/libc.so.6
#27 0x0000000000000000 in ?? ()
(gdb) 

Then, the backtraces of mon1:

root@node14:~# ls -al /core*
-rw------- 1 root root 261062656 2010-07-12 16:58 /core.node14.13946
-rw------- 1 root root 271114240 2010-07-12 16:50 /core.node14.8617
root@node14:~#
Core was generated by `/usr/bin/cmon -i 1 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f976a292a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007f976a292a75 in raise () from /lib/libc.so.6
#1  0x00007f976a2965c0 in abort () from /lib/libc.so.6
#2  0x00007f976ab488e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007f976ab46d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007f976ab46d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007f976ab46e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007f976ab472cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6
#7  0x00007f976ab47389 in operator new[](unsigned long) () from /usr/lib/libstdc++.so.6
#8  0x0000000000452138 in raw_char (this=0x384a6d0, pm=<value optimized out>) at ./include/buffer.h:154
#9  ceph::buffer::create (this=0x384a6d0, pm=<value optimized out>) at ./include/buffer.h:293
#10 SimpleMessenger::Pipe::read_message (this=0x384a6d0, pm=<value optimized out>) at msg/SimpleMessenger.cc:1722
#11 0x0000000000456ac3 in SimpleMessenger::Pipe::reader (this=0x384a6d0) at msg/SimpleMessenger.cc:1450
#12 0x00000000004487cd in SimpleMessenger::Pipe::Reader::entry (this=<value optimized out>) at msg/SimpleMessenger.h:171
#13 0x000000000045bcba in Thread::_entry_func (arg=0x21a9) at ./common/Thread.h:39
#14 0x00007f976b1269ca in start_thread () from /lib/libpthread.so.0
#15 0x00007f976a3456fd in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb) 

The second backtrace:

Core was generated by `/usr/bin/cmon -i 1 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007eff86c09a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007eff86c09a75 in raise () from /lib/libc.so.6
#1  0x00007eff86c0d5c0 in abort () from /lib/libc.so.6
#2  0x00007eff874bf8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007eff874bdd16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007eff874bdd43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007eff874bde3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007eff874be2cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6
#7  0x00007eff874be389 in operator new[](unsigned long) () from /usr/lib/libstdc++.so.6
#8  0x00000000004e42c2 in raw_char (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, 
    b=<value optimized out>) at ./include/buffer.h:154
#9  ceph::buffer::create (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, 
    b=<value optimized out>) at ./include/buffer.h:293
#10 ptr (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, b=<value optimized out>)
    at ./include/buffer.h:331
#11 MonitorStore::get_bl_ss (this=<value optimized out>, bl=<value optimized out>, a=<value optimized out>, 
    b=<value optimized out>) at mon/MonitorStore.cc:217
#12 0x000000000048c4dd in MonitorStore::get_bl_sn (this=<value optimized out>, from=<value optimized out>)
    at mon/MonitorStore.h:60
#13 OSDMonitor::build_incremental (this=<value optimized out>, from=<value optimized out>) at mon/OSDMonitor.cc:761
#14 0x000000000048cbfb in OSDMonitor::check_sub (this=0x367a, sub=0x7eff7a7af320) at mon/OSDMonitor.cc:808
#15 0x000000000046affc in Monitor::handle_subscribe (this=<value optimized out>, m=<value optimized out>)
    at mon/Monitor.cc:787
#16 0x000000000046c78f in Monitor::_ms_dispatch (this=<value optimized out>, m=0x7eff838dd660) at mon/Monitor.cc:651
#17 0x0000000000478287 in Monitor::ms_dispatch(Message*) ()
#18 0x0000000000452b89 in Messenger::ms_deliver_dispatch (this=0x1df5e50) at msg/Messenger.h:97
#19 SimpleMessenger::dispatch_entry (this=0x1df5e50) at msg/SimpleMessenger.cc:342
#20 0x000000000044877c in SimpleMessenger::DispatchThread::entry (this=0x1df62d8) at msg/SimpleMessenger.h:534
#21 0x000000000045bcba in Thread::_entry_func (arg=0x367a) at ./common/Thread.h:39
#22 0x00007eff87a9d9ca in start_thread () from /lib/libpthread.so.0
#23 0x00007eff86cbc6fd in clone () from /lib/libc.so.6
#24 0x0000000000000000 in ?? ()
(gdb) 

As you can see, the traces are different after a restart of the monitor.

These tests were all run with the unstable branch ( 4884f7429cef0a16b7dae263ae24ca3ba6c555eb )

The core files and logfiles from the monitors (which seem to be rotated?) are uploaded at: http://zooi.widodh.nl/ceph/monitor_crash_during_recover.tar.gz

History

#1 Updated by Wido den Hollander over 13 years ago

Just tried to start the monitors again, they both crashed again:

mon0

Core was generated by `/usr/bin/cmon -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fb8d51e9a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fb8d51e9a75 in raise () from /lib/libc.so.6
#1  0x00007fb8d51ed5c0 in abort () from /lib/libc.so.6
#2  0x00007fb8d5a9e8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fb8d5a9cd16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fb8d5a9cd43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fb8d5a9ce3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007fb8d5a9d2cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6
#7  0x00007fb8d5a79d99 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) ()
   from /usr/lib/libstdc++.so.6
#8  0x00007fb8d5a7a76b in std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) ()
   from /usr/lib/libstdc++.so.6
#9  0x00007fb8d5a7b29c in std::string::reserve(unsigned long) () from /usr/lib/libstdc++.so.6
#10 0x00007fb8d5a7b4e8 in std::string::append(char const*, unsigned long) () from /usr/lib/libstdc++.so.6
#11 0x00000000004b275b in std::string::append (state=<value optimized out>) at /usr/include/c++/4.4/bits/basic_string.h:866
#12 std::string::operator+= (state=<value optimized out>) at /usr/include/c++/4.4/bits/basic_string.h:800
#13 pg_state_string (state=<value optimized out>) at ./osd/osd_types.h:502
#14 0x00000000004b998b in PGMonitor::prepare_pg_stats (this=0xfe1120, stats=0x7fb8cc09f8b0) at mon/PGMonitor.cc:384
#15 0x00000000004ba333 in PGMonitor::prepare_update (this=0xfe1120, m=0x7fb8cc09f8b0) at mon/PGMonitor.cc:173
#16 0x0000000000482971 in PaxosService::dispatch (this=0xfe1120, m=0x7fb8cc09f8b0) at mon/PaxosService.cc:70
#17 0x000000000046c300 in Monitor::_ms_dispatch (this=<value optimized out>, m=0x7fb8cc09f8b0) at mon/Monitor.cc:683
#18 0x0000000000478287 in Monitor::ms_dispatch(Message*) ()
#19 0x0000000000452b89 in Messenger::ms_deliver_dispatch (this=0xfcbe50) at msg/Messenger.h:97
#20 SimpleMessenger::dispatch_entry (this=0xfcbe50) at msg/SimpleMessenger.cc:342
#21 0x000000000044877c in SimpleMessenger::DispatchThread::entry (this=0xfcc2d8) at msg/SimpleMessenger.h:534
#22 0x000000000045bcba in Thread::_entry_func (arg=0x3f88) at ./common/Thread.h:39
#23 0x00007fb8d607c9ca in start_thread () from /lib/libpthread.so.0
#24 0x00007fb8d529c6cd in clone () from /lib/libc.so.6
#25 0x0000000000000000 in ?? ()
(gdb) 

mon1
Core was generated by `/usr/bin/cmon -i 1 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0 0x00007f9138f62a75 in raise () from /lib/libc.so.6
(gdb) bt
#0 0x00007f9138f62a75 in raise () from /lib/libc.so.6
#1 0x00007f9138f665c0 in abort () from /lib/libc.so.6
#2 0x00007f91398188e5 in _gnu_cxx::_verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007f9139816d16 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007f9139816d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007f9139816e3e in _cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00007f91398172cd in operator new(unsigned long) () from /usr/lib/libstdc++.so.6
#7 0x00000000004a0563 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, ceph::buffer::list>, std::_Select1st<std::pair<unsigned int const, ceph::buffer::list> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::_M_insert
(std::_Rb_tree_node_base const*, std::_Rb_tree_node_base const*, std::pair<unsigned int const, ceph::buffer::list> const&) ()
#8 0x00000000004a088b in std::_Rb_tree<unsigned int, std::pair<unsigned int const, ceph::buffer::list>, std::_Select1st<std::pair<unsigned int const, ceph::buffer::list> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::_M_insert_unique_(std::_Rb_tree_const_iterator<std::pair<unsigned int const, ceph::buffer::list> >, std::pair<unsigned int const, ceph::buffer::list> const&) ()
#9 0x000000000048c6d0 in std::map<unsigned int, ceph::buffer::list, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::insert (this=<value optimized out>, from=<value optimized out>)
at /usr/include/c++/4.4/bits/stl_map.h:540
#10 std::map<unsigned int, ceph::buffer::list, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, ceph::buffer::list> > >::operator[] (this=<value optimized out>, from=<value optimized out>)
at /usr/include/c++/4.4/bits/stl_map.h:450
#11 OSDMonitor::build_incremental (this=<value optimized out>, from=<value optimized out>) at mon/OSDMonitor.cc:763
#12 0x000000000048cbfb in OSDMonitor::check_sub (this=0x696d, sub=0x7f91294fb560) at mon/OSDMonitor.cc:808
#13 0x000000000046affc in Monitor::handle_subscribe (this=<value optimized out>, m=<value optimized out>)
at mon/Monitor.cc:787
#14 0x000000000046c78f in Monitor::_ms_dispatch (this=<value optimized out>, m=0x45463d0) at mon/Monitor.cc:651
#15 0x0000000000478287 in Monitor::ms_dispatch(Message*) ()
#16 0x0000000000452b89 in Messenger::ms_deliver_dispatch (this=0x252be50) at msg/Messenger.h:97
#17 SimpleMessenger::dispatch_entry (this=0x252be50) at msg/SimpleMessenger.cc:342
#18 0x000000000044877c in SimpleMessenger::DispatchThread::entry (this=0x252c2d8) at msg/SimpleMessenger.h:534
#19 0x000000000045bcba in Thread::_entry_func (arg=0x696d) at ./common/Thread.h:39
#20 0x00007f9139df69ca in start_thread () from /lib/libpthread.so.0
#21 0x00007f91390156fd in clone () from /lib/libc.so.6
#22 0x0000000000000000 in ?? ()
(gdb)

I uploaded the core files and logfiles to, they can be found at: http://zooi.widodh.nl/ceph/monitor_crash_during_recover.1.tar.gz

#2 Updated by Sage Weil over 13 years ago

It looks like your system generated a bazillion osdmap updates for some reason. And then when osds start up, they tell hte mon they have an old map X, and it tries to catch them up by generating a single enormous message. 851a73412287ca79640d9a371f7c156f63002ad6 limits each message to 100 incrementals, so they can catch up piecemeal. Can you see if that helps?

It may be that the monitor is still too fast reading things off disk and eats up memory before it can send the updates over the wire.. :/

#3 Updated by Sage Weil over 13 years ago

  • Status changed from New to 4

#4 Updated by Sage Weil over 13 years ago

  • Status changed from 4 to Closed

Also available in: Atom PDF