Bug #4572

closed

osd crash with: 0 == "we got a bad state machine event"

Added by Wido den Hollander about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This seems like #4042 but the backtrace seems different.

After resolving #4556 I tried to recover the cluster, but in the end 18 out of the 40 OSDs survived and were running.

The other 22 seem to have crashed with nearly identical backtraces.

For example osd.2:

(gdb) bt
#0  0x00007f2fa5229b7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000078910e in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  <signal handler called>
#4  0x00007f2fa3be8425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f2fa3bebb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f2fa453a69d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f2fa4538846 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f2fa4538873 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f2fa453896e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000008343af in ceph::__ceph_assert_fail (assertion=0x9123b0 "0 == \"we got a bad state machine event\"", file=<optimized out>, line=5250, 
    func=0x916ca0 "PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)") at common/assert.cc:77
#11 0x000000000068866b in PG::RecoveryState::Crashed::Crashed (this=0x37b59b0, ctx=...) at osd/PG.cc:5250
#12 0x00000000006b4496 in shallow_construct (outermostContextBase=..., pContext=<optimized out>) at /usr/include/boost/statechart/state.hpp:89
#13 deep_construct (outermostContextBase=..., pContext=<optimized out>) at /usr/include/boost/statechart/state.hpp:79
#14 boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct (outermostContextBase=..., pContext=<optimized out>)
    at /usr/include/boost/statechart/detail/constructor.hpp:93
#15 0x00000000006d2643 in transit_impl<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function> (this=0x37b59b0, transitionAction=...)
    at /usr/include/boost/statechart/simple_state.hpp:798
#16 transit<PG::RecoveryState::Crashed> (this=0x37b59b0) at /usr/include/boost/statechart/simple_state.hpp:314
#17 react_without_action (stt=...) at /usr/include/boost/statechart/transition.hpp:38
#18 react (stt=...) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:47
#19 react (stt=..., evt=...) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:68
#20 react (stt=..., evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:109
#21 react<PG::RecoveryState::Reset, boost::statechart::event_base, void const*> (stt=..., evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/transition.hpp:59
#22 local_react_impl<boost::mpl::list1<boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/simple_state.hpp:816
#23 local_react<boost::mpl::list1<boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (this=0x37b59b0, evt=..., eventType=<optimized out>)
    at /usr/include/boost/statechart/simple_state.hpp:851
#24 boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/simple_state.hpp:820
#25 0x00000000006d2794 in local_react<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (
    eventType=0xc2b5c0, evt=..., this=0x37b59b0) at /usr/include/boost/statechart/simple_state.hpp:851
#26 local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (eventType=0xc2b5c0, evt=..., 
    stt=...) at /usr/include/boost/statechart/simple_state.hpp:820
#27 local_react<boost::mpl::list3<boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (eventType=0xc2b5c0, evt=..., this=0x37b59b0) at /usr/include/boost/statechart/simple_state.hpp:851
#28 local_react_impl<boost::mpl::list4<boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (stt=..., eventType=0xc2b5c0, evt=...) at /usr/include/boost/statechart/simple_state.hpp:820
#29 local_react<boost::mpl::list4<boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (eventType=0xc2b5c0, evt=..., this=0x37b59b0) at /usr/include/boost/statechart/simple_state.hpp:851
#30 boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (stt=..., evt=..., 
    eventType=0xc2b5c0) at /usr/include/boost/statechart/simple_state.hpp:820
#31 0x00000000006d28ce in local_react<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (eventType=0xc2b5c0, evt=..., this=0x37b59b0)
    at /usr/include/boost/statechart/simple_state.hpp:851
#32 local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (stt=..., eventType=0xc2b5c0, evt=...)
    at /usr/include/boost/statechart/simple_state.hpp:820
#33 local_react<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (eventType=0xc2b5c0, evt=..., 
    this=0x37b59b0) at /usr/include/boost/statechart/simple_state.hpp:851
#34 boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl (this=0x37b59b0, evt=..., eventType=0xc2b5c0)
    at /usr/include/boost/statechart/simple_state.hpp:489
#35 0x00000000006bb03b in operator() (this=<synthetic pointer>) at /usr/include/boost/statechart/state_machine.hpp:87
#36 operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<void>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, const void*>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial>::exception_event_handler> (action=..., this=<optimized out>)
    at /usr/include/boost/statechart/null_exception_translator.hpp:33
#37 boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event (this=0x37c6768, 
    evt=...) at /usr/include/boost/statechart/state_machine.hpp:885
#38 0x00000000006bb311 in boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event (this=0x37c6768, evt=...) at /usr/include/boost/statechart/state_machine.hpp:275

#39 0x000000000067b3f7 in handle_event (rctx=0x7f2f93c008e0, evt=<error reading variable: access outside bounds of object referenced via synthetic pointer>, this=0x37c6768) at osd/PG.h:1717
#40 PG::handle_peering_event (this=0x37c5400, evt=..., rctx=0x7f2f93c008e0) at osd/PG.cc:5114
#41 0x0000000000625118 in OSD::process_peering_events (this=0x2ffc000, pgs=..., handle=...) at osd/OSD.cc:6230
#42 0x000000000065b5d0 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=..., handle=...) at osd/OSD.h:748
#43 0x00000000008297e6 in ThreadPool::worker (this=0x2ffc458, wt=0x9e67c20) at common/WorkQueue.cc:119
#44 0x000000000082b610 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:316
#45 0x00007f2fa5221e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#46 0x00007f2fa3ca5cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#47 0x0000000000000000 in ?? ()
(gdb)
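
Most of the frames above are Boost.Statechart dispatch plumbing. As a reading aid, here is a rough Python sketch that keeps only frames whose source path is relative (osd/..., common/...), which in this backtrace are the frames from Ceph's own sources. The sample frames are copied from the backtrace; the function name `ceph_frames` and the "relative path means Ceph source" rule are assumptions made for illustration.

```python
import re

# A handful of frames copied verbatim from the backtrace above.
BACKTRACE = """\
#11 0x000000000068866b in PG::RecoveryState::Crashed::Crashed (this=0x37b59b0, ctx=...) at osd/PG.cc:5250
#12 0x00000000006b4496 in shallow_construct (outermostContextBase=..., pContext=<optimized out>) at /usr/include/boost/statechart/state.hpp:89
#16 transit<PG::RecoveryState::Crashed> (this=0x37b59b0) at /usr/include/boost/statechart/simple_state.hpp:314
#40 PG::handle_peering_event (this=0x37c5400, evt=..., rctx=0x7f2f93c008e0) at osd/PG.cc:5114
#41 0x0000000000625118 in OSD::process_peering_events (this=0x2ffc000, pgs=..., handle=...) at osd/OSD.cc:6230
#45 0x00007f2fa5221e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
"""

# Matches single-line frames that end in "at <path>:<line>".
FRAME_RE = re.compile(r"^#(\d+)\s.* at (\S+):(\d+)$")

def ceph_frames(text):
    """Return (frame_no, path, line) for frames with a relative source path."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if not match:
            continue  # library frames without source info, e.g. libpthread
        number, path, line = match.groups()
        if path.startswith("/"):
            continue  # absolute path: system header such as Boost
        frames.append((int(number), path, int(line)))
    return frames

for frame in ceph_frames(BACKTRACE):
    print(frame)
```

Frames that gdb wrapped onto a second line (the long template ones) would need to be joined first; this sketch only handles single-line frames.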

I'm running 0.56.4 and it seems the fix for #4202 went into the bobtail branch, so this shouldn't be the same bug.

I'll upload the logs of osd.2 and osd.7 to the cephdrop sftp account since they are both quite big (~150MB each).

I tried to start the OSDs and some of them survived. By starting them host by host I'm now at 21/40 and I'll continue to start OSDs.

Actions #1

Updated by Wido den Hollander about 11 years ago

This cluster is kind of stable again:

osdmap e27492: 40 osds: 36 up, 36 in
pgmap v2604024: 7696 pgs: 7569 active+clean, 126 down+peering, 1 active+clean+inconsistent; 35041 MB data, 151 GB used, 64183 GB / 67068 GB avail
Two OSDs are out/down due to a broken disk, but two others fail to start with the backtrace posted above.
  • osd.2
  • osd.37
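
For anyone checking the numbers in the pgmap line above, a small Python sketch that splits the summary into per-state counts and confirms they add up to the reported total. The line is copied from this update; the helper name `parse_pgmap` is made up for this example and the parsing assumes the exact layout shown here.

```python
import re

# Status line copied from the update above.
PGMAP = ("pgmap v2604024: 7696 pgs: 7569 active+clean, 126 down+peering, "
         "1 active+clean+inconsistent; 35041 MB data, 151 GB used, "
         "64183 GB / 67068 GB avail")

def parse_pgmap(line):
    """Return (total_pgs, {state: count}) from a pgmap summary line."""
    # Everything before the first ';' holds the PG counts.
    head = line.partition(";")[0]
    match = re.search(r"(\d+) pgs: (.*)", head)
    total = int(match.group(1))
    states = {}
    for item in match.group(2).split(","):
        count, state = item.split()
        states[state] = int(count)
    return total, states

total, states = parse_pgmap(PGMAP)
# PGs that are in any state other than plain active+clean.
not_clean = {s: n for s, n in states.items() if s != "active+clean"}
print(total, states, not_clean)
```

With the line above, the 126 down+peering PGs and the single inconsistent PG are the ones left outside plain active+clean.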
Actions #2

Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Sage Weil about 11 years ago

  • Assignee set to Sage Weil
Actions #4

Updated by Sage Weil about 11 years ago

  • Assignee deleted (Sage Weil)

-175> 2013-03-28 17:21:36.478885 7f2f93c01700 10 osd.2 pg_epoch: 26482 pg[0.1f0( empty local-les=0 n=0 ec=1 les/c 7489/7513 26463/26463/25459) [12,3,35] r=-1 lpr=26463 pi=6609-26462/200 inactive NOTIFY] handle_peering_event: epoch_sent: 26470 epoch_requested: 26470 MQuery from 12 query_epoch 26470 query: query(info 0'0)

when the last known state in the log is

2013-03-28 17:21:24.277422 7f2f93c01700 10 osd.2 pg_epoch: 26482 pg[0.1f0( empty local-les=0 n=0 ec=1 les/c 7489/7513 26463/26463/25459) [12,3,35] r=-1 lpr=26463 pi=6609-26462/200 inactive NOTIFY] state<Reset>: Reset advmap
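
To illustrate why an unexpected event in Reset ends in the assert, here is a toy Python model of the dispatch that frames #32 and #33 of the backtrace describe: Reset lists custom reactions for QueryState, AdvMap, ActMap, NullEvt and FlushedEvt, followed by a catch-all transition on event_base to Crashed, whose constructor asserts. This is not Ceph code; the names `reset_react` and `BadStateMachineEvent` are invented for the sketch, and treating "MQuery" as the name of the offending event is an assumption based on the log line above.

```python
class BadStateMachineEvent(AssertionError):
    """Stands in for the assert raised in the Crashed constructor."""

# Events that frames #32/#33 list as custom reactions of the Reset state.
RESET_HANDLED = {"QueryState", "AdvMap", "ActMap", "NullEvt", "FlushedEvt"}

def reset_react(event):
    """Toy dispatch for Reset: listed events are handled, the rest crash."""
    if event in RESET_HANDLED:
        # Simplification: the real reactions may move to other states.
        return "Reset"
    # Catch-all transition<event_base, Crashed>: entering Crashed asserts.
    raise BadStateMachineEvent('0 == "we got a bad state machine event"')

print(reset_react("AdvMap"))
try:
    reset_react("MQuery")  # assumed event name, taken from the log line
except BadStateMachineEvent as err:
    print("crash:", err)
```

In this model any event outside the five listed ones takes the catch-all path, which matches the shape of the backtrace: react on Reset, transit to Crashed, then the assert.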

Actions #5

Updated by Samuel Just about 11 years ago

It's probably due to the is_booting() check in advance_pg().

Actions #6

Updated by Samuel Just about 11 years ago

  • Status changed from New to Resolved

Fix merged into the bobtail branch. Wido, can you test that the current bobtail branch resolves the issue?
