Bug #3699 (Closed): osds crashed in ReplicatedPG::sub_op_modify on a mixed node cluster
Description
cluster: burnupi06 [running osd.1 on v0.55.1], burnupi07 [running osd.3, osd.4, mon.b on argonaut], burnupi08 [running osd.5, osd.6, mon.c, mds.a on argonaut]
Steps to reproduce:
1. All 3 nodes were running argonaut, with I/O being pumped into the cluster from a client.
2. While I/O was in progress, upgraded osd.1 on burnupi06 to v0.55.1 and restarted osd.1.
3. Marked osd.1 out of the cluster with the command "ceph osd out 1", at which point all the other osds still running argonaut crashed:
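The sequence above can be sketched as a shell script. This is an illustrative repro sketch, not part of the original report: it assumes a running mixed-version cluster with client I/O in flight and osd.1 already upgraded and restarted, and it guards against running on a machine without ceph installed.

```shell
#!/bin/sh
# Repro sketch (hypothetical): assumes osd.1 on burnupi06 has already been
# upgraded to v0.55.1 and restarted while client I/O continues against the
# argonaut peers. Harmless no-op on a machine without the ceph CLI.
if command -v ceph >/dev/null 2>&1; then
    # Step 3: mark the upgraded OSD out while I/O is in flight -- this is
    # the point at which the remaining argonaut OSDs crashed in the report.
    ceph osd out 1
    # Inspect which OSDs are still up afterwards.
    ceph osd tree
else
    echo "ceph not installed; skipping"
fi
```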
2012-12-29 14:05:37.654117 7fb605632700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequestRef)' thread 7fb605632700 time 2012-12-29 14:05:37.571094
osd/ReplicatedPG.cc: 4192: FAILED assert(is_active())
ceph version 0.48.2argonaut-61-g9483a03 (commit:9483a032f750572586f146c696ec6501d3df0383)
1: (ReplicatedPG::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0xbbd) [0x54369d]
2: (ReplicatedPG::do_sub_op(std::tr1::shared_ptr<OpRequest>)+0xff) [0x55711f]
3: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x9f) [0x60b89f]
4: (OSD::dequeue_op(PG*)+0x238) [0x5cab78]
5: (ThreadPool::worker()+0x4c4) [0x7aa554]
6: (ThreadPool::WorkThread::entry()+0xd) [0x5e383d]
7: (()+0x7e9a) [0x7fb6164ace9a]
8: (clone()+0x6d) [0x7fb614a5ecbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@burnupi07:/var/log/ceph$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required = none
auth cluster required = none
auth service required = none
[osd]
osd journal size = 1000
filestore xattr use omap = true
[osd.1]
host = burnupi06
[osd.3]
host = burnupi07
[osd.4]
host = burnupi07
[osd.5]
host = burnupi08
[osd.6]
host = burnupi08
[mon.b]
host = burnupi07
mon addr = 10.214.134.38:6789
[mon.c]
host = burnupi08
mon addr = 10.214.134.36:6789
[mds.a]
host = burnupi08

ubuntu@burnupi06:/var/log/ceph$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required = none
auth cluster required = none
auth service required = none
[osd]
osd journal size = 1000
filestore xattr use omap = true
[osd.1]
osd min pg log entries = 10
host = burnupi06
[osd.3]
host = burnupi07
[osd.4]
host = burnupi07
[osd.5]
host = burnupi08
[osd.6]
host = burnupi08
[mon.b]
host = burnupi07
mon addr = 10.214.134.38:6789
[mon.c]
host = burnupi08
mon addr = 10.214.134.36:6789
[mds.a]
host = burnupi08
Leaving the cluster as it is for reference.
Updated by Sage Weil over 11 years ago
- Category set to OSD
- Priority changed from Normal to Urgent
Updated by Tamilarasi muthamizhan over 11 years ago
Bringing the marked-out osd.1 on burnupi06 back in while I/O was running hit the following:
2012-12-31 14:26:26.667816 7f6334ca6700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f6334ca6700 time 2012-12-31 14:26:26.449073
osd/PG.cc: 4891: FAILED assert(0 == "we got a bad state machine event")
ceph version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b)
1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xab) [0x68031b]
2: /usr/bin/ceph-osd() [0x6abcb6]
3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &(boost::statechart::detail::no_context<boost::statechart::event_base>::no_function(boost::statechart::event_base const&))> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xb3) [0x6c9ee3]
4: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x6ca024]
5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb9) [0x6cacf9]
6: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x6cda88]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x6b34fb]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x6b37d1]
9: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x347) [0x6751c7]
10: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&)+0x24a) [0x62280a]
11: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&)+0x10) [0x656350]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x81a13c]
13: (ThreadPool::WorkThread::entry()+0x10) [0x81bf40]
14: (()+0x7e9a) [0x7f6345538e9a]
15: (clone()+0x6d) [0x7f63438d1cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
The cluster is left in the same state.
Updated by Tamilarasi muthamizhan over 11 years ago
- Assignee changed from Tamilarasi muthamizhan to Samuel Just
Reproduced this on burnupi21.
Updated by Sage Weil over 11 years ago
- Status changed from New to Resolved