Bug #3699


OSDs crashed in ReplicatedPG::sub_op_modify on a mixed-node cluster

Added by Tamilarasi muthamizhan over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: Samuel Just
Category: OSD
Target version: -
% Done: 0%
Source: Q/A

Description

Cluster: burnupi06 [running osd.1 on v0.55.1], burnupi07 [running osd.3, osd.4, mon.b on argonaut], burnupi08 [running osd.5, osd.6, mon.c, mds.a on argonaut]

Steps to reproduce:
1. All three nodes were running argonaut, with I/O being pumped into the cluster from a client.
2. While the I/O was in progress, upgraded osd.1 on burnupi06 to v0.55.1 and restarted it.
3. Marked osd.1 out of the cluster with "ceph osd out 1", at which point all the other OSDs still running argonaut crashed:

2012-12-29 14:05:37.654117 7fb605632700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequestRef)' thread 7fb605632700 time 2012-12-29 14:05:37.571094
osd/ReplicatedPG.cc: 4192: FAILED assert(is_active())

 ceph version 0.48.2argonaut-61-g9483a03 (commit:9483a032f750572586f146c696ec6501d3df0383)
 1: (ReplicatedPG::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0xbbd) [0x54369d]
 2: (ReplicatedPG::do_sub_op(std::tr1::shared_ptr<OpRequest>)+0xff) [0x55711f]
 3: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x9f) [0x60b89f]
 4: (OSD::dequeue_op(PG*)+0x238) [0x5cab78]
 5: (ThreadPool::worker()+0x4c4) [0x7aa554]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x5e383d]
 7: (()+0x7e9a) [0x7fb6164ace9a]
 8: (clone()+0x6d) [0x7fb614a5ecbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
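
For context, here is a minimal toy model of the invariant that fails above (an illustration only, not the actual Ceph source): a replica applies a replicated sub-op only while its PG is active, and marking a peer out forces the PG back through peering, so a sub-op dequeued in that window trips assert(is_active()).

#include <cassert>
#include <iostream>

// Toy model of the failing invariant (illustration only, not Ceph code):
// a PG applies replicated writes only while "active". "ceph osd out 1"
// forces the PG back through peering; a sub-op dequeued during that
// window fails the assert, as the argonaut OSDs did at
// osd/ReplicatedPG.cc:4192.
enum class PGState { Active, Peering };

struct PG {
    PGState state = PGState::Active;
    bool is_active() const { return state == PGState::Active; }

    void sub_op_modify() {
        assert(is_active()); // FAILED assert(is_active()) while peering
        std::cout << "applied replicated write\n";
    }
};

int main() {
    PG pg;
    pg.sub_op_modify();           // fine: PG is active
    pg.state = PGState::Peering;  // peer marked out, re-peering begins
    pg.sub_op_modify();           // sub-op handled mid-peering -> abort
}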

ubuntu@burnupi07:/var/log/ceph$ sudo cat /etc/ceph/ceph.conf 
[global]
    auth client required = none
    auth cluster required = none
    auth service required = none

[osd]
    osd journal size = 1000
    filestore xattr use omap = true

[osd.1]
    host = burnupi06

[osd.3]
    host = burnupi07

[osd.4]
    host = burnupi07

[osd.5]
    host = burnupi08

[osd.6]
    host = burnupi08

[mon.b]
    host = burnupi07
    mon addr = 10.214.134.38:6789

[mon.c]
    host = burnupi08
    mon addr = 10.214.134.36:6789

[mds.a]
    host = burnupi08

ubuntu@burnupi06:/var/log/ceph$ sudo cat /etc/ceph/ceph.conf
[global]
    auth client required = none
    auth cluster required = none
    auth service required = none

[osd]
    osd journal size = 1000
    filestore xattr use omap = true

[osd.1]
    osd min pg log entries = 10
    host = burnupi06

[osd.3]
    host = burnupi07

[osd.4]
    host = burnupi07

[osd.5]
    host = burnupi08

[osd.6]
    host = burnupi08

[mon.b]
    host = burnupi07
    mon addr = 10.214.134.38:6789

[mon.c]
    host = burnupi08
    mon addr = 10.214.134.36:6789

[mds.a]
    host = burnupi08

Leaving the cluster as it is for reference.

#1

Updated by Sage Weil over 11 years ago

  • Category set to OSD
  • Priority changed from Normal to Urgent
#2

Updated by Tamilarasi muthamizhan over 11 years ago

Bringing the marked-out osd.1 on burnupi06 back in while the I/O was running hit the following:

2012-12-31 14:26:26.667816 7f6334ca6700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f6334ca6700 time 2012-12-31 14:26:26.449073
osd/PG.cc: 4891: FAILED assert(0 == "we got a bad state machine event")

ceph version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b)
1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xab) [0x68031b]
2: /usr/bin/ceph-osd() [0x6abcb6]
3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &(boost::statechart::detail::no_context<boost::statechart::event_base>::no_function(boost::statechart::event_base const&))> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xb3) [0x6c9ee3]
4: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x6ca024]
5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb9) [0x6cacf9]
6: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x6cda88]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x6b34fb]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x6b37d1]
9: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x347) [0x6751c7]
10: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&)+0x24a) [0x62280a]
11: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&)+0x10) [0x656350]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x81a13c]
13: (ThreadPool::WorkThread::entry()+0x10) [0x81bf40]
14: (()+0x7e9a) [0x7f6345538e9a]
15: (clone()+0x6d) [0x7f63438d1cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

The cluster is left in the same state.
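
Frame 3 of the trace shows the generic catch-all reaction in boost::statechart: after the specific custom_reaction<PG::FlushedEvt>, the Started state lists a transition<event_base, Crashed>, so any event the current state does not otherwise handle falls through to Crashed, whose constructor asserts. A minimal standalone sketch of that pattern, with state names borrowed from the trace (not the Ceph source):

#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <cassert>

namespace sc = boost::statechart;

// A hypothetical event the machine has no specific reaction for.
struct EvUnexpected : sc::event<EvUnexpected> {};

struct Started;
struct Crashed;
struct RecoveryMachine : sc::state_machine<RecoveryMachine, Started> {};

// Catch-all: any event without a more specific reaction transitions
// the machine to Crashed (the transition<event_base, Crashed> visible
// in frame 3 of the trace above).
struct Started : sc::simple_state<Started, RecoveryMachine> {
    typedef sc::transition<sc::event_base, Crashed> reactions;
};

// Entering Crashed is always a bug, mirroring osd/PG.cc:4891.
struct Crashed : sc::simple_state<Crashed, RecoveryMachine> {
    Crashed() { assert(0 == "we got a bad state machine event"); }
};

int main() {
    RecoveryMachine m;
    m.initiate();
    m.process_event(EvUnexpected()); // unhandled -> Crashed -> assert fires
}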

#3

Updated by Ian Colle over 11 years ago

  • Assignee set to Tamilarasi muthamizhan
#4

Updated by Tamilarasi muthamizhan over 11 years ago

  • Assignee changed from Tamilarasi muthamizhan to Samuel Just

Reproduced this on burnupi21.

#5

Updated by Sage Weil over 11 years ago

  • Status changed from New to Resolved
