Bug #3699


OSDs crashed in ReplicatedPG::sub_op_modify on a mixed-node cluster

Added by Tamilarasi muthamizhan over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: Samuel Just
Category: OSD
Target version: -
% Done: 0%
Source: Q/A

Description

Cluster: burnupi06 [running osd.1 on v0.55.1], burnupi07 [running osd.3, osd.4, mon.b on argonaut], burnupi08 [running osd.5, osd.6, mon.c, mds.a on argonaut]

Steps to reproduce:
1. All three nodes were running argonaut, with I/O being pumped into the cluster from a client.
2. While the I/O was in progress, upgraded osd.1 on burnupi06 to v0.55.1 and restarted it.
3. Marked osd.1 out of the cluster with "ceph osd out 1", at which point all the other OSDs still running argonaut crashed:

2012-12-29 14:05:37.654117 7fb605632700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequestRef)' thread 7fb605632700 time 2012-12-29 14:05:37.571094
osd/ReplicatedPG.cc: 4192: FAILED assert(is_active())

 ceph version 0.48.2argonaut-61-g9483a03 (commit:9483a032f750572586f146c696ec6501d3df0383)
 1: (ReplicatedPG::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0xbbd) [0x54369d]
 2: (ReplicatedPG::do_sub_op(std::tr1::shared_ptr<OpRequest>)+0xff) [0x55711f]
 3: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x9f) [0x60b89f]
 4: (OSD::dequeue_op(PG*)+0x238) [0x5cab78]
 5: (ThreadPool::worker()+0x4c4) [0x7aa554]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x5e383d]
 7: (()+0x7e9a) [0x7fb6164ace9a]
 8: (clone()+0x6d) [0x7fb614a5ecbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
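
For context, here is a minimal toy model of the invariant that fails above (an illustration only, not the actual Ceph source): a replica applies a replicated sub-op only while its PG is active, and marking a peer out forces the PG back through peering, so a sub-op dequeued in that window trips assert(is_active()).

#include <cassert>
#include <iostream>

// Toy model of the failing invariant (illustration only, not Ceph code):
// a PG applies replicated writes only while "active". "ceph osd out 1"
// forces the PG back through peering; a sub-op dequeued during that
// window fails the assert, as the argonaut OSDs did at
// osd/ReplicatedPG.cc:4192.
enum class PGState { Active, Peering };

struct PG {
    PGState state = PGState::Active;
    bool is_active() const { return state == PGState::Active; }

    void sub_op_modify() {
        assert(is_active()); // FAILED assert(is_active()) while peering
        std::cout << "applied replicated write\n";
    }
};

int main() {
    PG pg;
    pg.sub_op_modify();           // fine: PG is active
    pg.state = PGState::Peering;  // peer marked out, re-peering begins
    pg.sub_op_modify();           // sub-op handled mid-peering -> abort
}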

ubuntu@burnupi07:/var/log/ceph$ sudo cat /etc/ceph/ceph.conf 
[global]
    auth client required = none
    auth cluster required = none
    auth service required = none

[osd]
    osd journal size = 1000
    filestore xattr use omap = true

[osd.1]
    host = burnupi06

[osd.3]
    host = burnupi07

[osd.4]
    host = burnupi07

[osd.5]
    host = burnupi08

[osd.6]
    host = burnupi08

[mon.b]
    host = burnupi07
    mon addr = 10.214.134.38:6789

[mon.c]
    host = burnupi08
    mon addr = 10.214.134.36:6789

[mds.a]
    host = burnupi08

ubuntu@burnupi06:/var/log/ceph$ sudo cat /etc/ceph/ceph.conf
[global]
    auth client required = none
    auth cluster required = none
    auth service required = none

[osd]
    osd journal size = 1000
    filestore xattr use omap = true

[osd.1]
    osd min pg log entries = 10
    host = burnupi06

[osd.3]
    host = burnupi07

[osd.4]
    host = burnupi07

[osd.5]
    host = burnupi08

[osd.6]
    host = burnupi08

[mon.b]
    host = burnupi07
    mon addr = 10.214.134.38:6789

[mon.c]
    host = burnupi08
    mon addr = 10.214.134.36:6789

[mds.a]
    host = burnupi08

Leaving the cluster as it is for reference.

#1

Updated by Sage Weil over 11 years ago

  • Category set to OSD
  • Priority changed from Normal to Urgent
#2

Updated by Tamilarasi muthamizhan over 11 years ago

Bringing the marked-out osd.1 on burnupi06 back in while the I/O was running hit the following:

2012-12-31 14:26:26.667816 7f6334ca6700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f6334ca6700 time 2012-12-31 14:26:26.449073
osd/PG.cc: 4891: FAILED assert(0 == "we got a bad state machine event")

ceph version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b)
1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xab) [0x68031b]
2: /usr/bin/ceph-osd() [0x6abcb6]
3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &(boost::statechart::detail::no_context<boost::statechart::event_base>::no_function(boost::statechart::event_base const&))> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xb3) [0x6c9ee3]
4: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x6ca024]
5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb9) [0x6cacf9]
6: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x6cda88]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x6b34fb]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x6b37d1]
9: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x347) [0x6751c7]
10: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&)+0x24a) [0x62280a]
11: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&)+0x10) [0x656350]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x81a13c]
13: (ThreadPool::WorkThread::entry()+0x10) [0x81bf40]
14: (()+0x7e9a) [0x7f6345538e9a]
15: (clone()+0x6d) [0x7f63438d1cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

The cluster is left in the same state.
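
Frame 3 of the trace shows the generic catch-all reaction in boost::statechart: after the specific custom_reaction<PG::FlushedEvt>, the Started state lists a transition<event_base, Crashed>, so any event the current state does not otherwise handle falls through to Crashed, whose constructor asserts. A minimal standalone sketch of that pattern, with state names borrowed from the trace (not the Ceph source):

#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <cassert>

namespace sc = boost::statechart;

// A hypothetical event the machine has no specific reaction for.
struct EvUnexpected : sc::event<EvUnexpected> {};

struct Started;
struct Crashed;
struct RecoveryMachine : sc::state_machine<RecoveryMachine, Started> {};

// Catch-all: any event without a more specific reaction transitions
// the machine to Crashed (the transition<event_base, Crashed> visible
// in frame 3 of the trace above).
struct Started : sc::simple_state<Started, RecoveryMachine> {
    typedef sc::transition<sc::event_base, Crashed> reactions;
};

// Entering Crashed is always a bug, mirroring osd/PG.cc:4891.
struct Crashed : sc::simple_state<Crashed, RecoveryMachine> {
    Crashed() { assert(0 == "we got a bad state machine event"); }
};

int main() {
    RecoveryMachine m;
    m.initiate();
    m.process_event(EvUnexpected()); // unhandled -> Crashed -> assert fires
}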

#3

Updated by Ian Colle over 11 years ago

  • Assignee set to Tamilarasi muthamizhan
#4

Updated by Tamilarasi muthamizhan over 11 years ago

  • Assignee changed from Tamilarasi muthamizhan to Samuel Just

Reproduced this on burnupi21.

#5

Updated by Sage Weil over 11 years ago

  • Status changed from New to Resolved
