Bug #3904

closed

FAILED assert(want_acting.empty())

Added by Faidon Liambotis over 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: bobtail
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph 0.56.1 on Ubuntu 12.04, standard ceph.com packages. Multiple OSDs started getting marked down or crashing; this particular crash may or may not be related to the rest:

2013-01-23 19:35:27.081565 7fecac4a0700 -1 osd/PG.cc: In function 'bool PG::choose_acting(int&)' thread 7fecac4a0700 time 2013-01-23 19:35:27.056824
osd/PG.cc: 1269: FAILED assert(want_acting.empty())

 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (PG::choose_acting(int&)+0x435) [0x67ace5]
 2: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x12b) [0x6971eb]
 3: (boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Peering> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x6cb8cc]
 4: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::GetLog, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0x88) [0x6cbad8]
 5: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x182) [0x6cbd12]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xfb) [0x6b23fb]
 7: (PG::RecoveryState::handle_event(boost::statechart::event_base const&, PG::RecoveryCtx*)+0x57) [0x6b2677]
 8: (PG::handle_activate_map(PG::RecoveryCtx*)+0xe4) [0x673554]
 9: (OSD::advance_pg(unsigned int, PG*, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x475) [0x61d145]
 10: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&)+0x228) [0x61d508]
 11: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&)+0x10) [0x6538d0]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x81d5dc]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x81f3e0]
 14: (()+0x7e9a) [0x7fecbd2bbe9a]
 15: (clone()+0x6d) [0x7fecbbd3fcbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

Not much info besides that, I'm afraid :(

Actions #1

Updated by Sage Weil over 11 years ago

I have a theory:

reset
started
primary
getinfo
got infos
getlog
calc_acting succeeds, choose_acting fails,
want_acting = something
WaitActingChange

later, AdvMap, want_acting member now down, Reset

Reset
not a new interval, so no call to start_peering_interval()
start, primary,
getinfo already has infos, goes straight to GetLog
and here, for reasons I can't explain, calc_acting() fails, when it succeeded before. :/
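The sequence in this theory can be replayed in a small toy model. This is only a sketch under stated assumptions: `ToyPG`, `invariant_held`, `reset_same_interval()` and `replay_theory()` are hypothetical names invented for illustration, the `calc_ok` flag stands in for whatever `calc_acting()` would return, and none of this is the actual code in osd/PG.cc. It only shows how a `want_acting` value left over from an earlier pass could trip the check from the backtrace if nothing clears it across a Reset that is not a new interval.

```cpp
#include <cassert>
#include <vector>

// Toy model only: ToyPG and all of its members are hypothetical,
// not Ceph's PG class.
struct ToyPG {
  std::vector<int> acting;       // current acting set
  std::vector<int> want_acting;  // set requested on an earlier pass
  bool invariant_held = true;    // records the check instead of aborting

  // Stand-in for choose_acting(). `calc_ok` plays the role of the
  // calc_acting() result and `want` the set it computed.
  bool choose_acting(bool calc_ok, const std::vector<int>& want) {
    if (!calc_ok) {
      // Mirrors the reported check: assert(want_acting.empty()).
      invariant_held = want_acting.empty();
      return false;
    }
    if (want != acting) {
      want_acting = want;  // remember the request, wait for acting to change
      return false;
    }
    want_acting.clear();
    return true;
  }

  // Assumption taken from the theory: a Reset that is not a new interval
  // leaves want_acting untouched.
  void reset_same_interval() {}
};

// Replays the theory: first pass calc succeeds with want != acting,
// then a Reset, then a pass where calc fails.
bool replay_theory() {
  ToyPG pg;
  pg.acting = {1, 2};
  pg.choose_acting(true, {1, 3});  // want_acting becomes {1, 3}
  pg.reset_same_interval();        // nothing clears want_acting
  pg.choose_acting(false, {});     // the check sees a non-empty want_acting
  return pg.invariant_held;
}
```

In this model `replay_theory()` returns false, i.e. the check would fire on the second pass, while a fresh `ToyPG` whose first pass fails keeps the invariant. The model says nothing about why `calc_acting()` would fail the second time, which is the part the theory leaves open.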

Actions #2

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Ian Colle about 11 years ago

  • Assignee set to Samuel Just
Actions #4

Updated by Samuel Just about 11 years ago

  • Priority changed from High to Urgent
Actions #5

Updated by Samuel Just about 11 years ago

  • Status changed from New to Fix Under Review

Sage's scenario is most likely correct, pushed wip_3904.

Actions #6

Updated by Samuel Just about 11 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to bobtail
Actions #7

Updated by Samuel Just about 11 years ago

  • Priority changed from Urgent to High
Actions #8

Updated by Samuel Just almost 11 years ago

  • Status changed from Pending Backport to Resolved