Bug #10242
FAILED assert(backfill_targets.empty() || backfill_targets == want_backfill)
Status: closed
Description
version: 0.80.7
config:
osd max backfills = 1
osd recovery max active = 1
One OSD went down after hitting the following failed assertion:
2014-12-04 12:42:42.699457 7f5bdc28b700 -1 osd/PG.cc: In function 'bool PG::choose_acting(pg_shard_t&)' thread 7f5bdc28b700 time 2014-12-04 12:42:42.545712
osd/PG.cc: 1327: FAILED assert(backfill_targets.empty() || backfill_targets == want_backfill)
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: (PG::choose_acting(pg_shard_t&)+0x1380) [0x774380]
2: (PG::RecoveryState::Recovered::Recovered(boost::statechart::state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x269) [0x7974a9]
3: (boost::statechart::state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Active> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x7b98cc]
4: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Backfilling, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Recovered, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0x88) [0x7c1368]
5: (boost::statechart::simple_state<PG::RecoveryState::Backfilling, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x7c14d8]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7a4d6b]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x7a50c1]
8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x75b633]
9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2ce) [0x66713e]
10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6b6472]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa9b036]
12: (ThreadPool::WorkThread::entry()+0x10) [0xa9d050]
13: (()+0x7e9a) [0x7f5bf54dde9a]
14: (clone()+0x6d) [0x7f5bf3ceb3fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Wang Qiang over 9 years ago
Since the OSD went down, please raise the priority and severity.
Updated by Samuel Just over 9 years ago
Mmm, that assert is essentially saying that choose_acting is only called in two situations:
1) On a new interval. In this case, backfill_targets is empty.
2) After completing recovery and backfill (as happened here). In this case, either the acting set does not change, in which case backfill_targets must be empty, or backfill_targets was non-empty, in which case we exited the function above to wait for an acting set change from the mons.
There isn't enough information in the log to determine what happened. Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
on the crashing osd?
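The condition checked by the failed assert can be sketched in isolation as follows. This is only an illustration of the boolean expression from the assert message, not the code in osd/PG.cc: `Shard` is a hypothetical stand-in for `pg_shard_t`, `backfill_invariant_holds` is a made-up helper name, and using `std::set` for both containers is an assumption.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for Ceph's pg_shard_t; here it is just an OSD id.
using Shard = int;

// Illustrative helper (not part of Ceph): evaluates the same expression
// that appears in the failed assert. It holds when no backfill targets
// are recorded yet, or when the recorded targets are exactly the ones
// the new acting-set calculation wants to backfill.
bool backfill_invariant_holds(const std::set<Shard>& backfill_targets,
                              const std::set<Shard>& want_backfill) {
  return backfill_targets.empty() || backfill_targets == want_backfill;
}
```

With this sketch, an empty `backfill_targets` passes regardless of `want_backfill`, identical sets pass, and a non-empty `backfill_targets` that differs from `want_backfill` is the combination that would trip the assert. The log above does not show how that combination arose.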
Updated by Wang Qiang over 9 years ago
Since this came from a live production environment, it has not happened again, and I cannot reproduce it in my test/staging environment. All the information I can provide is already in the attachment.
It does not occur often, and I can work around it by restarting the OSD immediately. But it still takes the OSD down and triggers recovery, so I think this is still a problem.
Is there a graceful way to automatically restart the OSD in this case?
Updated by Sage Weil over 9 years ago
- Status changed from New to Need More Info
- Priority changed from Urgent to High
upstart will automatically restart the daemon. sysvinit will not. systemd will soon (in hammer).
Updated by Samuel Just over 9 years ago
- Status changed from Need More Info to Can't reproduce