Bug #10242

closed

FAILED assert(backfill_targets.empty() || backfill_targets == want_backfill)

Added by Wang Qiang over 9 years ago. Updated over 9 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

version: 0.80.7

config:
osd max backfills = 1
osd recovery max active = 1

One OSD went down after hitting the following failed assertion:

2014-12-04 12:42:42.699457 7f5bdc28b700 -1 osd/PG.cc: In function 'bool PG::choose_acting(pg_shard_t&)' thread 7f5bdc28b700 time 2014-12-04 12:42:42.545712
osd/PG.cc: 1327: FAILED assert(backfill_targets.empty() || backfill_targets == want_backfill)

ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: (PG::choose_acting(pg_shard_t&)+0x1380) [0x774380]
2: (PG::RecoveryState::Recovered::Recovered(boost::statechart::state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x269) [0x7974a9]
3: (boost::statechart::state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Active> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x7b98cc]
4: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Backfilling, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Recovered, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0x88) [0x7c1368]
5: (boost::statechart::simple_state<PG::RecoveryState::Backfilling, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x7c14d8]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7a4d6b]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x7a50c1]
8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x75b633]
9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2ce) [0x66713e]
10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6b6472]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa9b036]
12: (ThreadPool::WorkThread::entry()+0x10) [0xa9d050]
13: (()+0x7e9a) [0x7f5bf54dde9a]
14: (clone()+0x6d) [0x7f5bf3ceb3fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Files

ceph.log.tgz (538 KB): ceph.log and ceph-osd.log. Wang Qiang, 12/04/2014 12:19 AM
Actions #1

Updated by Wang Qiang over 9 years ago

Since the OSD went down, please raise the priority and severity.

Actions #2

Updated by Sage Weil over 9 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Samuel Just over 9 years ago

Mmm, that assert is essentially saying that choose_acting is only called in two situations:
1) On a new interval. In this case, backfill_targets is empty.
2) After completing recovery and backfill (as happened here). In this case, either the acting set does not change, in which case backfill_targets must be empty, or backfill_targets was non-empty, in which case we exited the function above to wait for an acting set change from the mons.
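
For illustration only, the condition the assert checks can be written as a small standalone predicate. This is a sketch, not the code in osd/PG.cc: shard_t and backfill_invariant_holds are hypothetical names, and the two collections are modelled as std::set<int> as an assumption rather than Ceph's own types.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for Ceph's pg_shard_t: just an OSD id.
using shard_t = int;

// Mirrors the asserted condition:
//   backfill_targets.empty() || backfill_targets == want_backfill
// Returns true when the condition holds, false when the assert would fire.
bool backfill_invariant_holds(const std::set<shard_t>& backfill_targets,
                              const std::set<shard_t>& want_backfill) {
  return backfill_targets.empty() || backfill_targets == want_backfill;
}
```

With these assumed types, an empty backfill_targets satisfies the condition whatever want_backfill contains, which covers the new-interval case. The crash reported here corresponds to the remaining combination: backfill_targets is non-empty and differs from want_backfill.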

There isn't enough information in the log to determine what happened. Can you reproduce with

debug osd = 20
debug filestore = 20
debug ms = 1

on the crashing osd?

Actions #4

Updated by Wang Qiang over 9 years ago

This happened in a live production environment and has not happened again, and I cannot reproduce it in my test/staging environment. All the information I can provide is already in the attachment.

It does not occur often, and I can work around it by starting the OSD again immediately. But it still takes the OSD down and triggers recovery, so I think this is still a problem.

Is there a graceful way to automatically restart the OSD in this case?

Actions #5

Updated by Sage Weil over 9 years ago

  • Status changed from New to Need More Info
  • Priority changed from Urgent to High

upstart will automatically restart the daemon. sysvinit will not. Soon systemd will (hammer).

Actions #6

Updated by Samuel Just over 9 years ago

  • Status changed from Need More Info to Can't reproduce