Bug #9696
Upgrade from 0.80.5 to 0.80.6 causes OSDs to go down with FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)
Status:
Closed
% Done:
100%
Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After an upgrade from 0.80.5 to 0.80.6, almost all OSDs went down after hitting the following failed assertion:
osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)
ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)
1: (PG::choose_acting(pg_shard_t&)+0x13e5) [0x76b355]
2: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x15c) [0x77905c]
3: (boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Peering> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x7b014c]
[...]
This assert was only added recently, in https://github.com/ceph/ceph/commit/92cfd370395385ca5537b5bc72220934c9f09026, and subsequently backported to firefly.
The cluster in question has pools with an unusually low number of PGs (8), and uses the osd max backfills default of 10. I'm not 100% sure what that assert does, but could it be that it doesn't correctly account for the possibility of osd_max_backfills > pg_num?
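To make the suspected accounting concrete, here is a minimal sketch (not the actual Ceph C++ code) of the invariant that osd/PG.cc:1284 appears to assert, using the variable names from the backtrace. The helper function and the OSD numbers are hypothetical; the assumption modeled here is that want_acting_backfill is meant to be the union of the acting set and the backfill set, which only satisfies the size arithmetic when the two sets are disjoint.

```python
def check_choose_acting_invariant(want_acting, want_backfill):
    """Model of the failed assert in PG::choose_acting.

    want_acting: shards chosen to serve the PG (hypothetical example values).
    want_backfill: shards that still need backfill (hypothetical example values).
    """
    # Assumed relationship: want_acting_backfill is the union of the two sets.
    want_acting_backfill = set(want_acting) | set(want_backfill)
    num_want_acting = len(want_acting)
    # The condition from the backtrace:
    # want_acting_backfill.size() - want_backfill.size() == num_want_acting
    return len(want_acting_backfill) - len(want_backfill) == num_want_acting

# Holds when acting and backfill shards are disjoint:
assert check_choose_acting_invariant({0, 1, 2}, {3, 4})

# Fails if a shard ends up counted in both sets -- one way the
# accounting could go wrong:
assert not check_choose_acting_invariant({0, 1, 2}, {2, 3})
```

If something like this is what the assert enforces, an OSD appearing in both the acting and backfill candidate lists would trip it, regardless of the osd_max_backfills setting.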
As it stands, the issue is pretty debilitating: a single run of unattended upgrades can take out an entire cluster, and unattended upgrades aren't exactly an insane thing to run for just a point release.