Bug #9696

closed

Upgrade from 0.80.5 to 0.80.6 causes OSDs to go down with FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)

Added by Florian Haas over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
OSD
Target version:
-
% Done:
100%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After an upgrade from 0.80.5 to 0.80.6, almost all OSDs went down after hitting the following failed assertion:

osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)

 ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)
 1: (PG::choose_acting(pg_shard_t&)+0x13e5) [0x76b355]
 2: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x15c) [0x77905c]
 3: (boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Peering> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x7b014c]
[...]

This assert was only added recently, in https://github.com/ceph/ceph/commit/92cfd370395385ca5537b5bc72220934c9f09026, and subsequently backported to firefly.

The cluster in question has pools with an unusually low number of PGs (8), and uses the default osd_max_backfills value of 10. I'm not 100% sure what that assert checks, but could it be that it doesn't correctly account for the possibility of osd_max_backfills > pg_num?
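
For anyone trying to reason about the condition itself, here is a minimal standalone sketch of the arithmetic in the assert. It is not the code from osd/PG.cc: the helper name invariant_holds and the plain int OSD ids are hypothetical, and it assumes that want_acting_backfill is simply the set union of the wanted acting OSDs and the wanted backfill OSDs. Under that assumption the check can only pass when the two groups are disjoint and the acting list has no repeated entries.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for an OSD/shard id; the real code uses pg_shard_t.
using Osd = int;

// Evaluates the same condition as the failed assert for the given inputs.
// ASSUMPTION (not confirmed by the report): want_acting_backfill is the set
// union of the wanted acting OSDs and the wanted backfill OSDs.
bool invariant_holds(const std::vector<Osd>& want_acting,
                     const std::set<Osd>& want_backfill) {
  // Build the union; std::set silently drops duplicates.
  std::set<Osd> want_acting_backfill(want_acting.begin(), want_acting.end());
  want_acting_backfill.insert(want_backfill.begin(), want_backfill.end());

  const std::size_t num_want_acting = want_acting.size();

  // |acting U backfill| - |backfill| equals |acting| only if no OSD is in
  // both groups and no OSD is listed twice in want_acting.
  return want_acting_backfill.size() - want_backfill.size() == num_want_acting;
}
```

With these made-up inputs, invariant_holds({0, 1}, {2}) is true, while invariant_holds({0, 1}, {1, 2}) is false because OSD 1 is counted in both groups. Whether anything like that overlap is what happens on the affected cluster is a guess on top of the assumption above.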

As it stands, the issue is pretty debilitating as it can take out an entire cluster with one run of unattended upgrades, which isn't exactly an insane thing to do for just a point release.


Related issues 1 (0 open, 1 closed)

Has duplicate Ceph - Bug #9715: assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting) firefly (Duplicate, 10/09/2014)
