Bug #9696
Upgrade from 0.80.5 to 0.80.6 causes OSDs to go down with FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)
Status:
Closed
% Done:
100%
Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After an upgrade from 0.80.5 to 0.80.6, almost all OSDs went down after hitting the following failed assertion:
osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)
ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)
1: (PG::choose_acting(pg_shard_t&)+0x13e5) [0x76b355]
2: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x15c) [0x77905c]
3: (boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Peering> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x7b014c]
[...]
This assert was only added recently, in https://github.com/ceph/ceph/commit/92cfd370395385ca5537b5bc72220934c9f09026, and subsequently backported to firefly.
The cluster in question has pools with an unusually low number of PGs (8), and uses the osd max backfills default of 10. I'm not 100% sure what that assert does, but could it be that it doesn't correctly account for the possibility of osd_max_backfills > pg_num?
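To make the suspected accounting concrete, here is a minimal sketch (not the actual Ceph C++ code) of the invariant that osd/PG.cc:1284 appears to assert, using the variable names from the backtrace. The helper function and the OSD numbers are hypothetical; the assumption modeled here is that want_acting_backfill is meant to be the union of the acting set and the backfill set, which only satisfies the size arithmetic when the two sets are disjoint.

```python
def check_choose_acting_invariant(want_acting, want_backfill):
    """Model of the failed assert in PG::choose_acting.

    want_acting: shards chosen to serve the PG (hypothetical example values).
    want_backfill: shards that still need backfill (hypothetical example values).
    """
    # Assumed relationship: want_acting_backfill is the union of the two sets.
    want_acting_backfill = set(want_acting) | set(want_backfill)
    num_want_acting = len(want_acting)
    # The condition from the backtrace:
    # want_acting_backfill.size() - want_backfill.size() == num_want_acting
    return len(want_acting_backfill) - len(want_backfill) == num_want_acting

# Holds when acting and backfill shards are disjoint:
assert check_choose_acting_invariant({0, 1, 2}, {3, 4})

# Fails if a shard ends up counted in both sets -- one way the
# accounting could go wrong:
assert not check_choose_acting_invariant({0, 1, 2}, {2, 3})
```

If something like this is what the assert enforces, an OSD appearing in both the acting and backfill candidate lists would trip it, regardless of the osd_max_backfills setting.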
As it stands, the issue is pretty debilitating: a single run of unattended upgrades can take out an entire cluster, and unattended upgrades aren't exactly an insane thing to run for just a point release.