Bug #9696

closed

Upgrade from 0.80.5 to 0.80.6 causes OSDs to go down with FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)

Added by Florian Haas over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: OSD
Target version: -
% Done: 100%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After an upgrade from 0.80.5 to 0.80.6, almost all OSDs went down after hitting the following failed assertion:

osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)

 ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)
 1: (PG::choose_acting(pg_shard_t&)+0x13e5) [0x76b355]
 2: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x15c) [0x77905c]
 3: (boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Peering> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x7b014c]
[...]

This assert was only added recently, in https://github.com/ceph/ceph/commit/92cfd370395385ca5537b5bc72220934c9f09026, and subsequently backported to firefly.

The cluster in question has pools with an unusually low number of PGs (8), and uses the osd max backfills default of 10. I'm not 100% sure what that assert does, but could it be that it doesn't correctly account for the possibility of osd_max_backfills > pg_num?

As it stands, the issue is pretty debilitating as it can take out an entire cluster with one run of unattended upgrades, which isn't exactly an insane thing to do for just a point release.


Related issues 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #9715: assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting) firefly | Duplicate | 10/09/2014

Actions #1

Updated by Loïc Dachary over 9 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Loïc Dachary over 9 years ago

For the record http://tracker.ceph.com/issues/9715 hits the same assert in similar conditions in teuthology and the full logs are available.

Actions #3

Updated by Florian Haas over 9 years ago

Whoa, wait -- Loïc, are you saying this actually failed a test case and still made it into a release in a stable version?

Actions #4

Updated by Ian Colle over 9 years ago

It actually failed a new test case AFTER it went out into a stable release version.

Actions #5

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to In Progress
Actions #6

Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Fix Under Review
Actions #7

Updated by Loïc Dachary over 9 years ago

running in gitbuilder under the branch wip-9696-compat-acting

Actions #8

Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to 12
  • Assignee set to Samuel Just
  • Priority changed from Normal to Immediate
Actions #9

Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to Fix Under Review
  • Assignee deleted (Samuel Just)
  • Priority changed from Immediate to Normal
Actions #10

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to 12
  • Assignee set to Samuel Just
  • Priority changed from Normal to Immediate
Actions #11

Updated by Samuel Just over 9 years ago

Can you restart one of the crashing osds with the following settings?

debug osd = 20
debug filestore = 20
debug ms = 1

As far as we can tell, this bug requires an older osd somewhere in the cluster (older than firefly) to trigger a compatibility mode.

Actions #12

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to Fix Under Review
Actions #14

Updated by Samuel Just over 9 years ago

wip-9696-firefly removes the assert on firefly, it's not valid for the compat case.
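
For readers asking what the assert checks: taken purely as arithmetic, it says that want_acting_backfill contains the backfill targets plus exactly num_want_acting other shards. The sketch below is a toy model of that arithmetic, not Ceph code. The Choice struct, the helper names, and the use of plain ints for shard ids are all hypothetical. It also assumes num_want_acting is simply the number of acting entries. The thread only states that the assert is not valid for the compat case; the overlap shown here is one way the equality can break, not a confirmed description of the compat path.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy model only: shard ids are plain ints, not Ceph's pg_shard_t.
struct Choice {
  std::vector<int> want_acting;  // shards chosen to serve the PG
  std::set<int> want_backfill;   // shards chosen as backfill targets
};

// Union of acting and backfill shards (named after want_acting_backfill).
std::set<int> acting_backfill(const Choice& c) {
  std::set<int> all(c.want_acting.begin(), c.want_acting.end());
  all.insert(c.want_backfill.begin(), c.want_backfill.end());
  return all;
}

// The arithmetic from the failed assert, with num_want_acting assumed
// to be the number of acting entries.
bool invariant_holds(const Choice& c) {
  const std::set<int> all = acting_backfill(c);
  const std::size_t num_want_acting = c.want_acting.size();
  return all.size() - c.want_backfill.size() == num_want_acting;
}
```

In this model the equality holds when the acting and backfill sets are disjoint, and fails as soon as a shard is counted in both, because the union no longer grows by one for each backfill target.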

Actions #15

Updated by Florian Haas over 9 years ago

Sam, I can confirm with certainty that this did not happen during an upgrade from dumpling. All nodes were running 0.80.5 prior to the upgrade.

Actions #16

Updated by Florian Haas over 9 years ago

Also, could either Loïc or Sam explain what exact combination of circumstances causes this assert to trigger? I can't believe this would blow up every cluster out there no matter what, as we'd have heard about it from more users if that were the case. So it would be good if we were able to tell users, if X and Y and Z, DO NOT upgrade until 0.80.7 is out.

Actions #17

Updated by Samuel Just over 9 years ago

Ok, can you reproduce with the logging above?

Actions #18

Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #19

Updated by Yuri Weinstein over 9 years ago

I guess it's expected as backport is still pending.
Update:

In the run http://pulpito.front.sepia.ceph.com/teuthology-2014-10-10_19:00:01-upgrade:dumpling-x-firefly-distro-basic-multi/

Jobs '537891', '537900', '537901', '537908' failed with the same crash.

Assertion: osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)
ceph version 0.80.6-60-g5b5aba7 (5b5aba73031e901457ca27cf15600ce1ca90e258)
 1: (PG::choose_acting(pg_shard_t&)+0x1366) [0x750cc6]
 2: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x11e) [0x750f3e]
 3: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::GetLog, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb8) [0x797618]
 4: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13a) [0x79797a]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x78246b]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x7825e4]
 7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1d1) [0x731771]
 8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x294) [0x6484e4]
 9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x68f618]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa54601]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xa554f0]
 12: (()+0x8182) [0x7fba568c2182]
 13: (clone()+0x6d) [0x7fba5503638d]
Actions #20

Updated by Samuel Just over 9 years ago

Can you rerun with wip-sam-firefly-testing? (actually, ignore the firefly branch for the moment and use wip-sam-firefly-testing generally)

Actions #21

Updated by Samuel Just over 9 years ago

I think pre-firefly mons would also suffice to cause this bug. Actually, if you upgrade the osds from pre-firefly to firefly and then the mons, I think you can still trigger the compat condition since the osdmap will still have empty features from when the osds booted with pre-firefly mons. That's not really a bug, but it would cause the compat condition and therefore the bug even if there were no pre-firefly osds or mons left. It also explains why restarting the osds fixed it.

Actions #22

Updated by Yuri Weinstein over 9 years ago

Sage has scheduled a run on wip-9731-firefly: http://pulpito.front.sepia.ceph.com/teuthology-2014-10-10_16:50:01-upgrade:firefly-firefly-distro-basic-multi/

And I repointed upgrade/firefly and upgrade/dumpling-x suites to run off wip-9731-firefly tonight.

Actions #23

Updated by Yuri Weinstein over 9 years ago

Results for the run teuthology-2014-10-11_19:00:02-upgrade:dumpling-x-wip-9731-firefly-distro-basic-multi

Jobs '540593' and '540598' still fail with the error:

Assertion: osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)

Actions #24

Updated by Samuel Just over 9 years ago

wip-9731-firefly does not have this patch.

Actions #25

Updated by Yuri Weinstein over 9 years ago

I added a new test (#9758) and am testing it on ceph-qa-suites branch 'wip_9758', which does step upgrades v0.80.4-v0.80.5-v0.80.6-firefly, in case we need it.

Actions #26

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved