Bug #9696: Upgrade from 0.80.5 to 0.80.6 causes OSDs to go down with FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting) - Ceph - Ceph

Bug #9696

closed

Added by Florian Haas over 9 years ago. Updated over 9 years ago.

Status:

Resolved

Priority:

Immediate

Assignee:

Samuel Just

Category:

OSD

% Done:

100%

Source:

Community (user)

Severity:

1 - critical

Description

After an upgrade from 0.80.5 to 0.80.6, almost all OSDs went down after hitting the following failed assertion:

osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)

 ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)
 1: (PG::choose_acting(pg_shard_t&)+0x13e5) [0x76b355]
 2: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x15c) [0x77905c]
 3: (boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Peering> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x7b014c]
[...]

This assert was only added recently, in https://github.com/ceph/ceph/commit/92cfd370395385ca5537b5bc72220934c9f09026, and subsequently backported to firefly.

The cluster in question has pools with an unusually low number of PGs (8), and uses the osd max backfills default of 10. I'm not 100% sure what that assert does, but could it be that it doesn't correctly account for the possibility of osd_max_backfills > pg_num?
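For readers trying to parse the assertion: taken at face value, the expression only relates the sizes of three collections and does not mention osd_max_backfills or pg_num. The following is a minimal standalone sketch of that relation, not the Ceph implementation. Plain ints stand in for pg_shard_t, the NONE sentinel and the function name are assumptions made for illustration, and how the real collections are populated is not shown in this report.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical sentinel for an empty slot in the wanted acting vector.
constexpr int NONE = -1;

// Returns true when the relation from the failed assert holds:
//   want_acting_backfill.size() - want_backfill.size() == num_want_acting
// where num_want_acting counts the non-empty slots of `want`.
// Plain ints replace pg_shard_t; this is a model, not Ceph code.
bool acting_backfill_invariant_holds(const std::vector<int>& want,
                                     const std::set<int>& want_backfill,
                                     const std::set<int>& want_acting_backfill) {
  std::size_t num_want_acting = 0;
  for (int osd : want) {
    if (osd != NONE) {
      ++num_want_acting;
    }
  }
  return want_acting_backfill.size() - want_backfill.size() == num_want_acting;
}
```

With want = {0, 1}, want_backfill = {2} and want_acting_backfill = {0, 1, 2}, the relation holds (3 - 1 == 2). If, hypothetically, OSD 2 also appeared in want, the sets would deduplicate it while the count would not, and the relation would fail (3 - 1 != 3). Whether anything like that happens in the crashing cluster is not established here.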

As it stands, the issue is pretty debilitating as it can take out an entire cluster with one run of unattended upgrades, which isn't exactly an insane thing to do for just a point release.

Related issues 1 (0 open — 1 closed)

Updated by Loïc Dachary over 9 years ago

Priority changed from Normal to Urgent

Updated by Loïc Dachary over 9 years ago

For the record, http://tracker.ceph.com/issues/9715 hits the same assert under similar conditions in teuthology, and the full logs are available.

Updated by Florian Haas over 9 years ago

Whoa, wait -- Loïc, are you saying this actually failed a test case and still made it into a release in a stable version?

Updated by Ian Colle over 9 years ago

It actually failed a new test case AFTER it went out into a stable release version.

Updated by Loïc Dachary over 9 years ago

Status changed from New to In Progress

https://github.com/ceph/ceph/pull/2682

Updated by Loïc Dachary over 9 years ago

Status changed from In Progress to Fix Under Review

Updated by Loïc Dachary over 9 years ago

Running in gitbuilder under the branch wip-9696-compat-acting.

Updated by Samuel Just over 9 years ago

Status changed from Fix Under Review to 12
Assignee set to Samuel Just
Priority changed from Normal to Immediate

Updated by Loïc Dachary over 9 years ago

Status changed from 12 to Fix Under Review
Assignee deleted (Samuel Just)
Priority changed from Immediate to Normal

https://github.com/ceph/ceph/pull/2684

#10

Updated by Loïc Dachary over 9 years ago

Status changed from Fix Under Review to 12
Assignee set to Samuel Just
Priority changed from Normal to Immediate

#11

Updated by Samuel Just over 9 years ago

Can you restart one of the crashing osds with the following settings?

debug osd = 20
debug filestore = 20
debug ms = 1

As far as we can tell, this bug requires an older osd somewhere in the cluster (older than firefly) to trigger a compatibility mode.

#12

Updated by Samuel Just over 9 years ago

Status changed from 12 to Fix Under Review

#13

Updated by Samuel Just over 9 years ago

https://github.com/ceph/ceph/pull/2684/files

#14

Updated by Samuel Just over 9 years ago

wip-9696-firefly removes the assert on firefly; it's not valid for the compat case.

#15

Updated by Florian Haas over 9 years ago

Sam, I can confirm with certainty that this did not happen during an upgrade from dumpling. All nodes were running 0.80.5 prior to the upgrade.

#16

Updated by Florian Haas over 9 years ago

Also, could either Loïc or Sam explain what exact combination of circumstances causes this assert to trigger? I can't believe this would blow up every cluster out there no matter what, as we'd have heard about it from more users if that were the case. So it would be good if we were able to tell users, if X and Y and Z, DO NOT upgrade until 0.80.7 is out.

#17

Updated by Samuel Just over 9 years ago

Ok, can you reproduce with the logging above?

#18

Updated by Samuel Just over 9 years ago

Status changed from Fix Under Review to Pending Backport

#19

Updated by Yuri Weinstein over 9 years ago

I guess it's expected as backport is still pending.
Update:

In the run http://pulpito.front.sepia.ceph.com/teuthology-2014-10-10_19:00:01-upgrade:dumpling-x-firefly-distro-basic-multi/

Jobs '537891', '537900', '537901', '537908' failed with the same crash.

Assertion: osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)
ceph version 0.80.6-60-g5b5aba7 (5b5aba73031e901457ca27cf15600ce1ca90e258)
 1: (PG::choose_acting(pg_shard_t&)+0x1366) [0x750cc6]
 2: (PG::RecoveryState::GetLog::GetLog(boost::statechart::state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x11e) [0x750f3e]
 3: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::GetLog, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb8) [0x797618]
 4: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13a) [0x79797a]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x78246b]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x7825e4]
 7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1d1) [0x731771]
 8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x294) [0x6484e4]
 9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x68f618]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa54601]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xa554f0]
 12: (()+0x8182) [0x7fba568c2182]
 13: (clone()+0x6d) [0x7fba5503638d]

#20

Updated by Samuel Just over 9 years ago

Can you rerun with wip-sam-firefly-testing? (actually, ignore the firefly branch for the moment and use wip-sam-firefly-testing generally)

#21

Updated by Samuel Just over 9 years ago

I think pre-firefly mons would also suffice to cause this bug. Actually, if you upgrade the osds from pre-firefly to firefly and then the mons, I think you can still trigger the compat condition, since the osdmap will still have empty features from when the osds booted with pre-firefly mons. That's not really a bug, but it would cause the compat condition, and therefore the bug, even if there were no pre-firefly osds or mons left. It also explains why restarting the osds fixed it.
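The reasoning in this comment can be modelled roughly as follows. This is a sketch under stated assumptions, not the Ceph code: the OsdEntry struct, the FEATURE_FIREFLY bit, and the rule that compat mode is decided from the intersection of the features recorded for all up osds are hypothetical stand-ins chosen to illustrate how a stale, empty feature record could keep the compat condition active until the osd restarts.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical feature bit meaning "this osd registered as firefly-capable".
// The real flag name and value are not given in this thread.
constexpr std::uint64_t FEATURE_FIREFLY = 1ULL << 0;

// Hypothetical per-osd record as kept in the osdmap.
struct OsdEntry {
  bool up;
  std::uint64_t features;  // features recorded when the osd last booted
};

// Intersection of the recorded features of all osds that are up.
std::uint64_t up_osd_features(const std::vector<OsdEntry>& osds) {
  std::uint64_t common = ~0ULL;
  for (const OsdEntry& osd : osds) {
    if (osd.up) {
      common &= osd.features;
    }
  }
  return common;
}

// Assumed rule: compat mode is active while any up osd lacks the bit.
bool compat_mode(const std::vector<OsdEntry>& osds) {
  return (up_osd_features(osds) & FEATURE_FIREFLY) == 0;
}
```

In this model, an osd that runs firefly code but was recorded with features = 0 (because it booted against pre-firefly mons) keeps compat_mode() true for the whole cluster. Once it restarts and is recorded with FEATURE_FIREFLY, compat_mode() becomes false, which matches the observation that restarting the osds fixed it.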

#22

Updated by Yuri Weinstein over 9 years ago

Sage has scheduled a run on wip-9731-firefly: http://pulpito.front.sepia.ceph.com/teuthology-2014-10-10_16:50:01-upgrade:firefly-firefly-distro-basic-multi/

And I repointed upgrade/firefly and upgrade/dumpling-x suites to run off wip-9731-firefly tonight.

#23

Updated by Yuri Weinstein over 9 years ago

Results for the run teuthology-2014-10-11_19:00:02-upgrade:dumpling-x-wip-9731-firefly-distro-basic-multi

Jobs '540593' and '540598' still fail with the error:

Assertion: osd/PG.cc: 1284: FAILED assert(want_acting_backfill.size() - want_backfill.size() == num_want_acting)

#24

Updated by Samuel Just over 9 years ago

wip-9731-firefly does not have this patch.

#25

Updated by Yuri Weinstein over 9 years ago

I added a new test (#9758) and am testing it on the ceph-qa-suites branch 'wip_9758', which does step upgrades v0.80.4-v0.80.5-v0.80.6-firefly, in case we need it.

#26

Updated by Sage Weil over 9 years ago

Status changed from Pending Backport to Resolved

#27

Updated by Florian Haas over 9 years ago

Adding links for commits fixing this issue here for reference:

https://github.com/ceph/ceph/commit/9b18d99817c8b54e30dff45047dfe1b29871d659 (master)
https://github.com/ceph/ceph/commit/c5fd2d043ed4aa4fdb60fc19a284f51a86cef408 (firefly)
