Bug #23614

local_reserver double-reservation of backfilled pg

Added by Sage Weil almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

- pg gets reservations (incl local_reserver)
- pg backfills, finishes
- ...apparently never releases the reservation...
- eio injection, goes back into recovery
- tries to reserve again, asserts out with

     0> 2018-04-09 16:34:28.458 7efbea99a700 -1 /build/ceph-13.0.2-821-g69354a7/src/common/AsyncReserver.h: In function 'void AsyncReserver<T>::request_reservation(T, Context*, unsigned int, Context*) [with T = spg_t]' thread 7efbea99a700 time 2018-04-09 16:34:28.463507
/build/ceph-13.0.2-821-g69354a7/src/common/AsyncReserver.h: 190: FAILED assert(!queue_pointers.count(item) && !in_progress.count(item))

 ceph version 13.0.2-821-g69354a7 (69354a7c01f17f2d50b965ccef0b24337323fba5) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7efc0d31dea2]
 2: (()+0x2de077) [0x7efc0d31e077]
 3: (AsyncReserver<spg_t>::request_reservation(spg_t, Context*, unsigned int, Context*)+0x193) [0x5580aceee9e3]
 4: (PG::RecoveryState::WaitLocalRecoveryReserved::WaitLocalRecoveryReserved(boost::statechart::state<PG::RecoveryState::WaitLocalRecoveryReserved, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x285) [0x5580acebd155]
 5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x156) [0x5580acef3fb6]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x5580aced179b]
 7: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x25e) [0x5580aceb823e]
 8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x19c) [0x5580acdfb80c]
 9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x5580ad05aa40]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x964) [0x5580ace06cb4]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x452) [0x7efc0d322df2]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7efc0d324df0]
 13: (()+0x76ba) [0x7efc0bdcd6ba]
 14: (clone()+0x6d) [0x7efc0b5f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/sage-2018-04-09_15:35:18-rados-wip-sage-testing-2018-04-09-0826-distro-basic-smithi/2375009
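
To make the failing invariant concrete, here is a minimal sketch of the sequence in the description. It is a toy model, not Ceph's AsyncReserver: `ToyReserver`, `replay`, and the string item `"pg 1.0"` are hypothetical names chosen for illustration, and the check throws instead of aborting so the outcome can be observed. The only thing taken from the log is the invariant itself: an item must be neither queued nor in progress when it is requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <stdexcept>
#include <string>

// Toy stand-in for a reserver (hypothetical; priorities and callbacks omitted).
class ToyReserver {
  std::deque<std::string> queue_;      // waiting items, FIFO
  std::set<std::string> in_progress_;  // items currently holding a slot
  std::size_t max_allowed_;

  bool queued(const std::string& item) const {
    return std::find(queue_.begin(), queue_.end(), item) != queue_.end();
  }

  // Grant slots to waiting items while capacity remains.
  void do_queues() {
    while (in_progress_.size() < max_allowed_ && !queue_.empty()) {
      in_progress_.insert(queue_.front());
      queue_.pop_front();
    }
  }

public:
  explicit ToyReserver(std::size_t max_allowed) : max_allowed_(max_allowed) {
    assert(max_allowed_ > 0);
  }

  // Same invariant as the failed assert in the log:
  // the item must not already be queued or in progress.
  void request_reservation(const std::string& item) {
    if (queued(item) || in_progress_.count(item))
      throw std::logic_error("double reservation of " + item);
    queue_.push_back(item);
    do_queues();
  }

  // Drop the item wherever it is and let the next waiter in.
  void cancel_reservation(const std::string& item) {
    queue_.erase(std::remove(queue_.begin(), queue_.end(), item), queue_.end());
    in_progress_.erase(item);
    do_queues();
  }
};

// Replays: reserve, backfill finishes, error sends the pg back into recovery,
// reserve again. Returns true if the second request trips the invariant.
bool replay(bool release_after_backfill) {
  ToyReserver reserver(1);
  reserver.request_reservation("pg 1.0");
  if (release_after_backfill)
    reserver.cancel_reservation("pg 1.0");
  try {
    reserver.request_reservation("pg 1.0");
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

In this model `replay(false)` trips the check and `replay(true)` does not, which matches the suspicion that the reservation is still held when the pg re-enters recovery.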


Related issues

Related to RADOS - Bug #23490: luminous: osd: double recovery reservation for PG when EIO injected (while already recovering other objects?) Duplicate 03/28/2018
Copied to RADOS - Backport #24332: mimic: local_reserver double-reservation of backfilled pg Resolved
Copied to RADOS - Backport #24333: luminous: local_reserver double-reservation of backfilled pg Resolved

History

#1 Updated by Sage Weil almost 6 years ago

Looking through the code I don't see where the reservation is supposed to be released. I see releases for

- the peering > clean path
- log recovery > clean (no backfill)
- backfill revert, full, unfound, or other cancellation

...but not for the normal, successful backfill completion. I must be missing something?
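
The observation above can be written down as a small table, purely as a reading aid. This is a sketch under the assumption that the comment is accurate; the `releases` flags simply transcribe it and have not been checked against the Ceph source. `ExitPath`, `PathInfo`, `observed_paths`, and `paths_missing_release` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Ways a pg can leave recovery/backfill, as listed in the comment above.
enum class ExitPath {
  PeeringToClean,
  LogRecoveryToClean,   // no backfill involved
  BackfillCancelled,    // revert, full, unfound, or other cancellation
  BackfillCompleted     // normal, successful completion
};

struct PathInfo {
  ExitPath path;
  std::string label;
  bool releases;  // does this path release the local reservation?
};

// Flags transcribed from the comment; an assumption, not verified code behavior.
std::vector<PathInfo> observed_paths() {
  return {
    {ExitPath::PeeringToClean,     "peering > clean",                  true},
    {ExitPath::LogRecoveryToClean, "log recovery > clean",             true},
    {ExitPath::BackfillCancelled,  "backfill cancelled",               true},
    {ExitPath::BackfillCompleted,  "backfill completed successfully",  false},
  };
}

// Collect the labels of every path that leaves the reservation held.
std::vector<std::string> paths_missing_release() {
  std::vector<std::string> missing;
  for (const PathInfo& info : observed_paths()) {
    assert(!info.label.empty());
    if (!info.releases)
      missing.push_back(info.label);
  }
  return missing;
}
```

If the table is right, the only entry returned is the successful backfill path, which is the one path the crash sequence goes through.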

#2 Updated by Josh Durgin almost 6 years ago

  • Related to Bug #23490: luminous: osd: double recovery reservation for PG when EIO injected (while already recovering other objects?) added

#3 Updated by Josh Durgin almost 6 years ago

This may be the same root cause as http://tracker.ceph.com/issues/23490

#4 Updated by Neha Ojha almost 6 years ago

  • Assignee set to Neha Ojha

#5 Updated by Neha Ojha almost 6 years ago

  • Status changed from 12 to Fix Under Review

Explanation of the problem and resolution included in the pull request.

https://github.com/ceph/ceph/pull/22255

#6 Updated by Josh Durgin almost 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to mimic, luminous

#7 Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24332: mimic: local_reserver double-reservation of backfilled pg added

#8 Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24333: luminous: local_reserver double-reservation of backfilled pg added

#9 Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved
