Bug #23490


luminous: osd: double recovery reservation for PG when EIO injected (while already recovering other objects?)

Added by Josh Durgin about 6 years ago. Updated almost 6 years ago.

Status:
Duplicate
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During a luminous test run, this was hit:

http://pulpito.ceph.com/yuriw-2018-03-27_21:16:27-rados-wip-yuri5-testing-2018-03-27-1535-luminous-distro-basic-smithi/2328855/

2018-03-27 22:55:57.475302 7fe60d42d700  0 bluestore(/var/lib/ceph/osd/ceph-1) read: inject random EIO
2018-03-27 22:55:57.475303 7fe60d42d700 10 bluestore(/var/lib/ceph/osd/ceph-1) read 2.6_head #2:61b476ff:::benchmark_data_smithi007_13420_object28822:head# 0x0~100000 = -5
2018-03-27 22:55:57.475305 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped] rep_repair_primary_object 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head peers osd.{1,3,4}
2018-03-27 22:55:57.475313 7fe60d42d700 15 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.6_head #2:61b476ff:::benchmark_data_smithi007_13420_object28822:head# _
2018-03-27 22:55:57.475315 7fe60d42d700 10 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.6_head #2:61b476ff:::benchmark_data_smithi007_13420_object28822:head# _ = 0
2018-03-27 22:55:57.475322 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] 2.6 unexpectedly missing 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head v38'1841, there should be a copy on shard 3
2018-03-27 22:55:57.475330 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] 2.6 unexpectedly missing 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head v38'1841, there should be a copy on shard 4
2018-03-27 22:55:57.475338 7fe60d42d700 -1 log_channel(cluster) log [ERR] : 2.6 missing primary copy of 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head, will try copies on 3,4
2018-03-27 22:55:57.475349 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1]  read got -11 / 0 bytes from obj 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head
2018-03-27 22:55:57.475354 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1]  dropping ondisk_read_lock
2018-03-27 22:55:57.475362 7fe60d42d700 10 osd.1 430 dequeue_op 0x5597583e3880 finish
2018-03-27 22:55:57.475366 7fe60d42d700 20 osd.1 op_wq(6) _process empty q, waiting
2018-03-27 22:55:57.475368 7fe610c34700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] handle_peering_event: epoch_sent: 430 epoch_requested: 430 DoRecovery
2018-03-27 22:55:57.475384 7fe610c34700  5 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] exit Started/Primary/Active/Clean 0.162021 1 0.000379
2018-03-27 22:55:57.475394 7fe610c34700  5 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+remapped m=1] enter Started/Primary/Active/WaitLocalRecoveryReserved
2018-03-27 22:55:57.475401 7fe610c34700 20 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+recovery_wait+remapped m=1] get_recovery_priority recovery priority for pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+recovery_wait+remapped m=1] is 180, state is 4456450
2018-03-27 22:55:57.478231 7fe610c34700 -1 /build/ceph-12.2.4-147-ge20b2b1/src/common/AsyncReserver.h: In function 'void AsyncReserver<T>::request_reservation(T, Context*, unsigned int, Context*) [with T = spg_t]' thread 7fe610c34700 time 2018-03-27 22:55:57.475410
/build/ceph-12.2.4-147-ge20b2b1/src/common/AsyncReserver.h: 190: FAILED assert(!queue_pointers.count(item) && !in_progress.count(item))

 ceph version 12.2.4-147-ge20b2b1 (e20b2b191eaf3130aa015625b5494dd1f691295e) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x559744e75ad2]
 2: (AsyncReserver<spg_t>::request_reservation(spg_t, Context*, unsigned int, Context*)+0x203) [0x559744994c23]
 3: (PG::RecoveryState::WaitLocalRecoveryReserved::WaitLocalRecoveryReserved(boost::statechart::state<PG::RecoveryState::WaitLocalRecoveryReserved, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x248) [0x5597449615a8]
 4: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::WaitLocalRecoveryReserved, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xaa) [0x55974499e5ca]
 5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x92) [0x55974499e842]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x559744976139]
 7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x38d) [0x559744938c3d]
 8: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x29e) [0x55974487db6e]
 9: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x5597448f1527]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb9) [0x559744e7c9d9]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x559744e7dae0]
 12: (()+0x76ba) [0x7fe62a7806ba]
 13: (clone()+0x6d) [0x7fe6297f741d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Not sure if this is fixed in master or not.
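The FAILED assert above fires in AsyncReserver::request_reservation when a reservation is requested for an spg_t that is already queued or already in progress. The following is a simplified, illustrative stand-in (not Ceph's actual AsyncReserver; names are invented) showing the precondition the real code asserts:

```cpp
#include <cassert>
#include <set>

// Simplified stand-in for Ceph's AsyncReserver<spg_t>: it tracks which
// items are waiting for or currently hold a reservation, and treats a
// second request for the same item as a caller bug.
class SimpleReserver {
  std::set<int> pending;      // queued, waiting for a slot
  std::set<int> in_progress;  // reservation granted
public:
  // Returns false on a duplicate so the failure mode is visible here;
  // the real AsyncReserver instead asserts
  // !queue_pointers.count(item) && !in_progress.count(item)
  // and aborts the OSD, which is the crash in this report.
  bool request(int item) {
    if (pending.count(item) || in_progress.count(item))
      return false;  // double reservation
    pending.insert(item);
    return true;
  }
  void grant(int item) {
    if (pending.erase(item))
      in_progress.insert(item);
  }
  void cancel(int item) {  // releasing lets the item be requested again
    pending.erase(item);
    in_progress.erase(item);
  }
};
```

In the log above, the EIO repair path queues a DoRecovery event for pg 2.6 while the PG apparently still holds or is requesting its local recovery reservation, so the second request_reservation call trips the assert.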


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #23614: local_reserver double-reservation of backfilled pg (Resolved, Neha Ojha, 04/09/2018)

Actions #1

Updated by Greg Farnum about 6 years ago

  • Subject changed from luminous: osd: double recovery reservation for PG when EIO injected to luminous: osd: double recovery reservation for PG when EIO injected (while already recovering other objects?)
Actions #2

Updated by Josh Durgin about 6 years ago

David, can you look at this when you get a chance? I think it's due to EIO triggering recovery when recovery is already being reserved/going on.
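If that diagnosis is right, the re-entry into the reservation-requesting state needs to be idempotent with respect to the reserver. A hedged sketch of one possible guard (illustrative only, not the actual fix; the Reserver type and reenter_recovery are invented names): cancel any reservation the PG already holds before requesting a new one, so the request can never be a duplicate.

```cpp
#include <cassert>
#include <set>

// Illustrative reserver (not Ceph code) that mirrors the failing assert.
struct Reserver {
  std::set<int> held;  // items that currently hold a reservation
  void request(int pg) {
    assert(!held.count(pg));  // mirrors the FAILED assert in this report
    held.insert(pg);
  }
  void cancel(int pg) { held.erase(pg); }
};

// Hypothetical re-entry path: cancel first, then request, so entering
// the recovery-reservation state twice cannot double-reserve.
void reenter_recovery(Reserver& r, int pg) {
  r.cancel(pg);   // no-op if nothing was held
  r.request(pg);  // guaranteed not to be a duplicate now
}
```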

Actions #3

Updated by Josh Durgin about 6 years ago

  • Assignee set to David Zafman
Actions #4

Updated by Josh Durgin about 6 years ago

  • Related to Bug #23614: local_reserver double-reservation of backfilled pg added
Actions #5

Updated by Josh Durgin almost 6 years ago

  • Status changed from New to Duplicate