Bug #23490
luminous: osd: double recovery reservation for PG when EIO injected (while already recovering other objects?)
Status: Duplicate
Priority: High
Assignee: David Zafman
% Done: 0%
Source: Q/A
Regression: No
Severity: 3 - minor
Description
During a luminous test run, this was hit:
2018-03-27 22:55:57.475302 7fe60d42d700 0 bluestore(/var/lib/ceph/osd/ceph-1) read: inject random EIO
2018-03-27 22:55:57.475303 7fe60d42d700 10 bluestore(/var/lib/ceph/osd/ceph-1) read 2.6_head #2:61b476ff:::benchmark_data_smithi007_13420_object28822:head# 0x0~100000 = -5
2018-03-27 22:55:57.475305 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped] rep_repair_primary_object 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head peers osd.{1,3,4}
2018-03-27 22:55:57.475313 7fe60d42d700 15 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.6_head #2:61b476ff:::benchmark_data_smithi007_13420_object28822:head# _
2018-03-27 22:55:57.475315 7fe60d42d700 10 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.6_head #2:61b476ff:::benchmark_data_smithi007_13420_object28822:head# _ = 0
2018-03-27 22:55:57.475322 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] 2.6 unexpectedly missing 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head v38'1841, there should be a copy on shard 3
2018-03-27 22:55:57.475330 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] 2.6 unexpectedly missing 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head v38'1841, there should be a copy on shard 4
2018-03-27 22:55:57.475338 7fe60d42d700 -1 log_channel(cluster) log [ERR] : 2.6 missing primary copy of 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head, will try copies on 3,4
2018-03-27 22:55:57.475349 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] read got -11 / 0 bytes from obj 2:61b476ff:::benchmark_data_smithi007_13420_object28822:head
2018-03-27 22:55:57.475354 7fe60d42d700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] dropping ondisk_read_lock
2018-03-27 22:55:57.475362 7fe60d42d700 10 osd.1 430 dequeue_op 0x5597583e3880 finish
2018-03-27 22:55:57.475366 7fe60d42d700 20 osd.1 op_wq(6) _process empty q, waiting
2018-03-27 22:55:57.475368 7fe610c34700 10 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] handle_peering_event: epoch_sent: 430 epoch_requested: 430 DoRecovery
2018-03-27 22:55:57.475384 7fe610c34700 5 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+clean+remapped m=1] exit Started/Primary/Active/Clean 0.162021 1 0.000379
2018-03-27 22:55:57.475394 7fe610c34700 5 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+remapped m=1] enter Started/Primary/Active/WaitLocalRecoveryReserved
2018-03-27 22:55:57.475401 7fe610c34700 20 osd.1 pg_epoch: 430 pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+recovery_wait+remapped m=1] get_recovery_priority recovery priority for pg[2.6( v 40'2621 (30'1100,40'2621] local-lis/les=421/422 n=2621 ec=18/18 lis/c 421/421 les/c/f 422/430/0 420/421/421) [4,1]/[1,3] r=0 lpr=421 crt=40'2621 mlcod 0'0 active+recovery_wait+remapped m=1] is 180, state is 4456450
2018-03-27 22:55:57.478231 7fe610c34700 -1 /build/ceph-12.2.4-147-ge20b2b1/src/common/AsyncReserver.h: In function 'void AsyncReserver<T>::request_reservation(T, Context*, unsigned int, Context*) [with T = spg_t]' thread 7fe610c34700 time 2018-03-27 22:55:57.475410
/build/ceph-12.2.4-147-ge20b2b1/src/common/AsyncReserver.h: 190: FAILED assert(!queue_pointers.count(item) && !in_progress.count(item))
ceph version 12.2.4-147-ge20b2b1 (e20b2b191eaf3130aa015625b5494dd1f691295e) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x559744e75ad2]
 2: (AsyncReserver<spg_t>::request_reservation(spg_t, Context*, unsigned int, Context*)+0x203) [0x559744994c23]
 3: (PG::RecoveryState::WaitLocalRecoveryReserved::WaitLocalRecoveryReserved(boost::statechart::state<PG::RecoveryState::WaitLocalRecoveryReserved, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x248) [0x5597449615a8]
 4: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::WaitLocalRecoveryReserved, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xaa) [0x55974499e5ca]
 5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x92) [0x55974499e842]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x559744976139]
 7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x38d) [0x559744938c3d]
 8: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x29e) [0x55974487db6e]
 9: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x5597448f1527]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb9) [0x559744e7c9d9]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x559744e7dae0]
 12: (()+0x76ba) [0x7fe62a7806ba]
 13: (clone()+0x6d) [0x7fe6297f741d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Not sure if this is fixed in master or not.
Updated by Greg Farnum about 6 years ago
- Subject changed from luminous: osd: double recovery reservation for PG when EIO injected to luminous: osd: double recovery reservation for PG when EIO injected (while already recovering other objects?)
Updated by Josh Durgin about 6 years ago
David, can you look at this when you get a chance? I think it's due to EIO-triggered repair starting recovery while a recovery reservation is already queued or in progress.
Updated by Josh Durgin about 6 years ago
- Related to Bug #23614: local_reserver double-reservation of backfilled pg added