Bug #20432
pgid 0.7 has ref count of 2
Status: closed
Description
-9> 2017-06-27 08:51:50.589930 30db9700 20 osd.1 106 kicking pg 0.4c
-8> 2017-06-27 08:51:50.590911 30db9700 30 osd.1 pg_epoch: 106 pg[0.4c( empty local-lis/les=90/92 n=0 ec=85/1 lis/c 90/90 les/c/f 92/92/0 90/90/14) [5,1] r=1 lpr=92 crt=0'0 active] lock
-7> 2017-06-27 08:51:50.593668 30db9700 20 osd.1 106 kicking pg 0.26
-6> 2017-06-27 08:51:50.593756 30db9700 30 osd.1 pg_epoch: 106 pg[0.26( v 106'8 (0'0,106'8] local-lis/les=13/15 n=0 ec=1/1 lis/c 13/13 les/c/f 15/15/0 13/13/12) [1,4] r=0 lpr=13 luod=0'0 crt=106'8 lcod 106'7 mlcod 0'0 active+clean] lock
-5> 2017-06-27 08:51:50.600041 30db9700 20 osd.1 106 kicking pg 0.3d
-4> 2017-06-27 08:51:50.600129 30db9700 30 osd.1 pg_epoch: 106 pg[0.3d( v 106'12 (0'0,106'12] local-lis/les=13/16 n=0 ec=1/1 lis/c 13/13 les/c/f 16/16/0 13/13/12) [1,4] r=0 lpr=13 luod=0'0 crt=106'12 lcod 106'11 mlcod 0'0 active+clean] lock
-3> 2017-06-27 08:51:50.601188 30db9700 20 osd.1 106 kicking pg 0.7
-2> 2017-06-27 08:51:50.601270 30db9700 30 osd.1 pg_epoch: 106 pg[0.7( v 106'2 (0'0,106'2] local-lis/les=62/64 n=0 ec=1/1 lis/c 62/62 les/c/f 64/64/0 56/62/12) [1,0] r=0 lpr=62 luod=0'0 crt=106'2 lcod 106'1 mlcod 0'0 active+clean] lock
-1> 2017-06-27 08:51:50.602577 30db9700 -1 osd.1 106 pgid 0.7 has ref count of 2
0> 2017-06-27 08:51:50.641868 30db9700 -1 *** Caught signal (Aborted) **
in thread 30db9700 thread_name:signal_handler

ceph version 12.0.3-2303-g5b3f1e7 (5b3f1e7e836a4b8acbb848ba94a838bfd6f71408) luminous (dev)
1: (()+0x9c9879) [0xad1879]
2: (()+0x10330) [0xb91d330]
3: (gsignal()+0x37) [0xc79ec37]
4: (abort()+0x148) [0xc7a2028]
5: (OSD::shutdown()+0x18da) [0x5d49ea]
6: (OSD::handle_signal(int)+0x10c) [0x5d4ddc]
7: (SignalHandler::entry()+0x1b7) [0xad2b57]
8: (()+0x8184) [0xb915184]
9: (clone()+0x6d) [0xc86237d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/sage-2017-06-27_05:44:05-rados-wip-sage-testing-distro-basic-smithi/1331900
Updated by Sage Weil almost 7 years ago
- Related to Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty()) added
Updated by Kefu Chai almost 7 years ago
- Has duplicate Bug #20419: OSD aborts when shutting down added
Updated by Kefu Chai almost 7 years ago
Rerunning at http://pulpito.ceph.com/kchai-2017-06-30_06:49:46-rados-master-distro-basic-smithi/. If we can consistently reproduce this issue, I will compile a test build with "-DPG_DEBUG_REFS=ON" and try again.
Updated by Kefu Chai almost 7 years ago
-779> 2017-06-30 11:21:33.800250 2a512700 5 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] enter Started/Primary/Active/Clean
-778> 2017-06-30 11:21:33.800466 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] finish_recovery
-777> 2017-06-30 11:21:33.800671 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] clear_recovery_state
-776> 2017-06-30 11:21:33.813503 2a512700 20 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] get_with_id: 1.c got id 75 (new) ref==4
-775> 2017-06-30 11:21:33.813838 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] share_pg_info
-774> 2017-06-30 11:21:33.814073 2a512700 15 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] publish_stats_to_osd 110:48
So the reference held by C_PG_FinishRecovery was not relinquished.
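The PG_DEBUG_REFS mechanism used above can be pictured with a small mock (TrackedPG and the string tags here are illustrative stand-ins, not the real Ceph classes): every get is tagged with an id and recorded, so an unbalanced get, like the one held by the finish-recovery context, shows up in the live-id dump:

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal mock of the PG_DEBUG_REFS idea: each get() receives an id and is
// recorded with a description of the holder; dump_live_ids() (here, just the
// `live` map) then names exactly the holders that never called put().
struct TrackedPG {
  int ref = 0;
  int next_id = 0;
  std::map<int, std::string> live;   // id -> description of the holder

  int get_with_id(const std::string& who) {
    ++ref;
    int id = next_id++;
    live[id] = who;                  // remember who took this reference
    return id;
  }
  void put_with_id(int id) {
    --ref;
    live.erase(id);                  // a balanced put removes the record
  }
};
```

With this model, taking two refs and releasing only one leaves a single live id naming the leaking holder, which is how id 75 was traced to finish_recovery.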
Updated by Sage Weil almost 7 years ago
Bisecting this... so far I've narrowed it down to something between f43c5fa055386455a263802b0908ddc96a95b1b0 and e97210e70acec3f045689d19a3dc5811859534b9 (see wip-20432-? branches).
Updated by Sage Weil almost 7 years ago
I suspect 0780f9e67801f400d78ac704c65caaa98e968bbc, which changed when the CLEAN flag was set at the end of recovery.
Updated by Kefu Chai almost 7 years ago
-4> 2017-06-30 11:21:42.425226 32522700 0 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 luod=0'0 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] dump_live_ids: 1.c live ids:
-3> 2017-06-30 11:21:42.425475 32522700 0 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 luod=0'0 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] id: 75,

ceph version 12.0.3-2432-g41745b4 (41745b443ece9756a3eb8becb530f41552677cdb) luminous (rc)
1: (PG::get_with_id()+0x7a) [0x65e8ea]
2: (PG::finish_recovery(std::list<Context*, std::allocator<Context*> >&)+0x14e) [0x65fede]
3: (PG::RecoveryState::Clean::Clean(boost::statechart::state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x185) [0x682705]
4: (boost::statechart::simple_state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x166) [0x6e3ab6]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x6c2b4b]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x91) [0x6c2c81]
7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1df) [0x68fa2f]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x212) [0x5e5f02]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x641637]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0xb1ce05]
11: (ThreadPool::WorkThread::entry()+0x10) [0xb1ddd0]
12: (()+0x8184) [0xb920184]
13: (clone()+0x6d) [0xc86d37d]
Updated by Kefu Chai almost 7 years ago
I reverted the changes introduced by 0780f9e67801f400d78ac704c65caaa98e968bbc and re-ran the test; see
/a/kchai-2017-07-03_04:02:08-rados-wip-20432-kefu-distro-basic-smithi/1355515/remote/smithi130/log/ceph-osd.4.log.gz
The leak persists:
-4> 2017-07-03 04:31:26.606726 3d5e6700 0 osd.4 pg_epoch: 128 pg[1.1( v 116'6 (0'0,116'6] local-lis/les=124/126 n=0 ec=10/10 lis/c 124/124 les/c/f 126/126/0 124/124/124) [4] r=0 lpr=124 luod=0'0 crt=116'6 lcod 116'5 mlcod 0'0 active+undersized+degraded] dump_live_ids: 1.1 live ids:
-3> 2017-07-03 04:31:26.607146 3d5e6700 0 osd.4 pg_epoch: 128 pg[1.1( v 116'6 (0'0,116'6] local-lis/les=124/126 n=0 ec=10/10 lis/c 124/124 les/c/f 126/126/0 124/124/124) [4] r=0 lpr=124 luod=0'0 crt=116'6 lcod 116'5 mlcod 0'0 active+undersized+degraded] id: 105,

ceph version 12.0.3-2485-g6d9d7e0 (6d9d7e0558897f932d9be3146041f9aec3ca452f) luminous (rc)
1: (PG::get_with_id()+0x8d) [0x660dbd]
2: (PG::queue_recovery(bool)+0x36e) [0x661a0e]
3: (PG::RecoveryState::Active::react(PG::ActMap const&)+0x5a3) [0x688023]
4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x24c) [0x6e7c9c]
5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x6a) [0x6e418a]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x6c764b]
7: (PG::handle_activate_map(PG::RecoveryCtx*)+0x120) [0x694e30]
8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<TrackedIntPtr<PG>, std::less<TrackedIntPtr<PG> >, std::allocator<TrackedIntPtr<PG> > >*)+0x71f) [0x5e3b3f]
9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x173) [0x5e4163]
10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642447]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3ead9]
12: (ThreadPool::WorkThread::entry()+0x10) [0xb3fab0]
13: (()+0x7dc5) [0xc7e5dc5]
14: (clone()+0x6d) [0xd72c76d]
Updated by Kefu Chai almost 7 years ago
It seems PG::recovery_queued is somehow reset after being set in PG::queue_recovery(), but the PG is not removed from OSDService::awaiting_throttle by PrimaryLogPG::on_change().
Updated by Kefu Chai almost 7 years ago
-308> 2017-07-03 14:59:39.886566 35e4d700 20 osd.5 182 kicking pg 1.e
-307> 2017-07-03 14:59:39.886610 35e4d700 30 osd.5 pg_epoch: 182 pg[1.e( v 168'4 (0'0,168'4] local-lis/les=169/171 n=0 ec=29/9 lis/c 169/169 les/c/f 171/171/0 169/169/169) [5] r=0 lpr=169 rq=1 crt=168'4 lcod 168'3 mlcod 0'0 active+undersized+degraded] lock
...
-98> 2017-07-03 14:59:42.057672 35e4d700 0 osd.5 pg_epoch: 182 pg[1.e(unlocked)] dump_live_ids: 1.e live ids:
-97> 2017-07-03 14:59:42.057820 35e4d700 0 osd.5 pg_epoch: 182 pg[1.e(unlocked)] id: 125,

ceph version 12.0.3-2487-g6dd471e (6dd471ec0c36b20ed3211316f796603625cb6c29) luminous (rc)
1: (PG::get_with_id()+0x8d) [0x660dcd]
2: (PG::queue_recovery(bool)+0x36e) [0x661a1e]
3: (PG::RecoveryState::Active::react(PG::ActMap const&)+0x5a3) [0x688043]
4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x24c) [0x6e7c8c]
5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x6a) [0x6e444a]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x6c763b]
7: (PG::handle_activate_map(PG::RecoveryCtx*)+0x120) [0x694e50]
8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<TrackedIntPtr<PG>, std::less<TrackedIntPtr<PG> >, std::allocator<TrackedIntPtr<PG> > >*)+0x71f) [0x5e3b2f]
9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x173) [0x5e4153]
10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642437]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3ecc9]
12: (ThreadPool::WorkThread::entry()+0x10) [0xb3fca0]
13: (()+0x7dc5) [0xc5c7dc5]
14: (clone()+0x6d) [0xd50e73d]
see https://github.com/ceph/ceph-ci/commit/6dd471ec0c36b20ed3211316f796603625cb6c29
see /a/kchai-2017-07-03_14:30:36-rados-wip-20432-kefu-distro-basic-mira/1356142/remote/*/log/ceph-osd.5.log.gz
So pg 1.e was still in the "awaiting_throttle" queue when it was being kicked. So probably we should also call osd->clear_queued_recovery(this) if (recovery_queued) in PrimaryLogPG::on_shutdown().
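The shape of the proposed fix can be sketched with a toy model (the class layout and clear_queued_recovery here are illustrative, not the real Ceph code): a PG that queued itself for throttled recovery must dequeue itself on shutdown, otherwise the queue keeps holding it.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct PG;

// Toy stand-in for OSDService: holds PGs waiting for the recovery throttle.
struct OSDService {
  std::vector<PG*> awaiting_throttle;
  void clear_queued_recovery(PG* pg) {
    // erase-remove: drop every occurrence of this PG from the queue
    awaiting_throttle.erase(
        std::remove(awaiting_throttle.begin(), awaiting_throttle.end(), pg),
        awaiting_throttle.end());
  }
};

// Toy stand-in for a PG; in Ceph the queue entry pins a PG reference.
struct PG {
  OSDService* osd = nullptr;
  bool recovery_queued = false;
  void queue_recovery() {
    if (!recovery_queued) {
      recovery_queued = true;
      osd->awaiting_throttle.push_back(this);
    }
  }
  void on_shutdown() {
    if (recovery_queued) {                 // the step the bug was missing
      osd->clear_queued_recovery(this);
      recovery_queued = false;
    }
  }
};
```

Without the on_shutdown() branch, the queued pointer (and in the real code, the reference it pins) survives the PG being kicked, which matches the backtrace through PG::queue_recovery() above.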
Updated by Kefu Chai almost 7 years ago
- Status changed from 12 to Fix Under Review
- Component(RADOS) OSD added
Updated by Josh Durgin almost 7 years ago
- Status changed from Fix Under Review to Resolved
Updated by Kefu Chai almost 7 years ago
- Status changed from Resolved to 12
/a/kchai-2017-07-05_04:38:56-rados-wip-kefu-testing2-distro-basic-mira/1363113
-3> 2017-07-05 06:35:34.518907 34899700 20 osd.4 157 kicking pg 1.1
-2> 2017-07-05 06:35:34.521959 34899700 30 osd.4 pg_epoch: 157 pg[1.1( empty local-lis/les=155/156 n=0 ec=9/9 lis/c 155/141 les/c/f 156/142/0 155/155/155) [4] r=0 lpr=155 pi=[141,155)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] lock
-1> 2017-07-05 06:35:34.523914 34899700 -1 osd.4 157 pgid 1.1 has ref count of 2
0> 2017-07-05 06:35:34.626884 34899700 -1 *** Caught signal (Aborted) **
in thread 34899700 thread_name:signal_handler
Updated by Kefu Chai almost 7 years ago
-5> 2017-07-05 11:03:17.052787 31e89700 -1 osd.5 143 pgid 1.c has ref count of 2
-4> 2017-07-05 11:03:17.054266 31e89700 0 osd.5 pg_epoch: 143 pg[1.c( empty local-lis/les=137/139 n=0 ec=65/9 lis/c 137/107 les/c/f 139/110/0 137/137/137) [5] r=0 lpr=137 pi=[107,137)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] dump_live_ids: 1.c live ids:
-3> 2017-07-05 11:03:17.054571 31e89700 0 osd.5 pg_epoch: 143 pg[1.c( empty local-lis/les=137/139 n=0 ec=65/9 lis/c 137/107 les/c/f 139/110/0 137/137/137) [5] r=0 lpr=137 pi=[107,137)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] id: 35,

ceph version 12.0.3-2523-g7bee03d (7bee03de494aa52dc7fbb37ba9a140ec7f84d3a9) luminous (rc)
1: (PG::get_with_id()+0x8d) [0x6616ad]
2: (PG::finish_recovery(std::list<Context*, std::allocator<Context*> >&)+0x15e) [0x662dfe]
3: (PG::RecoveryState::Clean::Clean(boost::statechart::state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x185) [0x6862b5]
4: (boost::statechart::simple_state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x18e) [0x6ea16e]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x6c7f5b]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x91) [0x6c80b1]
7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1fe) [0x693b0e]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22a) [0x5e4b5a]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642d47]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3fcb9]
11: (ThreadPool::WorkThread::entry()+0x10) [0xb40c90]
12: (()+0x7dc5) [0xc5c8dc5]
13: (clone()+0x6d) [0xd50f73d]
Like #20432-9, the on_safe callbacks are not finished.
Updated by Kefu Chai almost 7 years ago
At the end of OSD::process_peering_events(), dispatch_context(rctx, 0, curmap, &handle) is called, which just deletes on_safe because pg is always nullptr here; so the contexts in on_safe are never finished.
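The difference between deleting a context and completing it can be demonstrated in miniature (Context and FinishRef below are simplified stand-ins, not the real Ceph classes): the reference release lives inside finish(), so a bare delete skips it.

```cpp
#include <cassert>
#include <list>

// Simplified Context: complete() runs the side effects and then frees;
// a plain `delete` frees without ever running finish().
struct Context {
  virtual void finish() = 0;
  virtual ~Context() = default;
  void complete() { finish(); delete this; }
};

// A callback whose finish() drops a reference, like the recovery contexts.
struct FinishRef : Context {
  int& refcount;
  explicit FinishRef(int& r) : refcount(r) {}
  void finish() override { --refcount; }
};

// Dispose of a queued list of contexts one of two ways and report the
// refcount left behind: just_delete models dispatch_context with pg == nullptr.
int dispose(bool just_delete) {
  int refcount = 1;
  std::list<Context*> on_safe{new FinishRef(refcount)};
  for (Context* c : on_safe) {
    if (just_delete)
      delete c;        // leak: finish() never runs, ref never dropped
    else
      c->complete();   // correct: ref dropped before the context is freed
  }
  return refcount;
}
```

dispose(true) leaves the reference pinned while dispose(false) releases it, which is exactly the difference between the observed leak and the expected behavior.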
-1974> 2017-07-06 04:14:47.686471 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] needs_backfill does not need backfill
-1973> 2017-07-06 04:14:47.686909 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] activate all replicas clean, no recovery
..
-1950> 2017-07-06 04:14:47.696257 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] handle_peering_event: epoch_sent: 129 epoch_requested: 129 AllReplicasRecovered
...
-1941> 2017-07-06 04:14:47.700946 2a098700 5 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] enter Started/Primary/Active/Clean
But why did the PG enter "Clean" even though it had only one active OSD?
/a/kchai-2017-07-06_03:48:26-rados-wip-20432-kefu-distro-basic-mira/1365970/remote/*/log/ceph-osd.0.log.gz
Updated by Kefu Chai almost 7 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai almost 7 years ago
-1216> 2017-07-06 12:46:55.396103 2c80e700 20 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] get_with_id: 1.39 got id 77 (new) ref==11
...
-551> 2017-07-06 12:46:55.640280 3ade1700 20 osd.5 672 kicking pg 1.39
...
-5> 2017-07-06 12:46:57.861134 3ade1700 -1 osd.5 672 pgid 1.39 has ref count of 2
-4> 2017-07-06 12:46:57.862444 3ade1700 0 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] dump_live_ids: 1.39 live ids:
-3> 2017-07-06 12:46:57.862792 3ade1700 0 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] id: 77,

ceph version 12.0.3-2567-gcfc638c (cfc638c70816b69357715ba31c5689c9ab862807) luminous (rc)
1: (PG::get_with_id()+0x8d) [0x66198d]
2: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x619) [0x5da209]
3: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xf15) [0x5fa195]
4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8e9) [0xb3eff9]
5: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb41180]
6: (()+0x7dc5) [0xc5d7dc5]
7: (clone()+0x6d) [0xd51e76d]
OSDService::start_shutdown() does call recovery_sleep_timer.shutdown(), which in turn cancels and deletes all event callbacks, so we should be okay. But I will add more logging to find out why (or whether) the timer failed to clear the recovery_requeue_callback callback.
/a/kchai-2017-07-06_10:41:04-rados-wip-20432-kefu-distro-basic-smithi/1367534/remote/*/log/ceph-osd.5.log.gz
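The invariant being checked here can be modeled with a toy timer (MiniTimer is an illustrative stand-in, not the real SafeTimer API): scheduling a callback takes a reference, and shutdown() must release every reference still pending, so no callback can pin a PG afterwards.

```cpp
#include <cassert>
#include <map>

// Toy timer: each scheduled event holds a reference (here, an int counter
// standing in for a PG refcount). shutdown() cancels all pending events and
// releases whatever they were holding, mirroring what
// recovery_sleep_timer.shutdown() is expected to do.
struct MiniTimer {
  std::map<int, int*> pending;   // event id -> refcount the callback holds
  int next = 0;

  int add_event(int* refcount) {
    ++*refcount;                 // scheduling pins a reference
    pending[next] = refcount;
    return next++;
  }
  void shutdown() {
    for (auto& p : pending)
      --*p.second;               // cancelling must release what was taken
    pending.clear();
  }
};
```

If shutdown() ran as modeled, the ref taken by the requeue callback would be gone after start_shutdown(); the extra logging is meant to confirm whether the real timer actually reached that point for this event.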
Updated by Sage Weil almost 7 years ago
- Status changed from Fix Under Review to Resolved
Updated by Kefu Chai almost 7 years ago
- Status changed from Resolved to Fix Under Review
https://github.com/ceph/ceph/pull/16201
I swear: this is the last PR for this ticket!
Updated by Kefu Chai almost 7 years ago
- Status changed from Fix Under Review to Resolved