https://tracker.ceph.com/
https://tracker.ceph.com/favicon.ico
2017-06-27T15:18:50Z
Ceph
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=93816
2017-06-27T15:18:50Z
Sage Weil
sage@newdream.net
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-9 priority-5 priority-high3 closed" href="/issues/20000">Bug #20000</a>: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())</i> added</li></ul>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=93897
2017-06-28T15:07:37Z
Sage Weil
sage@newdream.net
<ul><li><strong>Assignee</strong> set to <i>Kefu Chai</i></li></ul>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=93899
2017-06-28T15:08:02Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Duplicated by</strong> <i><a class="issue tracker-1 status-10 priority-6 priority-high2 closed" href="/issues/20419">Bug #20419</a>: OSD aborts when shutting down </i> added</li></ul>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94095
2017-06-30T06:54:40Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><p>rerunning at <a class="external" href="http://pulpito.ceph.com/kchai-2017-06-30_06:49:46-rados-master-distro-basic-smithi/">http://pulpito.ceph.com/kchai-2017-06-30_06:49:46-rados-master-distro-basic-smithi/</a>, if we can consistently reproduce this issue. i will compile a test build with "-DPG_DEBUG_REFS=ON", and try again.</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94160
2017-07-01T15:06:46Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><p><a class="external" href="http://pulpito.ceph.com/kchai-2017-06-30_10:58:17-rados-wip-20432-kefu-distro-basic-smithi/">http://pulpito.ceph.com/kchai-2017-06-30_10:58:17-rados-wip-20432-kefu-distro-basic-smithi/</a></p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94180
2017-07-02T13:04:03Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><pre>
-779> 2017-06-30 11:21:33.800250 2a512700 5 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] enter Started/Primary/Active/Clean
-778> 2017-06-30 11:21:33.800466 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] finish_recovery
-777> 2017-06-30 11:21:33.800671 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] clear_recovery_state
-776> 2017-06-30 11:21:33.813503 2a512700 20 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] get_with_id: 1.c got id 75 (new) ref==4
-775> 2017-06-30 11:21:33.813838 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] share_pg_info
-774> 2017-06-30 11:21:33.814073 2a512700 15 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] publish_stats_to_osd 110:48
</pre>
<p>so the reference held by <code>C_PG_FinishRecovery</code> was not relinquished.</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94183
2017-07-02T18:51:11Z
Sage Weil
sage@newdream.net
<ul></ul><p>bisecting this... so far i've narrowed it down to something between f43c5fa055386455a263802b0908ddc96a95b1b0 and e97210e70acec3f045689d19a3dc5811859534b9 (see wip-20432-? branches)</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94184
2017-07-02T18:52:28Z
Sage Weil
sage@newdream.net
<ul></ul><p>I suspect 0780f9e67801f400d78ac704c65caaa98e968bbc, which changed when the CLEAN flag was set at the end of recovery.</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94185
2017-07-03T02:20:51Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><pre>
-4> 2017-06-30 11:21:42.425226 32522700 0 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0
lpr=108 pi=[93,108)/1 luod=0'0 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] dump_live_ids: 1.c live ids:
-3> 2017-06-30 11:21:42.425475 32522700 0 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0
lpr=108 pi=[93,108)/1 luod=0'0 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] id: 75, ceph version 12.0.3-2432-g41745b4 (41745b443ece9756a3eb8becb530f41552677
cdb) luminous (rc)
1: (PG::get_with_id()+0x7a) [0x65e8ea]
2: (PG::finish_recovery(std::list<Context*, std::allocator<Context*> >&)+0x14e) [0x65fede]
3: (PG::RecoveryState::Clean::Clean(boost::statechart::state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na
, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history
_mode)0>::my_context)+0x185) [0x682705]
4: (boost::statechart::simple_state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(
boost::statechart::event_base const&, void const*)+0x166) [0x6e3ab6]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_even
t(boost::statechart::event_base const&)+0x5b) [0x6c2b4b]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_q
ueued_events()+0x91) [0x6c2c81]
7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1df) [0x68fa2f]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x212) [0x5e5f02]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x641637]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0xb1ce05]
11: (ThreadPool::WorkThread::entry()+0x10) [0xb1ddd0]
12: (()+0x8184) [0xb920184]
13: (clone()+0x6d) [0xc86d37d]
</pre>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94187
2017-07-03T05:12:17Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><p>@Sage,</p>
<p>i reverted the changes introduced by 0780f9e67801f400d78ac704c65caaa98e968bbc and tested the verify test at:</p>
<p>/a/kchai-2017-07-03_04:02:08-rados-wip-20432-kefu-distro-basic-smithi/1355515/remote/smithi130/log/ceph-osd.4.log.gz</p>
<p>the leak persists:</p>
<pre>
-4> 2017-07-03 04:31:26.606726 3d5e6700 0 osd.4 pg_epoch: 128 pg[1.1( v 116'6 (0'0,116'6] local-lis/les=124/126 n=0 ec=10/10 lis/c 124/124 les/c/f 126/126/0 124/124/124) [4] r
=0 lpr=124 luod=0'0 crt=116'6 lcod 116'5 mlcod 0'0 active+undersized+degraded] dump_live_ids: 1.1 live ids:
-3> 2017-07-03 04:31:26.607146 3d5e6700 0 osd.4 pg_epoch: 128 pg[1.1( v 116'6 (0'0,116'6] local-lis/les=124/126 n=0 ec=10/10 lis/c 124/124 les/c/f 126/126/0 124/124/124) [4] r
=0 lpr=124 luod=0'0 crt=116'6 lcod 116'5 mlcod 0'0 active+undersized+degraded] id: 105, ceph version 12.0.3-2485-g6d9d7e0 (6d9d7e0558897f932d9be3146041f9aec3ca452f) lu
minous (rc)
1: (PG::get_with_id()+0x8d) [0x660dbd]
2: (PG::queue_recovery(bool)+0x36e) [0x661a0e]
3: (PG::RecoveryState::Active::react(PG::ActMap const&)+0x5a3) [0x688023]
4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::st
atechart::event_base const&, void const*)+0x24c) [0x6e7c9c]
5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl
_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boos
t::statechart::event_base const&, void const*)+0x6a) [0x6e418a]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_even
t(boost::statechart::event_base const&)+0x6b) [0x6c764b]
7: (PG::handle_activate_map(PG::RecoveryCtx*)+0x120) [0x694e30]
8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<TrackedIntPtr<PG>, std::less<TrackedIntPtr<PG> >, std::allocator<TrackedIntPtr<PG> > >*)+0
x71f) [0x5e3b3f]
9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x173) [0x5e4163]
10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642447]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3ead9]
12: (ThreadPool::WorkThread::entry()+0x10) [0xb3fab0]
13: (()+0x7dc5) [0xc7e5dc5]
14: (clone()+0x6d) [0xd72c76d]
</pre>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94202
2017-07-03T10:51:27Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><p>seems <code>PG::recovery_queued</code> is reset somehow after being set in <code>PG::queue_recovery()</code>, but the PG is not removed from the <code>OSDService::awaiting_throttle</code> by <code>PrimaryLogPG::on_change()</code>.</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94217
2017-07-03T15:58:49Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><pre>
-308> 2017-07-03 14:59:39.886566 35e4d700 20 osd.5 182 kicking pg 1.e
-307> 2017-07-03 14:59:39.886610 35e4d700 30 osd.5 pg_epoch: 182 pg[1.e( v 168'4 (0'0,168'4] local-lis/les=169/171 n=0 ec=29/9 lis/c 169/169 les/c/f 171/171/0 169/169/169) [5] r=
0 lpr=169 rq=1 crt=168'4 lcod 168'3 mlcod 0'0 active+undersized+degraded] lock
...
-98> 2017-07-03 14:59:42.057672 35e4d700 0 osd.5 pg_epoch: 182 pg[1.e(unlocked)] dump_live_ids: 1.e live ids:
-97> 2017-07-03 14:59:42.057820 35e4d700 0 osd.5 pg_epoch: 182 pg[1.e(unlocked)] id: 125, ceph version 12.0.3-2487-g6dd471e (6dd471ec0c36b20ed3211316f796603625cb6c29) luminous (rc)
1: (PG::get_with_id()+0x8d) [0x660dcd]
2: (PG::queue_recovery(bool)+0x36e) [0x661a1e]
3: (PG::RecoveryState::Active::react(PG::ActMap const&)+0x5a3) [0x688043]
4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x24c) [0x6e7c8c]
5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x6a) [0x6e444a]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x6c763b]
7: (PG::handle_activate_map(PG::RecoveryCtx*)+0x120) [0x694e50]
8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<TrackedIntPtr<PG>, std::less<TrackedIntPtr<PG> >, std::allocator<TrackedIntPtr<PG> > >*)+0x71f) [0x5e3b2f]
9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x173) [0x5e4153]
10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642437]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3ecc9]
12: (ThreadPool::WorkThread::entry()+0x10) [0xb3fca0]
13: (()+0x7dc5) [0xc5c7dc5]
14: (clone()+0x6d) [0xd50e73d]
</pre>
<p>see <a class="external" href="https://github.com/ceph/ceph-ci/commit/6dd471ec0c36b20ed3211316f796603625cb6c29">https://github.com/ceph/ceph-ci/commit/6dd471ec0c36b20ed3211316f796603625cb6c29</a></p>
<p>see /a/kchai-2017-07-03_14:30:36-rados-wip-20432-kefu-distro-basic-mira/1356142/remote/*/log/ceph-osd.5.log.gz</p>
<p>so pg 1.e was still in the "awaiting_throttle" queue when it was being kicked. so prbly, we should <code>osd->clear_queued_recovery(this)</code> also <code>if (recovery_queued)</code> in <code>PrimaryLogPG::on_shutdown()</code></p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94231
2017-07-04T05:49:01Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Fix Under Review</i></li><li><strong>Component(RADOS)</strong> <i>OSD</i> added</li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/16093">https://github.com/ceph/ceph/pull/16093</a></p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94233
2017-07-04T07:07:55Z
Josh Durgin
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Resolved</i></li></ul>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94323
2017-07-05T08:08:57Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>12</i></li></ul><p>/a/kchai-2017-07-05_04:38:56-rados-wip-kefu-testing2-distro-basic-mira/1363113</p>
<pre>
-3> 2017-07-05 06:35:34.518907 34899700 20 osd.4 157 kicking pg 1.1
-2> 2017-07-05 06:35:34.521959 34899700 30 osd.4 pg_epoch: 157 pg[1.1( empty local-lis/les=155/156 n=0 ec=9/9 lis/c 155/141 les/c/f 156/142/0 155/155/155) [4] r=0 lpr=155 pi=[1
41,155)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] lock
-1> 2017-07-05 06:35:34.523914 34899700 -1 osd.4 157 pgid 1.1 has ref count of 2
0> 2017-07-05 06:35:34.626884 34899700 -1 *** Caught signal (Aborted) **
in thread 34899700 thread_name:signal_handler
</pre>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94332
2017-07-05T11:33:57Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><pre>
-5> 2017-07-05 11:03:17.052787 31e89700 -1 osd.5 143 pgid 1.c has ref count of 2
-4> 2017-07-05 11:03:17.054266 31e89700 0 osd.5 pg_epoch: 143 pg[1.c( empty local-lis/les=137/139 n=0 ec=65/9 lis/c 137/107 les/c/f 139/110/0 137/137/137) [5] r=0 lpr=137 pi=[
107,137)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] dump_live_ids: 1.c live ids:
-3> 2017-07-05 11:03:17.054571 31e89700 0 osd.5 pg_epoch: 143 pg[1.c( empty local-lis/les=137/139 n=0 ec=65/9 lis/c 137/107 les/c/f 139/110/0 137/137/137) [5] r=0 lpr=137 pi=[
107,137)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] id: 35, ceph version 12.0.3-2523-g7bee03d (7bee03de494aa52dc7fbb37ba9a140ec7f84d3a9) luminous (rc)
1: (PG::get_with_id()+0x8d) [0x6616ad]
2: (PG::finish_recovery(std::list<Context*, std::allocator<Context*> >&)+0x15e) [0x662dfe]
3: (PG::RecoveryState::Clean::Clean(boost::statechart::state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na
, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history
_mode)0>::my_context)+0x185) [0x6862b5]
4: (boost::statechart::simple_state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(
boost::statechart::event_base const&, void const*)+0x18e) [0x6ea16e]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_even
t(boost::statechart::event_base const&)+0x6b) [0x6c7f5b]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_q
ueued_events()+0x91) [0x6c80b1]
7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1fe) [0x693b0e]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22a) [0x5e4b5a]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642d47]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3fcb9]
11: (ThreadPool::WorkThread::entry()+0x10) [0xb40c90]
12: (()+0x7dc5) [0xc5c8dc5]
13: (clone()+0x6d) [0xd50f73d]
</pre>
<p>like <a class="issue tracker-1 status-3 priority-7 priority-highest closed" title="Bug: pgid 0.7 has ref count of 2 (Resolved)" href="https://tracker.ceph.com/issues/20432#note-9">#20432-9</a>, the <code>on_safe</code> callbacks are not finished.</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94455
2017-07-06T06:36:26Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><p>at the end of <code>OSD::process_peering_events()</code>, <code>dispatch_context(rctx, 0, curmap, &handle)</code> is called, which just deletes <code>on_safe</code> because pg is always nullptr here. so the contexts in <code>on_safe</code> are never finished.</p>
<pre>
-1974> 2017-07-06 04:14:47.686471 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,
128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] needs_backfill does not need backfill
-1973> 2017-07-06 04:14:47.686909 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] activate all replicas clean, no recovery
..
-1950> 2017-07-06 04:14:47.696257 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] handle_peering_event: epoch_sent: 129 epoch_requested: 129 AllReplicasRecovered
...
-1941> 2017-07-06 04:14:47.700946 2a098700 5 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] enter Started/Primary/Active/Clean
</pre>
<p>but why the PG entered "Clean" even it had only one active OSD?</p>
<p>/a/kchai-2017-07-06_03:48:26-rados-wip-20432-kefu-distro-basic-mira/1365970/remote/*/log/ceph-osd.0.log.gz</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94483
2017-07-06T09:20:17Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Fix Under Review</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/16159">https://github.com/ceph/ceph/pull/16159</a></p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94515
2017-07-06T14:34:08Z
Kefu Chai
tchaikov@gmail.com
<ul></ul><pre>
-1216> 2017-07-06 12:46:55.396103 2c80e700 20 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi
=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] get_with_id: 1.39 got id 77 (new) ref==11
...
-551> 2017-07-06 12:46:55.640280 3ade1700 20 osd.5 672 kicking pg 1.39
...
-5> 2017-07-06 12:46:57.861134 3ade1700 -1 osd.5 672 pgid 1.39 has ref count of 2
-4> 2017-07-06 12:46:57.862444 3ade1700 0 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi
=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] dump_live_ids: 1.39 live ids:
-3> 2017-07-06 12:46:57.862792 3ade1700 0 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi
=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] id: 77, ceph version 12.0.3-2567-gcfc638c (cfc638c70816b69357715ba31c5689c9ab862807) luminous (rc)
1: (PG::get_with_id()+0x8d) [0x66198d]
2: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x619) [0x5da209]
3: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xf15) [0x5fa195]
4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8e9) [0xb3eff9]
5: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb41180]
6: (()+0x7dc5) [0xc5d7dc5]
7: (clone()+0x6d) [0xd51e76d]
</pre>
<p>OSDService::start_shutdown() does call recovery_sleep_timer.shutdown(), which in turn cancels and deletes all events callbacks. in we should be okay, but i will print more log to find out why (or if) the timer failed to clear the <code>recovery_requeue_callback</code> callback.</p>
<p>/a/kchai-2017-07-06_10:41:04-rados-wip-20432-kefu-distro-basic-smithi/1367534/remote/*/log/ceph-osd.5.log.gz</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94604
2017-07-07T02:22:06Z
Sage Weil
sage@newdream.net
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Resolved</i></li></ul>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94625
2017-07-07T05:49:57Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>Fix Under Review</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/16201">https://github.com/ceph/ceph/pull/16201</a></p>
<p>i swear: this is the last PR for this ticket!</p>
RADOS - Bug #20432: pgid 0.7 has ref count of 2
https://tracker.ceph.com/issues/20432?journal_id=94687
2017-07-07T14:42:16Z
Kefu Chai
tchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Resolved</i></li></ul>