Bug #20432: pgid 0.7 has ref count of 2 - RADOS - Ceph

Actions

Copy link

Bug #20432

closed

pgid 0.7 has ref count of 2

Added by Sage Weil almost 7 years ago. Updated almost 7 years ago.

Status:

Resolved

Priority:

Immediate

Assignee:

Kefu Chai

Category:

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

OSD

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

    -9> 2017-06-27 08:51:50.589930 30db9700 20 osd.1 106  kicking pg 0.4c
    -8> 2017-06-27 08:51:50.590911 30db9700 30 osd.1 pg_epoch: 106 pg[0.4c( empty local-lis/les=90/92 n=0 ec=85/1 lis/c 90/90 les/c/f 92/92/0 90/90/14) [5,1] r=1 lpr=92 crt=0'0 active] lock
    -7> 2017-06-27 08:51:50.593668 30db9700 20 osd.1 106  kicking pg 0.26
    -6> 2017-06-27 08:51:50.593756 30db9700 30 osd.1 pg_epoch: 106 pg[0.26( v 106'8 (0'0,106'8] local-lis/les=13/15 n=0 ec=1/1 lis/c 13/13 les/c/f 15/15/0 13/13/12) [1,4] r=0 lpr=13 luod=0'0 crt=106'8 lcod 106'7 mlcod 0'0 active+clean] lock
    -5> 2017-06-27 08:51:50.600041 30db9700 20 osd.1 106  kicking pg 0.3d
    -4> 2017-06-27 08:51:50.600129 30db9700 30 osd.1 pg_epoch: 106 pg[0.3d( v 106'12 (0'0,106'12] local-lis/les=13/16 n=0 ec=1/1 lis/c 13/13 les/c/f 16/16/0 13/13/12) [1,4] r=0 lpr=13 luod=0'0 crt=106'12 lcod 106'11 mlcod 0'0 active+clean] lock
    -3> 2017-06-27 08:51:50.601188 30db9700 20 osd.1 106  kicking pg 0.7
    -2> 2017-06-27 08:51:50.601270 30db9700 30 osd.1 pg_epoch: 106 pg[0.7( v 106'2 (0'0,106'2] local-lis/les=62/64 n=0 ec=1/1 lis/c 62/62 les/c/f 64/64/0 56/62/12) [1,0] r=0 lpr=62 luod=0'0 crt=106'2 lcod 106'1 mlcod 0'0 active+clean] lock
    -1> 2017-06-27 08:51:50.602577 30db9700 -1 osd.1 106 pgid 0.7 has ref count of 2
     0> 2017-06-27 08:51:50.641868 30db9700 -1 *** Caught signal (Aborted) **
 in thread 30db9700 thread_name:signal_handler

 ceph version 12.0.3-2303-g5b3f1e7 (5b3f1e7e836a4b8acbb848ba94a838bfd6f71408) luminous (dev)
 1: (()+0x9c9879) [0xad1879]
 2: (()+0x10330) [0xb91d330]
 3: (gsignal()+0x37) [0xc79ec37]
 4: (abort()+0x148) [0xc7a2028]
 5: (OSD::shutdown()+0x18da) [0x5d49ea]
 6: (OSD::handle_signal(int)+0x10c) [0x5d4ddc]
 7: (SignalHandler::entry()+0x1b7) [0xad2b57]
 8: (()+0x8184) [0xb915184]
 9: (clone()+0x6d) [0xc86237d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/sage-2017-06-27_05:44:05-rados-wip-sage-testing-distro-basic-smithi/1331900

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Sage Weil almost 7 years ago

Related to Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty()) added

Actions

Copy link

Updated by Sage Weil almost 7 years ago

Assignee set to Kefu Chai

Actions

Copy link

Updated by Kefu Chai almost 7 years ago

Has duplicate Bug #20419: OSD aborts when shutting down added

Actions

Copy link

Updated by Kefu Chai almost 7 years ago

rerunning at http://pulpito.ceph.com/kchai-2017-06-30_06:49:46-rados-master-distro-basic-smithi/, if we can consistently reproduce this issue. i will compile a test build with "-DPG_DEBUG_REFS=ON", and try again.

Actions

Copy link

Updated by Kefu Chai almost 7 years ago

http://pulpito.ceph.com/kchai-2017-06-30_10:58:17-rados-wip-20432-kefu-distro-basic-smithi/

Actions

Copy link

Updated by Kefu Chai almost 7 years ago

  -779> 2017-06-30 11:21:33.800250 2a512700  5 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] enter Started/Primary/Active/Clean
  -778> 2017-06-30 11:21:33.800466 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] finish_recovery
  -777> 2017-06-30 11:21:33.800671 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] clear_recovery_state
  -776> 2017-06-30 11:21:33.813503 2a512700 20 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] get_with_id: 1.c got id 75 (new) ref==4
  -775> 2017-06-30 11:21:33.813838 2a512700 10 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] share_pg_info
  -774> 2017-06-30 11:21:33.814073 2a512700 15 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0 lpr=108 pi=[93,108)/1 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded] publish_stats_to_osd 110:48

so the reference held by C_PG_FinishRecovery was not relinquished.

Actions

Copy link

Updated by Sage Weil almost 7 years ago

bisecting this... so far i've narrowed it down to something between f43c5fa055386455a263802b0908ddc96a95b1b0 and e97210e70acec3f045689d19a3dc5811859534b9 (see wip-20432-? branches)

Actions

Copy link

Updated by Sage Weil almost 7 years ago

I suspect 0780f9e67801f400d78ac704c65caaa98e968bbc, which changed when the CLEAN flag was set at the end of recovery.

Actions

Copy link

Updated by Kefu Chai almost 7 years ago

    -4> 2017-06-30 11:21:42.425226 32522700  0 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0
lpr=108 pi=[93,108)/1 luod=0'0 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded]   dump_live_ids: 1.c live ids:
    -3> 2017-06-30 11:21:42.425475 32522700  0 osd.0 pg_epoch: 110 pg[1.c( v 107'8 (0'0,107'8] local-lis/les=108/110 n=0 ec=68/9 lis/c 108/93 les/c/f 110/94/0 108/108/108) [0] r=0
lpr=108 pi=[93,108)/1 luod=0'0 crt=107'8 lcod 107'7 mlcod 0'0 active+undersized+degraded]           id: 75, ceph version 12.0.3-2432-g41745b4 (41745b443ece9756a3eb8becb530f41552677
cdb) luminous (rc)
 1: (PG::get_with_id()+0x7a) [0x65e8ea]
 2: (PG::finish_recovery(std::list<Context*, std::allocator<Context*> >&)+0x14e) [0x65fede]
 3: (PG::RecoveryState::Clean::Clean(boost::statechart::state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na
, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history
_mode)0>::my_context)+0x185) [0x682705]
 4: (boost::statechart::simple_state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
 mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(
boost::statechart::event_base const&, void const*)+0x166) [0x6e3ab6]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_even
t(boost::statechart::event_base const&)+0x5b) [0x6c2b4b]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_q
ueued_events()+0x91) [0x6c2c81]
 7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1df) [0x68fa2f]
 8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x212) [0x5e5f02]
 9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x641637]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0xb1ce05]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xb1ddd0]
 12: (()+0x8184) [0xb920184]
 13: (clone()+0x6d) [0xc86d37d]

Actions

Copy link

#10

Updated by Kefu Chai almost 7 years ago

@Sage Weil,

i reverted the changes introduced by 0780f9e67801f400d78ac704c65caaa98e968bbc and tested the verify test at:

/a/kchai-2017-07-03_04:02:08-rados-wip-20432-kefu-distro-basic-smithi/1355515/remote/smithi130/log/ceph-osd.4.log.gz

the leak persists:

   -4> 2017-07-03 04:31:26.606726 3d5e6700  0 osd.4 pg_epoch: 128 pg[1.1( v 116'6 (0'0,116'6] local-lis/les=124/126 n=0 ec=10/10 lis/c 124/124 les/c/f 126/126/0 124/124/124) [4] r
=0 lpr=124 luod=0'0 crt=116'6 lcod 116'5 mlcod 0'0 active+undersized+degraded]      dump_live_ids: 1.1 live ids:
    -3> 2017-07-03 04:31:26.607146 3d5e6700  0 osd.4 pg_epoch: 128 pg[1.1( v 116'6 (0'0,116'6] local-lis/les=124/126 n=0 ec=10/10 lis/c 124/124 les/c/f 126/126/0 124/124/124) [4] r
=0 lpr=124 luod=0'0 crt=116'6 lcod 116'5 mlcod 0'0 active+undersized+degraded]              id: 105, ceph version 12.0.3-2485-g6d9d7e0 (6d9d7e0558897f932d9be3146041f9aec3ca452f) lu
minous (rc)
 1: (PG::get_with_id()+0x8d) [0x660dbd]
 2: (PG::queue_recovery(bool)+0x36e) [0x661a0e]
 3: (PG::RecoveryState::Active::react(PG::ActMap const&)+0x5a3) [0x688023]
 4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::st
atechart::event_base const&, void const*)+0x24c) [0x6e7c9c]
 5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl
_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boos
t::statechart::event_base const&, void const*)+0x6a) [0x6e418a]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_even
t(boost::statechart::event_base const&)+0x6b) [0x6c764b]
 7: (PG::handle_activate_map(PG::RecoveryCtx*)+0x120) [0x694e30]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<TrackedIntPtr<PG>, std::less<TrackedIntPtr<PG> >, std::allocator<TrackedIntPtr<PG> > >*)+0
x71f) [0x5e3b3f]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x173) [0x5e4163]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642447]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3ead9]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xb3fab0]
 13: (()+0x7dc5) [0xc7e5dc5]
 14: (clone()+0x6d) [0xd72c76d]

Actions

Copy link

#11

Updated by Kefu Chai almost 7 years ago

seems PG::recovery_queued is reset somehow after being set in PG::queue_recovery(), but the PG is not removed from the OSDService::awaiting_throttle by PrimaryLogPG::on_change().

Actions

Copy link

#12

Updated by Kefu Chai almost 7 years ago


  -308> 2017-07-03 14:59:39.886566 35e4d700 20 osd.5 182  kicking pg 1.e
  -307> 2017-07-03 14:59:39.886610 35e4d700 30 osd.5 pg_epoch: 182 pg[1.e( v 168'4 (0'0,168'4] local-lis/les=169/171 n=0 ec=29/9 lis/c 169/169 les/c/f 171/171/0 169/169/169) [5] r=
0 lpr=169 rq=1 crt=168'4 lcod 168'3 mlcod 0'0 active+undersized+degraded] lock
...
   -98> 2017-07-03 14:59:42.057672 35e4d700  0 osd.5 pg_epoch: 182 pg[1.e(unlocked)]    dump_live_ids: 1.e live ids:
   -97> 2017-07-03 14:59:42.057820 35e4d700  0 osd.5 pg_epoch: 182 pg[1.e(unlocked)]            id: 125, ceph version 12.0.3-2487-g6dd471e (6dd471ec0c36b20ed3211316f796603625cb6c29) luminous (rc)
 1: (PG::get_with_id()+0x8d) [0x660dcd]
 2: (PG::queue_recovery(bool)+0x36e) [0x661a1e]
 3: (PG::RecoveryState::Active::react(PG::ActMap const&)+0x5a3) [0x688043]
 4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x24c) [0x6e7c8c]
 5: (boost::statechart::simple_state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x6a) [0x6e444a]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x6c763b]
 7: (PG::handle_activate_map(PG::RecoveryCtx*)+0x120) [0x694e50]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<TrackedIntPtr<PG>, std::less<TrackedIntPtr<PG> >, std::allocator<TrackedIntPtr<PG> > >*)+0x71f) [0x5e3b2f]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x173) [0x5e4153]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642437]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3ecc9]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xb3fca0]
 13: (()+0x7dc5) [0xc5c7dc5]
 14: (clone()+0x6d) [0xd50e73d]

see https://github.com/ceph/ceph-ci/commit/6dd471ec0c36b20ed3211316f796603625cb6c29

see /a/kchai-2017-07-03_14:30:36-rados-wip-20432-kefu-distro-basic-mira/1356142/remote/*/log/ceph-osd.5.log.gz

so pg 1.e was still in the "awaiting_throttle" queue when it was being kicked. so prbly, we should osd->clear_queued_recovery(this) also if (recovery_queued) in PrimaryLogPG::on_shutdown()

Actions

Copy link

#13

Updated by Kefu Chai almost 7 years ago

Status changed from 12 to Fix Under Review
Component(RADOS) OSD added

https://github.com/ceph/ceph/pull/16093

Actions

Copy link

#14

Updated by Josh Durgin almost 7 years ago

Status changed from Fix Under Review to Resolved

Actions

Copy link

#15

Updated by Kefu Chai almost 7 years ago

Status changed from Resolved to 12

/a/kchai-2017-07-05_04:38:56-rados-wip-kefu-testing2-distro-basic-mira/1363113

    -3> 2017-07-05 06:35:34.518907 34899700 20 osd.4 157  kicking pg 1.1
    -2> 2017-07-05 06:35:34.521959 34899700 30 osd.4 pg_epoch: 157 pg[1.1( empty local-lis/les=155/156 n=0 ec=9/9 lis/c 155/141 les/c/f 156/142/0 155/155/155) [4] r=0 lpr=155 pi=[1
41,155)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] lock
    -1> 2017-07-05 06:35:34.523914 34899700 -1 osd.4 157 pgid 1.1 has ref count of 2
     0> 2017-07-05 06:35:34.626884 34899700 -1 *** Caught signal (Aborted) **
 in thread 34899700 thread_name:signal_handler

Actions

Copy link

#16

Updated by Kefu Chai almost 7 years ago

    -5> 2017-07-05 11:03:17.052787 31e89700 -1 osd.5 143 pgid 1.c has ref count of 2
    -4> 2017-07-05 11:03:17.054266 31e89700  0 osd.5 pg_epoch: 143 pg[1.c( empty local-lis/les=137/139 n=0 ec=65/9 lis/c 137/107 les/c/f 139/110/0 137/137/137) [5] r=0 lpr=137 pi=[
107,137)/1 crt=0'0 mlcod 0'0 active+undersized+degraded]    dump_live_ids: 1.c live ids:
    -3> 2017-07-05 11:03:17.054571 31e89700  0 osd.5 pg_epoch: 143 pg[1.c( empty local-lis/les=137/139 n=0 ec=65/9 lis/c 137/107 les/c/f 139/110/0 137/137/137) [5] r=0 lpr=137 pi=[
107,137)/1 crt=0'0 mlcod 0'0 active+undersized+degraded]            id: 35, ceph version 12.0.3-2523-g7bee03d (7bee03de494aa52dc7fbb37ba9a140ec7f84d3a9) luminous (rc)
 1: (PG::get_with_id()+0x8d) [0x6616ad]
 2: (PG::finish_recovery(std::list<Context*, std::allocator<Context*> >&)+0x15e) [0x662dfe]
 3: (PG::RecoveryState::Clean::Clean(boost::statechart::state<PG::RecoveryState::Clean, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na
, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history
_mode)0>::my_context)+0x185) [0x6862b5]
 4: (boost::statechart::simple_state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
 mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(
boost::statechart::event_base const&, void const*)+0x18e) [0x6ea16e]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_even
t(boost::statechart::event_base const&)+0x6b) [0x6c7f5b]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_q
ueued_events()+0x91) [0x6c80b1]
 7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1fe) [0x693b0e]
 8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22a) [0x5e4b5a]
 9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x642d47]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0xb3fcb9]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xb40c90]
 12: (()+0x7dc5) [0xc5c8dc5]
 13: (clone()+0x6d) [0xd50f73d]

like #20432-9, the on_safe callbacks are not finished.

Actions

Copy link

#17

Updated by Kefu Chai almost 7 years ago

at the end of OSD::process_peering_events(), dispatch_context(rctx, 0, curmap, &handle) is called, which just deletes on_safe because pg is always nullptr here. so the contexts in on_safe are never finished.

 -1974> 2017-07-06 04:14:47.686471 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,
128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] needs_backfill does not need backfill
 -1973> 2017-07-06 04:14:47.686909 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] activate all replicas clean, no recovery
..
 -1950> 2017-07-06 04:14:47.696257 2a098700 10 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] handle_peering_event: epoch_sent: 129 epoch_requested: 129 AllReplicasRecovered
 ...

 -1941> 2017-07-06 04:14:47.700946 2a098700  5 osd.0 pg_epoch: 129 pg[1.4( empty local-lis/les=128/129 n=0 ec=9/9 lis/c 128/76 les/c/f 129/80/0 128/128/128) [0] r=0 lpr=128 pi=[76,128)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] enter Started/Primary/Active/Clean

but why the PG entered "Clean" even it had only one active OSD?

/a/kchai-2017-07-06_03:48:26-rados-wip-20432-kefu-distro-basic-mira/1365970/remote/*/log/ceph-osd.0.log.gz

Actions

Copy link

#18

Updated by Kefu Chai almost 7 years ago

Status changed from 12 to Fix Under Review

https://github.com/ceph/ceph/pull/16159

Actions

Copy link

#19

Updated by Kefu Chai almost 7 years ago

 -1216> 2017-07-06 12:46:55.396103 2c80e700 20 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi
=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded] get_with_id: 1.39 got id 77 (new) ref==11
...
  -551> 2017-07-06 12:46:55.640280 3ade1700 20 osd.5 672  kicking pg 1.39
...
    -5> 2017-07-06 12:46:57.861134 3ade1700 -1 osd.5 672 pgid 1.39 has ref count of 2
    -4> 2017-07-06 12:46:57.862444 3ade1700  0 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi
=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded]  dump_live_ids: 1.39 live ids:
    -3> 2017-07-06 12:46:57.862792 3ade1700  0 osd.5 pg_epoch: 672 pg[1.39( empty local-lis/les=663/666 n=0 ec=230/9 lis/c 663/580 les/c/f 666/582/0 663/663/663) [5] r=0 lpr=663 pi
=[580,663)/1 crt=0'0 mlcod 0'0 active+undersized+degraded]          id: 77, ceph version 12.0.3-2567-gcfc638c (cfc638c70816b69357715ba31c5689c9ab862807) luminous (rc)
 1: (PG::get_with_id()+0x8d) [0x66198d]
 2: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x619) [0x5da209]
 3: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xf15) [0x5fa195]
 4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8e9) [0xb3eff9]
 5: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb41180]
 6: (()+0x7dc5) [0xc5d7dc5]
 7: (clone()+0x6d) [0xd51e76d]

OSDService::start_shutdown() does call recovery_sleep_timer.shutdown(), which in turn cancels and deletes all events callbacks. in we should be okay, but i will print more log to find out why (or if) the timer failed to clear the recovery_requeue_callback callback.

/a/kchai-2017-07-06_10:41:04-rados-wip-20432-kefu-distro-basic-smithi/1367534/remote/*/log/ceph-osd.5.log.gz

Actions

Copy link

#20