Bug #35955

Updated by Sage Weil over 5 years ago

<pre> 
 
  -4044> 2018-09-12 13:57:55.604 7fba7575c700 -1 osd.1 pg_epoch: 638 pg[2.4( v 18'8 (0'0,18'8] local-lis/les=482/483 n=5 ec=15/15 lis/c 482/482 les/c/f 483/483/0 628/628/628) [2,0] r=-1 lpr=637 pi=[584,628)/1 crt=18'8 lcod 0'0 unknown mbc={}] 2.4 past_intervals [584,628) start interval does not contain the required bound [483,628) start 
  -4043> 2018-09-12 13:57:55.607 7fba7575c700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-3159-g47aa991/rpm/el7/BUILD/ceph-14.0.0-3159-g47aa991/src/osd/PG.cc: In function 'void PG::check_past_interval_bounds() const' thread 7fba7575c700 time 2018-09-12 13:57:55.606349 
 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-3159-g47aa991/rpm/el7/BUILD/ceph-14.0.0-3159-g47aa991/src/osd/PG.cc: 932: abort() 

  ceph version 14.0.0-3159-g47aa991 (47aa99112f9268c11a435dca151002cf33e5e98f) nautilus (dev) 
  1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x82) [0x5630160a042e] 
  2: (PG::check_past_interval_bounds() const+0xa57) [0x563016253917] 
  3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1b2) [0x56301627fa02] 
  4: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x75) [0x5630162c2835] 
  5: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x23d) [0x56301626bffd] 
  6: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2d1) [0x5630161df011] 
  7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x9b) [0x5630161e05db] 
  8: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x56301642abb0] 
  9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x4fc) [0x5630161d349c] 
  10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d6) [0x563016759996] 
  11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56301675a590] 
  12: (()+0x7e25) [0x7fbaa0148e25] 
 </pre> 
 The PG was imported. It was originally exported from osd.5, where it was last seen in this state: 
 <pre> 
 2018-09-12 13:56:18.136960 7fb75d8b7700 10 osd.5 pg_epoch: 582 pg[2.4( v 18'8 (0'0,18'8] local-lis/les=482/483 n=5 ec=15/15 lis/c 482/482 les/c/f 483/483/0 482/482/407) [5,0] r=0 lpr=482 crt=18'8 mlcod 18'8 active+clean] handle_peering_event: epoch_sent: 582 epoch_requested: 582 NullEvt 
 </pre> 
 /a/sage-2018-09-11_22:11:25-rados-wip-sage-testing-2018-09-11-1316-distro-basic-smithi/3006960 

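 The problem appears to be that the past intervals added by ceph-objectstore-tool don't match the expected bounds, which are based on last_epoch_clean (483): the imported PG records past_intervals [584,628), while the check requires them to start no later than epoch 483.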
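
To make the mismatch concrete, here is a minimal sketch of the start-bound comparison, using the epochs from the crash log for pg 2.4. It is a simplified illustration, not the code in PG.cc; the function name `covers_required_start` and the tuple representation of an interval are assumptions made for this example.

```python
from typing import Tuple

# Half-open epoch range [start, end), as printed in the log.
Interval = Tuple[int, int]


def covers_required_start(actual: Interval, required: Interval) -> bool:
    """Return True if the recorded past intervals begin at or before
    the required start epoch (hypothetical helper, simplified check)."""
    return actual[0] <= required[0]


# Epochs taken from the crash log for pg 2.4.
actual = (584, 628)    # past_intervals recorded on the imported PG
required = (483, 628)  # required bound, starting at last_epoch_clean (483)

if not covers_required_start(actual, required):
    # Mirrors the wording of the logged error message.
    print(
        f"past_intervals [{actual[0]},{actual[1]}) start interval does not "
        f"contain the required bound [{required[0]},{required[1]}) start"
    )
```

With these values the recorded intervals start 101 epochs too late (584 vs. 483), so the comparison fails, which matches the abort seen in the backtrace.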
