Bug #35955
ceph-objectstore-tool past_intervals broken
Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2018-09-12 13:57:55.604 7fba7575c700 -1 osd.1 pg_epoch: 638 pg[2.4( v 18'8 (0'0,18'8] local-lis/les=482/483 n=5 ec=15/15 lis/c 482/482 les/c/f 483/483/0 628/628/628) [2,0] r=-1 lpr=637 pi=[584,628)/1 crt=18'8 lcod 0'0 unknown mbc={}] 2.4 past_intervals [584,628) start interval does not contain the required bound [483,628) start
2018-09-12 13:57:55.607 7fba7575c700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-3159-g47aa991/rpm/el7/BUILD/ceph-14.0.0-3159-g47aa991/src/osd/PG.cc: In function 'void PG::check_past_interval_bounds() const' thread 7fba7575c700 time 2018-09-12 13:57:55.606349
2018-09-12 13:57:55.607 7fba7575c700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-3159-g47aa991/rpm/el7/BUILD/ceph-14.0.0-3159-g47aa991/src/osd/PG.cc: In function 'void PG::check_past_interval_bounds() const' thread 7fba7575c700 time 2018-09-12 13:57:55.606349
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.0-3159-g47aa991/rpm/el7/BUILD/ceph-14.0.0-3159-g47aa991/src/osd/PG.cc: 932: abort()
ceph version 14.0.0-3159-g47aa991 (47aa99112f9268c11a435dca151002cf33e5e98f) nautilus (dev)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x82) [0x5630160a042e]
 2: (PG::check_past_interval_bounds() const+0xa57) [0x563016253917]
 3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1b2) [0x56301627fa02]
 4: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x75) [0x5630162c2835]
 5: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x23d) [0x56301626bffd]
 6: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2d1) [0x5630161df011]
 7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x9b) [0x5630161e05db]
 8: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x56301642abb0]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x4fc) [0x5630161d349c]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d6) [0x563016759996]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56301675a590]
 12: (()+0x7e25) [0x7fbaa0148e25]
The PG was imported; it was originally exported from osd.5, where it was last seen in this state:
2018-09-12 13:56:18.136960 7fb75d8b7700 10 osd.5 pg_epoch: 582 pg[2.4( v 18'8 (0'0,18'8] local-lis/les=482/483 n=5 ec=15/15 lis/c 482/482 les/c/f 483/483/0 482/482/407) [5,0] r=0 lpr=482 crt=18'8 mlcod 18'8 active+clean] handle_peering_event: epoch_sent: 582 epoch_requested: 582 NullEvt
/a/sage-2018-09-11_22:11:25-rados-wip-sage-testing-2018-09-11-1316-distro-basic-smithi/3006960
The problem appears to be that the past intervals added by ceph-objectstore-tool ([584,628) in the log above) don't match the expected bounds ([483,628)), whose start is based on last_epoch_clean (483).
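To make the mismatch concrete, here is a minimal Python sketch of the kind of bounds comparison that the abort message suggests. It is a simplified model and not the actual code in PG.cc: the function names `required_bounds` and `check_bounds`, and the `oldest_map` value of 1, are assumptions made for illustration. The epoch values (483, 584, 628, 15) are taken from the log above.

```python
from typing import Optional, Tuple

Bounds = Tuple[int, int]  # half-open epoch range [start, end)


def required_bounds(last_epoch_clean: int, epoch_pool_created: int,
                    same_interval_since: int, oldest_map: int) -> Bounds:
    """Hypothetical model of the bounds that past_intervals must cover."""
    # Start from last_epoch_clean (or pool creation if the PG was never clean),
    # but never earlier than the oldest OSDMap the OSD still has.
    start = max(last_epoch_clean or epoch_pool_created, oldest_map)
    return (start, same_interval_since)


def check_bounds(actual: Optional[Bounds], required: Bounds) -> Optional[str]:
    """Return an error string if the recorded intervals miss the required range."""
    req_start, req_end = required
    if req_start >= req_end:
        # Nothing is required, so nothing should be recorded.
        return None if actual is None else "past_intervals should be empty"
    if actual is None:
        return "past_intervals is empty but required bounds are not"
    act_start, act_end = actual
    if act_start > req_start:
        return "start interval does not contain the required bound"
    if act_end != req_end:
        return "end does not match required bound"
    return None


# Values from the log: les/c = 483, ec = 15, same_interval_since = 628.
# oldest_map = 1 is an assumed value, not taken from the log.
required = required_bounds(483, 15, 628, oldest_map=1)
print(required)                            # (483, 628)
print(check_bounds((584, 628), required))  # start interval does not contain the required bound
```

Under this model the imported PG fails because its recorded intervals begin at epoch 584, which is later than the required start of 483; intervals beginning at or before 483 and ending at 628 would pass.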
Updated by Sage Weil over 5 years ago
This is fixed for nautilus since the behavior totally changed with https://github.com/ceph/ceph/pull/23985. The problem may still exist in mimic, luminous, etc., but until we reproduce it there I'm not sure if it's the same bug or a regression that happened post-mimic.
Updated by David Zafman over 5 years ago
- Related to Bug #36412: ceph-objectstore-tool import after pg splits which will lost objects added