Bug #24866
Status: closed
FAILED assert(0 == "past_interval start interval mismatch") in check_past_interval_bounds()
Description
2018-06-26T18:42:05.440 INFO:tasks.ceph.osd.0.smithi198.stderr:/build/ceph-14.0.0-834-gf206106/src/osd/PG.cc: In function 'void PG::check_past_interval_bounds() const' thread 7fc1664b0700 time 2018-06-26 18:42:05.448270
2018-06-26T18:42:05.440 INFO:tasks.ceph.osd.0.smithi198.stderr:/build/ceph-14.0.0-834-gf206106/src/osd/PG.cc: 932: FAILED assert(0 == "past_interval start interval mismatch")
2018-06-26T18:42:05.441 INFO:tasks.ceph.osd.0.smithi198.stderr:2018-06-26 18:42:05.446 7fc1664b0700 -1 log_channel(cluster) log [ERR] : 5.9 past_intervals [104,132) start interval does not contain the required bound [71,132) start
2018-06-26T18:42:05.441 INFO:tasks.ceph.osd.0.smithi198.stderr:2018-06-26 18:42:05.446 7fc1664b0700 -1 osd.0 pg_epoch: 132 pg[5.9( v 103'293 (0'0,103'293] local-lis/les=69/70 n=34 ec=64/64 lis/c 69/69 les/c/f 70/71/0 132/132/132) [3,5] r=-1 lpr=132 pi=[104,132)/1 crt=103'293 lcod 0'0 unknown NOTIFY mbc={}] 5.9 past_intervals [104,132) start interval does not contain the required bound [71,132) start
2018-06-26T18:42:05.442 INFO:tasks.rados.rados.0.smithi098.stdout:2207: writing smithi09812089-376 from 571834 to 1325648 tid 1
2018-06-26T18:42:05.443 INFO:tasks.ceph.osd.0.smithi198.stderr: ceph version 14.0.0-834-gf206106 (f206106a05c4e7b9136f6e05d7a71e8d398b7d35) nautilus (dev)
2018-06-26T18:42:05.443 INFO:tasks.ceph.osd.0.smithi198.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fc18ffabaee]
2018-06-26T18:42:05.444 INFO:tasks.ceph.osd.0.smithi198.stderr: 2: (()+0x270c77) [0x7fc18ffabc77]
2018-06-26T18:42:05.444 INFO:tasks.ceph.osd.0.smithi198.stderr: 3: (PG::check_past_interval_bounds() const+0xac3) [0x557ae1c8a7a3]
2018-06-26T18:42:05.444 INFO:tasks.ceph.osd.0.smithi198.stderr: 4: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1bf) [0x557ae1cc0e4f]
2018-06-26T18:42:05.444 INFO:tasks.ceph.osd.0.smithi198.stderr: 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x140) [0x557ae1d0a180]
2018-06-26T18:42:05.445 INFO:tasks.ceph.osd.0.smithi198.stderr: 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xb3) [0x557ae1ce33a3]
2018-06-26T18:42:05.445 INFO:tasks.ceph.osd.0.smithi198.stderr: 7: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x283) [0x557ae1ca2c23]
2018-06-26T18:42:05.446 INFO:tasks.ceph.osd.0.smithi198.stderr: 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x29d) [0x557ae1c0962d]
2018-06-26T18:42:05.446 INFO:tasks.ceph.osd.0.smithi198.stderr: 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1a1) [0x557ae1c09cf1]
2018-06-26T18:42:05.446 INFO:tasks.ceph.osd.0.smithi198.stderr: 10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x557ae1e75870]
2018-06-26T18:42:05.446 INFO:tasks.ceph.osd.0.smithi198.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x598) [0x557ae1c19f38]
2018-06-26T18:42:05.447 INFO:tasks.ceph.osd.0.smithi198.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x46e) [0x7fc18ffb0a0e]
2018-06-26T18:42:05.447 INFO:tasks.ceph.osd.0.smithi198.stderr: 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fc18ffb2a90]
2018-06-26T18:42:05.447 INFO:tasks.ceph.osd.0.smithi198.stderr: 14: (()+0x76db) [0x7fc18e6a16db]
2018-06-26T18:42:05.447 INFO:tasks.ceph.osd.0.smithi198.stderr: 15: (clone()+0x3f) [0x7fc18d66588f]
2018-06-26T18:42:05.448 INFO:tasks.ceph.osd.0.smithi198.stderr:*** Caught signal (Aborted) **
2018-06-26T18:42:05.448 INFO:tasks.ceph.osd.0.smithi198.stderr: in thread 7fc1664b0700 thread_name:tp_osd_tp
/a/sage-2018-06-26_15:58:10-rados-wip-sage3-testing-2018-06-26-0616-distro-basic-smithi/2706875
Updated by Neha Ojha almost 6 years ago
- Status changed from New to Need More Info
- Priority changed from Normal to High
/a/nojha-2018-07-06_23:31:26-rados-wip-23979-2018-07-06-distro-basic-smithi/2744661/
Updated by Josh Durgin over 5 years ago
- Has duplicate Bug #25017: log [ERR] : 1.3 past_intervals [182,196) start interval does not contain the required bound [162,196) start added
Updated by Dexter John Genterone over 5 years ago
We encountered this issue after trying out a patch for https://tracker.ceph.com/issues/21142.
Is it safe to bypass check_past_interval_bounds() ?
Updated by Dexter John Genterone over 5 years ago
- File computestore17.log.zip computestore17.log.zip added
attaching OSD log.
Updated by Kuba Stańczak over 5 years ago
Guys, is there a way for an OSD to recover from this error?
Updated by Sage Weil over 5 years ago
- Priority changed from High to Urgent
Generally yes, but I haven't been able to reproduce it to test a solution. I take it this has happened to you?
I'm hoping to get logs from Dexter so I can understand the root cause. My (limited) understanding is that it is a combination of a nearfull/full cluster or pool and the addition of OSDs that leads to the problem.
Updated by Sage Weil over 5 years ago
- Status changed from Need More Info to Resolved
resolved by https://github.com/ceph/ceph/pull/24064
Updated by Neha Ojha over 5 years ago
Seeing this on master.
2018-10-17T00:04:44.239 INFO:tasks.ceph.osd.1.smithi203.stderr:2018-10-17 00:04:44.267 7f1fb6327700 -1 log_channel(cluster) log [ERR] : 2.4s2 past_intervals [977,1089) start interval does not contain the required bound [906,1089) start
2018-10-17T00:04:44.239 INFO:tasks.ceph.osd.1.smithi203.stderr:2018-10-17 00:04:44.267 7f1fb6327700 -1 osd.1 pg_epoch: 1089 pg[2.4s2( v 845'142 (845'142,845'142] lb MIN (NIBBLEWISE) local-lis/les=0/906 n=5 ec=665/19 lis/c 977/977 les/c/f 906/906/0 882/1089/1087) [5,6,2]p6(1) r=-1 lpr=1089 pi=[977,1089)/2 crt=845'142 lcod 0'0 unknown NOTIFY mbc={}] 2.4s2 past_intervals [977,1089) start interval does not contain the required bound [906,1089) start
2018-10-17T00:04:44.240 INFO:tasks.ceph.osd.0.smithi203.stderr:2018-10-17 00:04:44.268 7fe4917ff700 -1 log_channel(cluster) log [ERR] : 2.4s0 past_intervals [977,1089) start interval does not contain the required bound [906,1089) start
2018-10-17T00:04:44.241 INFO:tasks.ceph.osd.0.smithi203.stderr:2018-10-17 00:04:44.268 7fe4917ff700 -1 osd.0 pg_epoch: 1089 pg[2.4s0( v 845'142 (845'142,845'142] lb MIN (NIBBLEWISE) local-lis/les=0/906 n=5 ec=665/19 lis/c 977/977 les/c/f 906/906/0 882/1089/1087) [5,6,2]p6(1) r=-1 lpr=1089 pi=[977,1089)/2 crt=845'142 lcod 0'0 unknown NOTIFY mbc={}] 2.4s0 past_intervals [977,1089) start interval does not contain the required bound [906,1089) start
2018-10-17T00:04:44.241 INFO:tasks.ceph.osd.1.smithi203.stderr:past_interval start interval mismatch ceph version 14.0.0-4233-g8e2ed77 (8e2ed7796c88aec03322f396b75446e4435088e0) nautilus (dev)
2018-10-17T00:04:44.242 INFO:tasks.ceph.osd.1.smithi203.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x82) [0x5616bead7004]
2018-10-17T00:04:44.242 INFO:tasks.ceph.osd.1.smithi203.stderr: 2: (PG::check_past_interval_bounds() const+0xbf1) [0x5616bec9cd31]
2018-10-17T00:04:44.242 INFO:tasks.ceph.osd.1.smithi203.stderr: 3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1e0) [0x5616becc4780]
2018-10-17T00:04:44.242 INFO:tasks.ceph.osd.1.smithi203.stderr: 4: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x75) [0x5616bed165e5]
2018-10-17T00:04:44.243 INFO:tasks.ceph.osd.1.smithi203.stderr: 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x96) [0x5616becf8c36]
2018-10-17T00:04:44.243 INFO:tasks.ceph.osd.1.smithi203.stderr: 6: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x2a8) [0x5616becc3f58]
2018-10-17T00:04:44.243 INFO:tasks.ceph.osd.1.smithi203.stderr: 7: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2d1) [0x5616bec22b11]
2018-10-17T00:04:44.243 INFO:tasks.ceph.osd.1.smithi203.stderr: 8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x9e) [0x5616bec2446e]
2018-10-17T00:04:44.244 INFO:tasks.ceph.osd.1.smithi203.stderr: 9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x5616bee8bc70]
2018-10-17T00:04:44.244 INFO:tasks.ceph.osd.1.smithi203.stderr: 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaec) [0x5616bec1924c]
2018-10-17T00:04:44.244 INFO:tasks.ceph.osd.1.smithi203.stderr: 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3e6) [0x5616bf234766]
2018-10-17T00:04:44.244 INFO:tasks.ceph.osd.1.smithi203.stderr: 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5616bf23bf00]
2018-10-17T00:04:44.245 INFO:tasks.ceph.osd.1.smithi203.stderr: 13: (()+0x7dd5) [0x7f1fd9efadd5]
2018-10-17T00:04:44.245 INFO:tasks.ceph.osd.1.smithi203.stderr: 14: (clone()+0x6d) [0x7f1fd8dc3b3d]
/a/nojha-2018-10-16_22:54:08-rados-master-distro-basic-smithi/3149064/
Updated by Josh Durgin about 5 years ago
- Status changed from Resolved to Need More Info
Looks like this is still possible to hit:
-325> 2019-02-08 09:14:54.561 7fe906ee1700 -1 log_channel(cluster) log [ERR] : 2.d past_intervals [595,597) start interval does not contain the required bound [594,597) start
-324> 2019-02-08 09:14:54.561 7fe906ee1700 -1 osd.5 pg_epoch: 597 pg[2.d( v 582'947 (582'947,582'947] lb MIN (NIBBLEWISE) local-lis/les=0/594 n=27 ec=521/17 lis/c 595/595 les/c/f 594/594/0 591/597/595) [4,1]/[4,1,2] r=-1 lpr=597 pi=[595,597)/1 crt=582'947 lcod 0'0 remapped NOTIFY mbc={}] 2.d past_intervals [595,597) start interval does not contain the required bound [594,597) start
Updated by Sage Weil about 5 years ago
/a/sage-2019-02-13_00:42:53-rados-wip-sage2-testing-2019-02-12-1700-distro-basic-smithi/3581667
description: rados/thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/{default.yaml} backoff/peering_and_degraded.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/crush-compat.yaml msgr-failures/few.yaml msgr/random.yaml objectstore/bluestore-comp.yaml rados.yaml rocksdb.yaml supported-random-distro$/{ubuntu_latest.yaml} thrashers/careful.yaml thrashosds-health.yaml workloads/admin_socket_objecter_requests.yaml}
Updated by Sage Weil about 5 years ago
- Status changed from Need More Info to Fix Under Review
Updated by Sage Weil about 5 years ago
-4> 2019-02-13 02:03:10.110 7f65c7fd9700 10 osd.1 pg_epoch: 465 pg[2.6( v 452'31657 (452'31657,452'31657] lb MIN (NIBBLEWISE) local-lis/les=0/459 n=23 ec=335/19 lis/c 460/460 les/c/f 459/459/0 456/465/462) [4,0] r=-1 lpr=465 pi=[460,465)/2 crt=452'31657 lcod 0'0 unknown NOTIFY mbc={}] check_recovery_sources no source osds () went down
-3> 2019-02-13 02:03:10.110 7f65c7fd9700 -1 log_channel(cluster) log [ERR] : 2.6 past_intervals [460,465) start interval does not contain the required bound [459,465) start
Note that last_epoch_clean is 459, but the recorded interval starts at 460.
The root cause is that the source pg 2.16 went clean in epoch 460 while the target went clean in 459, and the former is what was reported to the mon.
b79442efce479fde7f8bda8fdadf86e91003a327 tried to fix this, but it compared last_epoch_clean before it was assigned.
Updated by Sage Weil about 5 years ago
- Status changed from Fix Under Review to Resolved