Bug #24866 (closed)

FAILED assert(0 == "past_interval start interval mismatch") in check_past_interval_bounds()

Added by Neha Ojha almost 6 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-06-26T18:42:05.440 INFO:tasks.ceph.osd.0.smithi198.stderr:/build/ceph-14.0.0-834-gf206106/src/osd/PG.cc: In function 'void PG::check_past_interval_bounds() const' thread 7fc1664b0700 time 2018-06-26 18:42:05.448270
2018-06-26T18:42:05.440 INFO:tasks.ceph.osd.0.smithi198.stderr:/build/ceph-14.0.0-834-gf206106/src/osd/PG.cc: 932: FAILED assert(0 == "past_interval start interval mismatch")
2018-06-26T18:42:05.441 INFO:tasks.ceph.osd.0.smithi198.stderr:2018-06-26 18:42:05.446 7fc1664b0700 -1 log_channel(cluster) log [ERR] : 5.9 past_intervals [104,132) start interval does not contain the required bound [71,132) start
2018-06-26T18:42:05.441 INFO:tasks.ceph.osd.0.smithi198.stderr:2018-06-26 18:42:05.446 7fc1664b0700 -1 osd.0 pg_epoch: 132 pg[5.9( v 103'293 (0'0,103'293] local-lis/les=69/70 n=34 ec=64/64 lis/c 69/69 les/c/f 70/71/0 132/132/132) [3,5] r=-1 lpr=132 pi=[104,132)/1 crt=103'293 lcod 0'0 unknown NOTIFY mbc={}] 5.9 past_intervals [104,132) start interval does not contain the required bound [71,132) start
2018-06-26T18:42:05.442 INFO:tasks.rados.rados.0.smithi098.stdout:2207:  writing smithi09812089-376 from 571834 to 1325648 tid 1
2018-06-26T18:42:05.443 INFO:tasks.ceph.osd.0.smithi198.stderr: ceph version 14.0.0-834-gf206106 (f206106a05c4e7b9136f6e05d7a71e8d398b7d35) nautilus (dev)
2018-06-26T18:42:05.443 INFO:tasks.ceph.osd.0.smithi198.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fc18ffabaee]
2018-06-26T18:42:05.444 INFO:tasks.ceph.osd.0.smithi198.stderr: 2: (()+0x270c77) [0x7fc18ffabc77]
2018-06-26T18:42:05.444 INFO:tasks.ceph.osd.0.smithi198.stderr: 3: (PG::check_past_interval_bounds() const+0xac3) [0x557ae1c8a7a3]
2018-06-26T18:42:05.444 INFO:tasks.ceph.osd.0.smithi198.stderr: 4: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1bf) [0x557ae1cc0e4f]
2018-06-26T18:42:05.444 INFO:tasks.ceph.osd.0.smithi198.stderr: 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x140) [0x557ae1d0a180]
2018-06-26T18:42:05.445 INFO:tasks.ceph.osd.0.smithi198.stderr: 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xb3) [0x557ae1ce33a3]
2018-06-26T18:42:05.445 INFO:tasks.ceph.osd.0.smithi198.stderr: 7: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x283) [0x557ae1ca2c23]
2018-06-26T18:42:05.446 INFO:tasks.ceph.osd.0.smithi198.stderr: 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x29d) [0x557ae1c0962d]
2018-06-26T18:42:05.446 INFO:tasks.ceph.osd.0.smithi198.stderr: 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1a1) [0x557ae1c09cf1]
2018-06-26T18:42:05.446 INFO:tasks.ceph.osd.0.smithi198.stderr: 10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x557ae1e75870]
2018-06-26T18:42:05.446 INFO:tasks.ceph.osd.0.smithi198.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x598) [0x557ae1c19f38]
2018-06-26T18:42:05.447 INFO:tasks.ceph.osd.0.smithi198.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x46e) [0x7fc18ffb0a0e]
2018-06-26T18:42:05.447 INFO:tasks.ceph.osd.0.smithi198.stderr: 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fc18ffb2a90]
2018-06-26T18:42:05.447 INFO:tasks.ceph.osd.0.smithi198.stderr: 14: (()+0x76db) [0x7fc18e6a16db]
2018-06-26T18:42:05.447 INFO:tasks.ceph.osd.0.smithi198.stderr: 15: (clone()+0x3f) [0x7fc18d66588f]
2018-06-26T18:42:05.448 INFO:tasks.ceph.osd.0.smithi198.stderr:*** Caught signal (Aborted) **
2018-06-26T18:42:05.448 INFO:tasks.ceph.osd.0.smithi198.stderr: in thread 7fc1664b0700 thread_name:tp_osd_tp

/a/sage-2018-06-26_15:58:10-rados-wip-sage3-testing-2018-06-26-0616-distro-basic-smithi/2706875
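
For context, here is a minimal sketch of the invariant the assertion enforces. This is not Ceph's implementation; the types and function below are hypothetical and exist only to illustrate the error message: the recorded past_intervals must reach back at least as far as the start of the required bound, which runs roughly from the PG's last clean epoch up to the start of the current interval. In the log above the required bound is [71,132) but the recorded bound is [104,132), so the check fails.

// Hedged sketch, not Ceph's implementation: hypothetical types used only to
// illustrate the invariant behind "past_interval start interval mismatch".
#include <cassert>
#include <cstdint>

using epoch_t = uint32_t;

struct bounds_t {
  epoch_t start;  // inclusive epoch
  epoch_t end;    // exclusive epoch
};

// The recorded past_intervals must reach back at least as far as the required
// bound, i.e. recorded.start <= required.start (both share the same end).
// With the values from the log, required = [71,132) but recorded = [104,132),
// so the check fails and the OSD aborts.
void check_past_interval_bounds_sketch(const bounds_t& required,
                                       const bounds_t& recorded) {
  assert(recorded.start <= required.start &&
         "past_interval start interval mismatch");
}

int main() {
  check_past_interval_bounds_sketch({71, 132}, {104, 132});  // would abort
}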


Files

computestore17.log.zip (96.1 KB), Dexter John Genterone, 08/07/2018 06:21 PM

Related issues: 1 (0 open, 1 closed)

Has duplicate RADOS - Bug #25017: log [ERR] : 1.3 past_intervals [182,196) start interval does not contain the required bound [162,196) start (Duplicate, 07/20/2018)

Actions #1

Updated by Neha Ojha almost 6 years ago

  • Status changed from New to 12
  • Priority changed from Normal to High

/a/nojha-2018-07-06_23:31:26-rados-wip-23979-2018-07-06-distro-basic-smithi/2744661/

Actions #2

Updated by Josh Durgin over 5 years ago

  • Has duplicate Bug #25017: log [ERR] : 1.3 past_intervals [182,196) start interval does not contain the required bound [162,196) start added
Actions #3

Updated by Dexter John Genterone over 5 years ago

We encountered this issue after trying out a patch for https://tracker.ceph.com/issues/21142.

Is it safe to bypass check_past_interval_bounds()?

Actions #4

Updated by Dexter John Genterone over 5 years ago

Attaching OSD log.

Actions #5

Updated by Kuba StaƄczak over 5 years ago

Guys, is there a way for an OSD to recover from this error?

Actions #6

Updated by Sage Weil over 5 years ago

  • Priority changed from High to Urgent

Generally yes, but I haven't been able to reproduce it to test a solution. I take it this has happened to you?

I'm hoping to get logs from Dexter so I can understand the root cause. My (limited) understanding is that it's a combination of a nearfull/full cluster/pool and the addition of OSDs that leads to the problem.

Actions #7

Updated by Neha Ojha over 5 years ago

  • Status changed from 12 to Need More Info
Actions #8

Updated by Sage Weil over 5 years ago

  • Status changed from Need More Info to Resolved
Actions #9

Updated by Neha Ojha over 5 years ago

Seeing this on master.

2018-10-17T00:04:44.239 INFO:tasks.ceph.osd.1.smithi203.stderr:2018-10-17 00:04:44.267 7f1fb6327700 -1 log_channel(cluster) log [ERR] : 2.4s2 past_intervals [977,1089) start interval does not contain the required bound [906,1089) start
2018-10-17T00:04:44.239 INFO:tasks.ceph.osd.1.smithi203.stderr:2018-10-17 00:04:44.267 7f1fb6327700 -1 osd.1 pg_epoch: 1089 pg[2.4s2( v 845'142 (845'142,845'142] lb MIN (NIBBLEWISE) local-lis/les=0/906 n=5 ec=665/19 lis/c 977/977 les/c/f 906/906/0 882/1089/1087) [5,6,2]p6(1) r=-1 lpr=1089 pi=[977,1089)/2 crt=845'142 lcod 0'0 unknown NOTIFY mbc={}] 2.4s2 past_intervals [977,1089) start interval does not contain the required bound [906,1089) start
2018-10-17T00:04:44.240 INFO:tasks.ceph.osd.0.smithi203.stderr:2018-10-17 00:04:44.268 7fe4917ff700 -1 log_channel(cluster) log [ERR] : 2.4s0 past_intervals [977,1089) start interval does not contain the required bound [906,1089) start
2018-10-17T00:04:44.241 INFO:tasks.ceph.osd.0.smithi203.stderr:2018-10-17 00:04:44.268 7fe4917ff700 -1 osd.0 pg_epoch: 1089 pg[2.4s0( v 845'142 (845'142,845'142] lb MIN (NIBBLEWISE) local-lis/les=0/906 n=5 ec=665/19 lis/c 977/977 les/c/f 906/906/0 882/1089/1087) [5,6,2]p6(1) r=-1 lpr=1089 pi=[977,1089)/2 crt=845'142 lcod 0'0 unknown NOTIFY mbc={}] 2.4s0 past_intervals [977,1089) start interval does not contain the required bound [906,1089) start
2018-10-17T00:04:44.241 INFO:tasks.ceph.osd.1.smithi203.stderr:past_interval start interval mismatch ceph version 14.0.0-4233-g8e2ed77 (8e2ed7796c88aec03322f396b75446e4435088e0) nautilus (dev)
2018-10-17T00:04:44.242 INFO:tasks.ceph.osd.1.smithi203.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x82) [0x5616bead7004]
2018-10-17T00:04:44.242 INFO:tasks.ceph.osd.1.smithi203.stderr: 2: (PG::check_past_interval_bounds() const+0xbf1) [0x5616bec9cd31]
2018-10-17T00:04:44.242 INFO:tasks.ceph.osd.1.smithi203.stderr: 3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1e0) [0x5616becc4780]
2018-10-17T00:04:44.242 INFO:tasks.ceph.osd.1.smithi203.stderr: 4: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x75) [0x5616bed165e5]
2018-10-17T00:04:44.243 INFO:tasks.ceph.osd.1.smithi203.stderr: 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x96) [0x5616becf8c36]
2018-10-17T00:04:44.243 INFO:tasks.ceph.osd.1.smithi203.stderr: 6: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x2a8) [0x5616becc3f58]
2018-10-17T00:04:44.243 INFO:tasks.ceph.osd.1.smithi203.stderr: 7: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2d1) [0x5616bec22b11]
2018-10-17T00:04:44.243 INFO:tasks.ceph.osd.1.smithi203.stderr: 8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x9e) [0x5616bec2446e]
2018-10-17T00:04:44.244 INFO:tasks.ceph.osd.1.smithi203.stderr: 9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x5616bee8bc70]
2018-10-17T00:04:44.244 INFO:tasks.ceph.osd.1.smithi203.stderr: 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaec) [0x5616bec1924c]
2018-10-17T00:04:44.244 INFO:tasks.ceph.osd.1.smithi203.stderr: 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3e6) [0x5616bf234766]
2018-10-17T00:04:44.244 INFO:tasks.ceph.osd.1.smithi203.stderr: 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5616bf23bf00]
2018-10-17T00:04:44.245 INFO:tasks.ceph.osd.1.smithi203.stderr: 13: (()+0x7dd5) [0x7f1fd9efadd5]
2018-10-17T00:04:44.245 INFO:tasks.ceph.osd.1.smithi203.stderr: 14: (clone()+0x6d) [0x7f1fd8dc3b3d]

/a/nojha-2018-10-16_22:54:08-rados-master-distro-basic-smithi/3149064/

Actions #10

Updated by Josh Durgin about 5 years ago

  • Status changed from Resolved to 12

Looks like this is still possible to hit:

http://pulpito.ceph.com/nojha-2019-02-08_05:52:40-rados-wip-36739-2019-02-07-distro-basic-smithi/3564135/

  -325> 2019-02-08 09:14:54.561 7fe906ee1700 -1 log_channel(cluster) log [ERR] : 2.d past_intervals [595,597) start interval does not contain the required bound [594,597) start
  -324> 2019-02-08 09:14:54.561 7fe906ee1700 -1 osd.5 pg_epoch: 597 pg[2.d( v 582'947 (582'947,582'947] lb MIN (NIBBLEWISE) local-lis/les=0/594 n=27 ec=521/17 lis/c 595/595 les/c/f 594/594/0 591/597/595) [4,1]/[4,1,2] r=-1 lpr=597 pi=[595,597)/1 crt=582'947 lcod 0'0 remapped NOTIFY mbc={}] 2.d past_intervals [595,597) start interval does not contain the required bound [594,597) start
Actions #11

Updated by Sage Weil about 5 years ago

/a/sage-2019-02-13_00:42:53-rados-wip-sage2-testing-2019-02-12-1700-distro-basic-smithi/3581667

description: rados/thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/{default.yaml} backoff/peering_and_degraded.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/crush-compat.yaml msgr-failures/few.yaml msgr/random.yaml objectstore/bluestore-comp.yaml rados.yaml rocksdb.yaml supported-random-distro$/{ubuntu_latest.yaml} thrashers/careful.yaml thrashosds-health.yaml workloads/admin_socket_objecter_requests.yaml}

Actions #12

Updated by Sage Weil about 5 years ago

  • Status changed from 12 to Fix Under Review
Actions #13

Updated by Sage Weil about 5 years ago

    -4> 2019-02-13 02:03:10.110 7f65c7fd9700 10 osd.1 pg_epoch: 465 pg[2.6( v 452'31657 (452'31657,452'31657] lb MIN (NIBBLEWISE) local-lis/les=0/459 n=23 ec=335/19 lis/c 460/460 les/c/f 459/459/0 456/465/462) [4,0] r=-1 lpr=465 pi=[460,465)/2 crt=452'31657 lcod 0'0 unknown NOTIFY mbc={}] check_recovery_sources no source osds () went down
    -3> 2019-02-13 02:03:10.110 7f65c7fd9700 -1 log_channel(cluster) log [ERR] : 2.6 past_intervals [460,465) start interval does not contain the required bound [459,465) start

Note that last_epoch_clean is 459, but the recorded past interval starts at 460.

Root cause: the source pg 2.16 went clean in 460, but the target went clean in 459, and the former is what was reported to the mon.

Commit b79442efce479fde7f8bda8fdadf86e91003a327 tried to fix this, but it compared last_epoch_clean before it was assigned.
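
To make the failure mode concrete, here is a hedged sketch (hypothetical types and function, not the actual Ceph fix) of the kind of merge that has to take the minimum last_epoch_clean across source and target once both values are actually known; comparing before the value is assigned, as the earlier attempt did, keeps the newer epoch and lets the mon trim past intervals the target still needs.

// Hedged sketch, not the actual Ceph fix: hypothetical history struct showing
// why the comparison must happen after both clean epochs are known.
#include <algorithm>
#include <cstdint>

using epoch_t = uint32_t;

struct pg_history_sketch {
  epoch_t last_epoch_clean = 0;
};

// When a split/merge source's history is folded into the target, report the
// older (smaller) clean epoch so the mon never allows trimming of past
// intervals that either PG still requires. In the case above the source was
// clean at 460 and the target at 459, so 459 must win.
void merge_clean_epoch(pg_history_sketch& target,
                       const pg_history_sketch& source) {
  target.last_epoch_clean =
      std::min(target.last_epoch_clean, source.last_epoch_clean);
}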

Actions #14

Updated by Sage Weil about 5 years ago

  • Status changed from Fix Under Review to Resolved
