Project

General

Profile

Actions

Bug #49689

closed

osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") start

Added by Neha Ojha about 3 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):

06609aa10e3319971c8c6251223bafde6e0052349f3e9a5a03edb74f42e3fa2d
070400b67064b04e820f22c4c5c2d50166103b509fcd8578a829d884ab76d18f
0f658d607df04a7f8f53ddb61c044a0ed66469c4148b9c067310ca27088433a9
180423f0717e94fd75d87d7f5dec045b5cbb52edb27205abc5927281a515054a
328b7ead9e7684c4b8487ddfe9edb53e8eb873927dde0036a4324ac081c862dc
36c20a627e03678780b2fe3dc808af4d2b2567a472a4f61d8178c61d7184a0bb
3a50bb9444331ec2b94a68f898f8ab0692ec7500e2e335a31ba6491973c0a17b
48ab57a2c81536fcd4d9c8a7ae6b0d297421a5e916602ccf1c53aded5adf1b8a
7bdc9f715c000afbbc2b75000e4d5623bc2daf351addab915c48f83918c9f25e
80cd84b35982a53d03a84567bc86bf0bf4931d6bd74ada22d61dce2dc61703dd
8eaef1c8be5ac7dcdf10d5a4634f98473e426c2abd231e361a8165b7c5b3c43d
94e5824296c208a0e9854ed27e55e34cf4e42c747119e6bc67aad328b511242a
95d3f1ffec846b1fe432b371d1bb2f07c934d7a56d49a10e7f0d6b989d7e21c7
b8f4aedd70ff355796eba056d35e5df524c19461e54298a94f741f6f11d31d94
bea7cac302e857eee2324d2293899a2a015475abf2766cd9ad62e5d30f204468
c7613c18f4910e58190221e6cca3f422e45d6d916ca5af61b275f26ba8d86209
c9422518b50823c89955b70e6f03330014f3ef1cb421aafc5ea58c0d91d4f179
cb6a35bf8176df5e9719943cc2ecf2ecc568dcfca315f72a40690049bf04a13a
f8cd44714c95a9c477b7509be3ceffb9edc6773399b109719c87bb8679e1305e
fbce5293130f40a9e8ea5f6f5285333e5abf22d7ebfc626e364b89b9f74a130b


Description

2021-03-09T02:06:34.259 INFO:tasks.ceph.osd.1.smithi180.stderr:2021-03-09T02:06:34.258+0000 7fd53172a700 -1 log_channel(cluster) log [ERR] : 3.2s0 past_intervals [218,272) start interval does not contain the required bound [195,272) start
2021-03-09T02:06:34.260 INFO:tasks.ceph.osd.1.smithi180.stderr:2021-03-09T02:06:34.258+0000 7fd53172a700 -1 osd.1 pg_epoch: 272 pg[3.2s0( empty local-lis/les=0/0 n=0 ec=20/20 lis/c=44/44 les/c/f=45/45/0 sis=272) [1,0,3,6]p1(0) r=0 lpr=272 pi=[218,272)/1 crt=0'0 mlcod 0'0 peering mbc={}] 3.2s0 past_intervals [218,272) start interval does not contain the required bound [195,272) start
2021-03-09T02:06:34.263 INFO:tasks.ceph.osd.1.smithi180.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.1.0-627-g9e448db9/rpm/el8/BUILD/ceph-16.1.0-627-g9e448db9/src/osd/PeeringState.cc: In function 'void PeeringState::check_past_interval_bounds() const' thread 7fd53172a700 time 2021-03-09T02:06:34.259543+0000
2021-03-09T02:06:34.263 INFO:tasks.ceph.osd.1.smithi180.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.1.0-627-g9e448db9/rpm/el8/BUILD/ceph-16.1.0-627-g9e448db9/src/osd/PeeringState.cc: 981: ceph_abort_msg("past_interval start interval mismatch")
2021-03-09T02:06:34.264 INFO:tasks.ceph.osd.1.smithi180.stderr: ceph version 16.1.0-627-g9e448db9 (9e448db9e385343da144613d840349182443aeee) pacific (rc)
2021-03-09T02:06:34.264 INFO:tasks.ceph.osd.1.smithi180.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x55aed1c5a7e0]
2021-03-09T02:06:34.264 INFO:tasks.ceph.osd.1.smithi180.stderr: 2: (PeeringState::check_past_interval_bounds() const+0x6ed) [0x55aed1fdc62d]
2021-03-09T02:06:34.264 INFO:tasks.ceph.osd.1.smithi180.stderr: 3: (PeeringState::GetInfo::GetInfo(boost::statechart::state<PeeringState::GetInfo, PeeringState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x145) [0x55aed2009d95]
2021-03-09T02:06:34.264 INFO:tasks.ceph.osd.1.smithi180.stderr: 4: (boost::statechart::state<PeeringState::Primary, PeeringState::Started, PeeringState::Peering, (boost::statechart::history_mode)0>::deep_construct(boost::intrusive_ptr<PeeringState::Started> const&, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&)+0x146) [0x55aed202efc6]
2021-03-09T02:06:34.265 INFO:tasks.ceph.osd.1.smithi180.stderr: 5: (boost::statechart::simple_state<PeeringState::Start, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x15b) [0x55aed203453b]
2021-03-09T02:06:34.265 INFO:tasks.ceph.osd.1.smithi180.stderr: 6: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xa7) [0x55aed201db97]
2021-03-09T02:06:34.265 INFO:tasks.ceph.osd.1.smithi180.stderr: 7: (PeeringState::activate_map(PeeringCtx&)+0x1c2) [0x55aed1fd8032]
2021-03-09T02:06:34.265 INFO:tasks.ceph.osd.1.smithi180.stderr: 8: (PG::handle_activate_map(PeeringCtx&)+0x127) [0x55aed1e12147]
2021-03-09T02:06:34.266 INFO:tasks.ceph.osd.1.smithi180.stderr: 9: (OSD::handle_pg_create_info(std::shared_ptr<OSDMap const> const&, PGCreateInfo const*)+0x711) [0x55aed1d5ae91]
2021-03-09T02:06:34.266 INFO:tasks.ceph.osd.1.smithi180.stderr: 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x33b2) [0x55aed1d80022]
2021-03-09T02:06:34.266 INFO:tasks.ceph.osd.1.smithi180.stderr: 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55aed23e3fd4]
2021-03-09T02:06:34.266 INFO:tasks.ceph.osd.1.smithi180.stderr: 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55aed23e6c74]
2021-03-09T02:06:34.266 INFO:tasks.ceph.osd.1.smithi180.stderr: 13: /lib64/libpthread.so.0(+0x814a) [0x7fd559e3614a]

/a/yuriw-2021-03-08_21:03:18-rados-wip-yuri5-testing-2021-03-08-1049-pacific-distro-basic-smithi/5947593 - no logs


Files

PG 8.243.xlsx (12.2 KB) PG 8.243.xlsx Shu Yu, 02/10/2022 08:50 AM
osds_log.tar.gz (13.6 KB) osds_log.tar.gz Shu Yu, 02/10/2022 08:50 AM

Related issues 15 (1 open14 closed)

Related to crimson - Bug #55550: crimson: check_past_interval_bounds() assert failureResolvedMatan Breizman

Actions
Has duplicate RADOS - Bug #52212: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #52160: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #52159: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #55549: OSDs crashingResolved

Actions
Has duplicate RADOS - Bug #56289: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #54710: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #54709: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #54708: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #59777: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #59778: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Has duplicate RADOS - Bug #59779: crash: void PeeringState::check_past_interval_bounds() const: abortDuplicate

Actions
Precedes RADOS - Feature #64002: Allow clusters to recover from "past_interval start interval mismatch"NewMatan Breizman

Actions
Copied to RADOS - Backport #61149: pacific: osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") startResolvedMatan BreizmanActions
Copied to RADOS - Backport #61150: quincy: osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") startResolvedMatan BreizmanActions
Actions #1

Updated by Sage Weil about 3 years ago

  • Priority changed from Normal to Urgent

/a/sage-2021-03-20_01:11:45-rados-wip-sage-testing-2021-03-19-1647-distro-basic-smithi/5982282

with logs!

Actions #2

Updated by Neha Ojha about 3 years ago

  • Subject changed from osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") to osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") start
2021-03-20T05:10:47.151+0000 7f29a5cd4700 -1 osd.11 pg_epoch: 375 pg[3.11s0( empty local-lis/les=0/0 n=0 ec=109/27 lis/c=114/114 les/c/f=115/115/0 sis=375) [11,4,6,0,3,7]p11(0) r=0 lpr=375 pi=[362,375)/0 crt=0'0 mlcod 0'0 peering mbc={}] 3.11s0 past_intervals [362,375) start interval does not contain the required bound [151,375) start

looks like this pg had the same pi when it got created

2021-03-20T05:10:47.150+0000 7f29a5cd4700 20 osd.11 op_wq(1) _process 3.11s0 no pg, should create on OpSchedulerItem(3.11s0 PGPeeringEvent(epoch_sent: 375 epoch_requested: 375 MNotifyRec 3.11s0 from 9(5) notify: (query:375 sent:375 3.11s5( v 304'349 (0'0,304'349] local-lis/les=114/115 n=11 ec=109/27 lis/c=114/114 les/c/f=115/115/0 sis=375) 5->0 ([362,374] all_participants=0(3),3(4),4(1),6(2),7(5),10(0) intervals=)) features: 0x3f01cfbb7ffdffff +create_info) prio 255 cost 10 e375)
2021-03-20T05:10:47.151+0000 7f29a5cd4700 10 osd.11 pg_epoch: 375 pg[3.11s0( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) []p? r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown mbc={}] init role 0 up [11,4,6,0,3,7] acting [11,4,6,0,3,7] history ec=109/27 lis/c=114/114 les/c/f=115/115/0 sis=375 past_intervals ([362,374] all_participants=0(3),3(4),4(1),6(2),7(5),10(0) intervals=)
2021-03-20T05:10:47.151+0000 7f29a5cd4700 20 osd.11 pg_epoch: 375 pg[3.11s0( empty local-lis/les=0/0 n=0 ec=109/27 lis/c=114/114 les/c/f=115/115/0 sis=375) [11,4,6,0,3,7]p11(0) r=0 lpr=0 pi=[362,375)/0 crt=0'0 mlcod 0'0 unknown mbc={}] on_new_interval

but get_required_past_interval_bounds determines start to be 151 based on oldest_map, which is osd->get_superblock().oldest_map

If my understanding is correct, this is logged here

2021-03-20T05:10:47.147+0000 7f29b34ef700 10 osd.11 374 write_superblock sb(0437ecdc-8b43-4f3f-ab79-d7775b8c351f osd.11 de2bf8dc-46eb-4787-ad9f-794f5c25bfc7 e375 [151,375] lci=[18,375])

We need to figure out which start value one is incorrect.

Actions #3

Updated by Sage Weil about 3 years ago

2021-04-02T15:15:41.194 INFO:tasks.ceph.osd.7.smithi043.stderr:2021-04-02T15:15:41.191+0000 7fddaf38e700 -1 log_channel(cluster) log [ERR] : 1.4 past_intervals [836,871) start interval does not contain the required bound [675,871) start

/a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017117
(no logs)
Actions #4

Updated by Neha Ojha almost 3 years ago

/a/yuriw-2021-06-02_18:33:05-rados-wip-yuri3-testing-2021-06-02-0826-pacific-distro-basic-smithi/6147462 - with logs!

Actions #5

Updated by Neha Ojha over 2 years ago

/a/yuriw-2021-07-21_19:04:13-rados-wip-yuri4-testing-2021-07-21-0859-pacific-distro-basic-smithi/6285380

Actions #6

Updated by Neha Ojha over 2 years ago

  • Has duplicate Bug #52212: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #7

Updated by Neha Ojha over 2 years ago

  • Has duplicate Bug #52160: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #8

Updated by Neha Ojha over 2 years ago

  • Has duplicate Bug #52159: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #9

Updated by Sage Weil over 2 years ago

/a/sage-2021-09-16_18:04:19-rados-wip-sage-testing-2021-09-16-1020-distro-basic-smithi/6393058

note that this is mapgap:
description: rados/thrash/{0-size-min-size-overrides/2-size-2-min-size 1-pg-log-overrides/normal_pg_log
2-recovery-overrides/{default} 3-scrub-overrides/{max-simultaneous-scrubs-2} backoff/normal
ceph clusters/{fixed-2 openstack} crc-failures/bad_map_crc_failure d-balancer/on
mon_election/connectivity msgr-failures/osd-dispatch-delay msgr/async-v1only objectstore/bluestore-comp-lz4
rados supported-random-distro$/{rhel_8} thrashers/mapgap thrashosds-health workloads/cache-agent-big}

and the failure is preceded by

2021-09-17T03:58:13.020 INFO:tasks.ceph.osd.5.smithi142.stderr:2021-09-17T03:58:13.019+0000 7feac3e5c700 -1 osd.5 434 failed to load OSD map for epoch 582, got 0 bytes
2021-09-17T03:58:13.021 INFO:tasks.ceph.osd.5.smithi142.stderr:2021-09-17T03:58:13.020+0000 7feac3e5c700 -1 osd.5 434 handle_osd_map injecting map crc failure
2021-09-17T03:58:13.172 INFO:tasks.ceph.osd.5.smithi142.stderr:2021-09-17T03:58:13.025+0000 7feab4e3e700 -1 osd.5 583 failed to load OSD map for epoch 435, got 0 bytes
...
2021-09-17T03:58:13.485 INFO:tasks.ceph.osd.5.smithi142.stderr:2021-09-17T03:58:13.051+0000 7feab4e3e700 -1 osd.5 623 failed to load OSD map for epoch 582, got 0 bytes

Actions #10

Updated by Anonymous over 2 years ago

Neha Ojha wrote:

[...]

looks like this pg had the same pi when it got created

[...]

but get_required_past_interval_bounds determines start to be 151 based on oldest_map, which is osd->get_superblock().oldest_map

If my understanding is correct, this is logged here

[...]

We need to figure out which start value one is incorrect.

I meet similar issue at ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)

-283> 2021-12-08 21:56:28.295 7fdf3dad6700 -1 log_channel(cluster) log [ERR] : 8.243s5 past_intervals [2021,2526) start interval does not contain the required bound [1705,2526) start [LogChannel::do_log:224]
-278> 2021-12-08 21:56:28.295 7fdf3dad6700 -1 osd.1 pg_epoch: 2528 pg[8.243s5( v 1094'5 lc 1094'3 (0'0,1094'5] local-lis/les=2526/2527 n=0 ec=1070/1070 lis/c 2526/1491 les/c/f 2527/1503/0 2526/2526/2526) [124,118,93,33,55,1,36,69]p124(0) r=5 lpr=2527 pi=[2021,2526)/1 crt=1094'5 lcod 0'0 unknown m=2 mbc={}] 8.243s5 past_intervals [2021,2526) start interval does not contain the required bound [1705,2526) start [PG::check_past_interval_bounds:952]
-1> 2021-12-08 21:56:28.304 7fdf3dad6700 -1 /workspace/rpmbuild/BUILD/ceph-14.2.8/src/osd/PG.cc: In function 'void PG::check_past_interval_bounds() const' thread 7fdf3dad6700 time 2021-12-08 21:56:28.296099
/workspace/rpmbuild/BUILD/ceph-14.2.8/src/osd/PG.cc: 955: ceph_abort_msg("past_interval start interval mismatch")
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xe0) [0x555bcc1e14a6]
2: (PG::check_past_interval_bounds() const+0x12b7) [0x555bcc42df77]
3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x6f8) [0x555bcc47a468]
4: (boost::statechart::simple_state&lt;PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list&lt;mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na&gt;, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x75) [0x555bcc4efa85]
5: (PG::handle_advance_map(std::shared_ptr&lt;OSDMap const&gt;, std::shared_ptr&lt;OSDMap const&gt;, std::vector&lt;int, std::allocator&lt;int&gt; >&, int, std::vector&lt;int, std::allocator&lt;int&gt; >&, int, PG::RecoveryCtx*)+0x53d) [0x555bcc47870d]
6: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2de) [0x555bcc394ffe]
7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr&lt;PGPeeringEvent&gt;, ThreadPool::TPHandle&)+0xa6) [0x555bcc397ec6]
8: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr&lt;PG&gt;&, ThreadPool::TPHandle&)+0x51) [0x555bcc6f5bb1]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8da) [0x555bcc38382a]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x65e) [0x555bccbdd2be]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x555bccbe5700]
12: (()+0x7e65) [0x7facb2094e65]
13: (clone()+0x6d) [0x7facb0f5788d]

set osd->get_superblock().oldest_map from 1705 to 2021, then, fortunately, start osd successfully. I know it is dangers, also do not to know its influence, but I have no idea else to recover it.
Another question, why not add information such as function name and line number in the log?

Actions #11

Updated by Shu Yu over 2 years ago

匿名用户 wrote:

Neha Ojha wrote:

[...]

looks like this pg had the same pi when it got created

[...]

but get_required_past_interval_bounds determines start to be 151 based on oldest_map, which is osd->get_superblock().oldest_map

If my understanding is correct, this is logged here

[...]

We need to figure out which start value one is incorrect.

I meet similar issue at ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)

-283> 2021-12-08 21:56:28.295 7fdf3dad6700 -1 log_channel(cluster) log [ERR] : 8.243s5 past_intervals [2021,2526) start interval does not contain the required bound [1705,2526) start [LogChannel::do_log:224]
-278> 2021-12-08 21:56:28.295 7fdf3dad6700 -1 osd.1 pg_epoch: 2528 pg[8.243s5( v 1094'5 lc 1094'3 (0'0,1094'5] local-lis/les=2526/2527 n=0 ec=1070/1070 lis/c 2526/1491 les/c/f 2527/1503/0 2526/2526/2526) [124,118,93,33,55,1,36,69]p124(0) r=5 lpr=2527 pi=[2021,2526)/1 crt=1094'5 lcod 0'0 unknown m=2 mbc={}] 8.243s5 past_intervals [2021,2526) start interval does not contain the required bound [1705,2526) start [PG::check_past_interval_bounds:952]
-1> 2021-12-08 21:56:28.304 7fdf3dad6700 -1 /workspace/rpmbuild/BUILD/ceph-14.2.8/src/osd/PG.cc: In function 'void PG::check_past_interval_bounds() const' thread 7fdf3dad6700 time 2021-12-08 21:56:28.296099
/workspace/rpmbuild/BUILD/ceph-14.2.8/src/osd/PG.cc: 955: ceph_abort_msg("past_interval start interval mismatch")
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xe0) [0x555bcc1e14a6]
2: (PG::check_past_interval_bounds() const+0x12b7) [0x555bcc42df77]
3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x6f8) [0x555bcc47a468]
4: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x75) [0x555bcc4efa85]
5: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x53d) [0x555bcc47870d]
6: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2de) [0x555bcc394ffe]
7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa6) [0x555bcc397ec6]
8: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x555bcc6f5bb1]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8da) [0x555bcc38382a]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x65e) [0x555bccbdd2be]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x555bccbe5700]
12: (()+0x7e65) [0x7facb2094e65]
13: (clone()+0x6d) [0x7facb0f5788d]

set osd->get_superblock().oldest_map from 1705 to 2021, then, fortunately, start osd successfully. I know it is dangers, also do not to know its influence, but I have no idea else to recover it.
Another question, why not add information such as function name and line number in the log?

Missing a message, PG 8.243 status
  1. ceph pg ls 8 | grep -w 8.243
    8.243 0 0 0 0 0 0 0 5 active+clean 4h 1094'5 3422:1522 [124,118,93,33,63,11,36,69]p124 [124,118,93,33,63,11,36,69]p124 2021-12-07 05:15:05.075491 2021-12-06 02:41:08.269278
Actions #12

Updated by Neha Ojha about 2 years ago

  • Assignee set to Matan Breizman
Actions #13

Updated by Matan Breizman about 2 years ago

Shu Yu wrote:

Missing a message, PG 8.243 status
  1. ceph pg ls 8 | grep -w 8.243
    8.243 0 0 0 0 0 0 0 5 active+clean 4h 1094'5 3422:1522 [124,118,93,33,63,11,36,69]p124 [124,118,93,33,63,11,36,69]p124 2021-12-07 05:15:05.075491 2021-12-06 02:41:08.269278

Hi Shu Yu, are you able to reproduce this issue?

Updated by Shu Yu about 2 years ago

Matan Breizman wrote:

Shu Yu wrote:

Missing a message, PG 8.243 status
  1. ceph pg ls 8 | grep -w 8.243
    8.243 0 0 0 0 0 0 0 5 active+clean 4h 1094'5 3422:1522 [124,118,93,33,63,11,36,69]p124 [124,118,93,33,63,11,36,69]p124 2021-12-07 05:15:05.075491 2021-12-06 02:41:08.269278

Hi Shu Yu, are you able to reproduce this issue?

No, I haven't reproduced the issue yet. I collect detail info of PG 8.243, maybe help you.

Actions #15

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v15.2.10, v15.2.13, v15.2.15, v15.2.4, v15.2.5, v15.2.6, v15.2.8 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=3a50bb9444331ec2b94a68f898f8ab0692ec7500e2e335a31ba6491973c0a17b

Assert condition: abort
Assert function: void PeeringState::check_past_interval_bounds() const

Sanitized backtrace:

    PeeringState::check_past_interval_bounds() const
    PeeringState::Reset::react(PeeringState::AdvMap const&)
    boost::statechart::simple_state<PeeringState::Reset, PeeringState::PeeringMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)
    boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()
    PeeringState::advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)
    PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)
    OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)
    OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)
    ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
    OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
    ShardedThreadPool::shardedthreadpool_worker(unsigned int)
    ShardedThreadPool::WorkThreadSharded::entry()
    clone()

Crash dump sample:
{
    "assert_condition": "abort",
    "assert_file": "osd/PeeringState.cc",
    "assert_func": "void PeeringState::check_past_interval_bounds() const",
    "assert_line": 945,
    "assert_msg": "osd/PeeringState.cc: In function 'void PeeringState::check_past_interval_bounds() const' thread 7f4800572700 time 2022-03-09T05:20:47.210406+1300\nosd/PeeringState.cc: 945: ceph_abort_msg(\"past_interval start interval mismatch\")",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "(()+0x153c0) [0x7f48325003c0]",
        "(gsignal()+0xcb) [0x7f4831f8418b]",
        "(abort()+0x12b) [0x7f4831f63859]",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b9) [0x556d94bcf07d]",
        "(PeeringState::check_past_interval_bounds() const+0x822) [0x556d94f49802]",
        "(PeeringState::Reset::react(PeeringState::AdvMap const&)+0x1f7) [0x556d94f61597]",
        "(boost::statechart::simple_state<PeeringState::Reset, PeeringState::PeeringMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x171) [0x556d94fb0e11]",
        "(boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xb4) [0x556d94f937f4]",
        "(PeeringState::advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x23e) [0x556d94f44c6e]",
        "(PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x105) [0x556d94d49e95]",
        "(OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x373) [0x556d94cbd193]",
        "(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd3) [0x556d94cbf473]",
        "(ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x556d94f2d536]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xe8d) [0x556d94cacdcd]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x556d95391a93]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x556d95394894]",
        "(()+0x9609) [0x7f48324f4609]",
        "(clone()+0x43) [0x7f4832060293]" 
    ],
    "ceph_version": "15.2.15",
    "crash_id": "2022-03-08T16:20:47.234951Z_9503b0e3-acc5-4a6b-9fdf-ccbd3986b15f",
    "entity_name": "osd.58e92efe631744653fa4934cbfb8773015772a5d",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "20.04.3 LTS (Focal Fossa)",
    "os_version_id": "20.04",
    "process_name": "ceph-osd",
    "stack_sig": "7bdc9f715c000afbbc2b75000e4d5623bc2daf351addab915c48f83918c9f25e",
    "timestamp": "2022-03-08T16:20:47.234951Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-89-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021" 
}

Actions #16

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #17

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #18

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #19

Updated by Radoslaw Zarzynski about 2 years ago

  • Status changed from New to Need More Info
  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #20

Updated by Laura Flores almost 2 years ago

Actions #21

Updated by Radoslaw Zarzynski almost 2 years ago

Actions #22

Updated by Radoslaw Zarzynski almost 2 years ago

Actions #23

Updated by Radoslaw Zarzynski almost 2 years ago

If more date is necessary, it might be worth no contact Richard Bateman who replicated something awfully similar to this (bug: #49689).

Actions #24

Updated by Matan Breizman almost 2 years ago

  • Status changed from Need More Info to Fix Under Review
  • Pull request ID set to 46808

PR is marked as draft for now.

Actions #25

Updated by Radoslaw Zarzynski over 1 year ago

A note from the bug scrub: work in progress.

Actions #27

Updated by Matan Breizman over 1 year ago

  • Has duplicate Bug #56289: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #28

Updated by Matan Breizman over 1 year ago

  • Has duplicate Bug #54710: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #29

Updated by Matan Breizman over 1 year ago

  • Has duplicate Bug #54709: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #30

Updated by Matan Breizman over 1 year ago

  • Has duplicate Bug #54708: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #31

Updated by Samuel Just over 1 year ago

I wrote up an working explanation of PastIntervals in https://github.com/athanatos/ceph/tree/sjust/wip-49689-past-intervals-bounds . I think the last paragraph explains the case for pg 3.11 /a/sage-2021-03-20_01:11:45-rados-wip-sage-testing-2021-03-19-1647-distro-basic-smithi/5982282 as well as a fix. Haven't checked the other crashes, I'm a bit suspicious of the map gap PastIntervals trimming case.

Actions #32

Updated by Radoslaw Zarzynski over 1 year ago

  • Pull request ID changed from 46808 to 48706

Updated the PR link.

Actions #33

Updated by Neha Ojha over 1 year ago

I think https://tracker.ceph.com/issues/49689#note-31 makes sense and the following logs also show what max_oldest_map was set to.

-2754> 2021-03-20T05:10:50.143+0000 7fe928542700 10 osd.1 378 build_incremental_map_msg oldest map 362 > since 338, starting with full map
...
-702> 2021-03-20T05:10:50.150+0000 7fe929d45700 -1 osd.1 pg_epoch: 378 pg[3.fs5( empty local-lis/les=0/0 n=0 ec=27/27 lis/c=27/27 les/c/f=28/28/0 sis=378) [NONE,NONE,NONE,NONE,NONE,1]p1(5) r=5 lpr=378 pi=[362,378)/0 crt=0'0 mlcod 0'0 peering mbc={}] 3.fs5 past_intervals [362,378) start interval does not contain the required bound [241,378) start
Actions #34

Updated by Radoslaw Zarzynski over 1 year ago

bumping up (fix awaits qa)

Actions #35

Updated by Radoslaw Zarzynski over 1 year ago

Label assigned but blocked due to the lab issue. Bump up.

Actions #36

Updated by Radoslaw Zarzynski over 1 year ago

bump up

Actions #37

Updated by Radoslaw Zarzynski about 1 year ago

Looks it's ready for merge now. Yuri is pinged.

Actions #38

Updated by Matan Breizman about 1 year ago

  • Status changed from Fix Under Review to In Progress

Marking as "In Progress" again until requested changes are ready for review.

Actions #39

Updated by Matan Breizman about 1 year ago

  • Status changed from In Progress to Fix Under Review
Actions #40

Updated by Matan Breizman about 1 year ago

  • Status changed from Fix Under Review to Resolved
Actions #41

Updated by Telemetry Bot 11 months ago

  • Related to Bug #59777: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #42

Updated by Telemetry Bot 11 months ago

  • Related to Bug #59778: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #43

Updated by Telemetry Bot 11 months ago

  • Related to Bug #59779: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #44

Updated by Matan Breizman 11 months ago

  • Status changed from Resolved to Pending Backport
  • Backport changed from pacific to pacific,quincy

There are multiple Telemetry reports of this issue in P and Q, we should consider backporting this patch.

Actions #45

Updated by Backport Bot 11 months ago

  • Copied to Backport #61149: pacific: osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") start added
Actions #46

Updated by Backport Bot 11 months ago

  • Copied to Backport #61150: quincy: osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") start added
Actions #47

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed
Actions #48

Updated by Matan Breizman 11 months ago

  • Related to Bug #55550: crimson: check_past_interval_bounds() assert failure added
Actions #49

Updated by Matan Breizman 11 months ago

  • Related to deleted (Bug #59777: crash: void PeeringState::check_past_interval_bounds() const: abort)
Actions #50

Updated by Matan Breizman 11 months ago

  • Has duplicate Bug #59777: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #51

Updated by Matan Breizman 11 months ago

  • Related to deleted (Bug #59778: crash: void PeeringState::check_past_interval_bounds() const: abort)
Actions #52

Updated by Matan Breizman 11 months ago

  • Has duplicate Bug #59778: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #53

Updated by Matan Breizman 11 months ago

  • Related to deleted (Bug #59779: crash: void PeeringState::check_past_interval_bounds() const: abort)
Actions #54

Updated by Matan Breizman 11 months ago

  • Has duplicate Bug #59779: crash: void PeeringState::check_past_interval_bounds() const: abort added
Actions #55

Updated by Matan Breizman 8 months ago

  • Status changed from Pending Backport to Resolved
Actions #56

Updated by Matan Breizman 3 months ago

  • Precedes Feature #64002: Allow clusters to recover from "past_interval start interval mismatch" added
Actions

Also available in: Atom PDF