Bug #65453 (open): osd crashes due to outdated recovery ops

Added by Xuehan Xu 24 days ago. Updated 21 days ago.

Status: Fix Under Review
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

PGs' recovery backends don't discard old recovery ops, i.e. ops that originated in a previous peering interval. In the log below, pg 3.5 advances from map epoch 74 to 75, detects an interval change, exits Started/Primary/Active/Recovering, and enters Reset to re-peer. A MOSDPGRecoveryDeleteReply that was sent in epoch 74 is nevertheless dispatched to the PG afterwards; applying it calls on_peer_recover() for an object that is no longer tracked in the missing set, and the OSD aborts on ceph_assert(p != missing.end()) in pg_missing_set<TrackChanges>::got().

DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd - pg_advance_map(id=50513, detail=PGAdvanceMap(pg=3.5 from=74 to=75)): advancing map to 75
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 74 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+recovering+undersized+remapped PeeringState::advance_map handle_advance_map {}/{} -- -1/-1
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+recovering+undersized+remapped PeeringState::should_restart_peering new interval newup {} newacting {}
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+recovering+undersized+remapped state<Started/Primary/Active>: Active advmap interval change, fast return
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+recovering+undersized+remapped state<Started>: Started advmap
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+recovering+undersized+remapped PeeringState::should_restart_peering new interval newup {} newacting {}
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+recovering+undersized+remapped state<Started>: should_restart_peering, transitioning to Reset
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+recovering+undersized+remapped exit Started/Primary/Active/Recovering 0.349949 1 0.000187
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd - Exiting state: Started/Primary/Active/Recovering, entered at 1712832095.8452494, 0.000187741 spent on 1 events
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+undersized+remapped exit Started/Primary/Active 39.220619 0 0.000000
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd - Exiting state: Started/Primary/Active, entered at 1712832056.9745953, 0 spent on 0 events
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped exit Started/Primary 40.222177 0 0.000000
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd - Exiting state: Started/Primary, entered at 1712832055.9730542, 0 spent on 0 events
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped exit Started 40.222251 0 0.000000
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd - Exiting state: Started, entered at 1712832055.9729998, 0 spent on 0 events
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped enter Reset
INFO  2024-04-11 10:41:36,195 [shard 0:main] osd - Entering state: Reset
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=53 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped PeeringState::set_last_peering_reset set_last_peering_reset 75
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=75 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped PeeringState::set_last_peering_reset Clearing blocked outgoing recovery messages
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd - PG::try_flush_or_schedule_async: flush ...
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=75 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped PeeringState::set_last_peering_reset Beginning to block outgoing recovery messages
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=75 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped state<Reset>: Reset advmap
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=75 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped PeeringState::should_restart_peering new interval newup {} newacting {}
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=75 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped state<Reset>: should restart peering, calling start_peering_interval again
DEBUG 2024-04-11 10:41:36,195 [shard 0:main] osd -  pg_epoch 75 pg[3.5( v 69'6760 (0'0,69'6760] local-lis/les=53/54 n=1870 ec=15/15 lis/c=53/31 les/c/f=54/32/0 sis=53) [1,3,2]/[2,3] async=[1] r=0 lpr=75 pi=[31,53)/1 luod=69'6761 lua=51'6074 crt=69'6761 mlcod 47'5221 active+remapped PeeringState::set_last_peering_reset set_last_peering_reset 75

DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - 0x300005e91ee8 PeeringEvent<T>::with_pg: 0x300005e91ee8 peering_event(id=50534, detail=PeeringEvent(from=2 pgid=3.e sent=75 requested=75 evt=epoch_sent: 75 epoch_requested: 75 IntervalFlush)): pg present
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268)): starting start_pg_operation
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268)): start_pg_operation in await_active stage
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268)): start_pg_operation active, entering await_map
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268)): start_pg_operation await_map stage
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268)): got map 74, entering get_pg_mapping
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268)): can_create=false, target-core=0
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268)): entering create_or_wait_pg
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268)): have_pg
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - 0x3000024df2c8 RecoverySubRequest::with_pg: RecoverySubRequest::with_pg: background_recovery_sub(id=50535, detail=MOSDPGRecoveryDeleteReply(3.5 e74,53 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head,47'5268))
DEBUG 2024-04-11 10:41:36,205 [shard 0:main] osd - on_peer_recover: 3:a819f193:::benchmark_data_scephqa02.cpp.bjat.qianxin-in_138513_object30329:head, 47'5268 on 1
ERROR 2024-04-11 10:41:36,205 [shard 0:main] none - /da1/xxh/rpmbuild/BUILD/ceph-19.0.0-2539-g9b5dd7459cb/src/osd/osd_types.h:5060 : In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]', ceph_assert(%s)
p != missing.end()
Aborting on shard 0.
Backtrace:
Reactor stalled for 160 ms on shard 0. Backtrace: 0x2ed20ad 0x2e869cd 0x2e86d18 0x2e86ea7 0x12cdf 0x16927b 0x1ce1ae0 0x1ce2cfc 0x1ce3a17 0x1ce07b6 0x1ce1190 0x1ce1648 0x12cdf 0x4ea4e 0x21db4 0x1208355 0x12083ba 0x18e08ce 0x17960b3 0x17e9b78 0x180b90c 0x1734518 0x1481e5a 0x143f67d 0x149a6d2 0x14c50b1 0x14c5406 0x14c5c3e 0x144746a 0x1449882 0x144a0f6 0x1f823d6 0x1f71a22 0x1f723e5 0x1f72cb5 0x1f7382b 0x1f76475 0x2e7e877 0x2e7ec93 0x2ec02c5 0x2ec100c 0x2e0f17c 0x2e0fa04 0x13545fa 0x3aca2 0x13a947d
kernel callstack:
Reactor stalled for 278 ms on shard 0. Backtrace: 0x2ed20ad 0x2e869cd 0x2e86d18 0x2e86ea7 0x12cdf 0x1692bc 0x1ce2ab1 0x1ce3a17 0x1ce07b6 0x1ce1190 0x1ce1648 0x12cdf 0x4ea4e 0x21db4 0x1208355 0x12083ba 0x18e08ce 0x17960b3 0x17e9b78 0x180b90c 0x1734518 0x1481e5a 0x143f67d 0x149a6d2 0x14c50b1 0x14c5406 0x14c5c3e 0x144746a 0x1449882 0x144a0f6 0x1f823d6 0x1f71a22 0x1f723e5 0x1f72cb5 0x1f7382b 0x1f76475 0x2e7e877 0x2e7ec93 0x2ec02c5 0x2ec100c 0x2e0f17c 0x2e0fa04 0x13545fa 0x3aca2 0x13a947d
kernel callstack:
 0# gsignal in /lib64/libc.so.6
 1# abort in /lib64/libc.so.6
 2# ceph::__ceph_assert_fail(char const*, char const*, int, char const*) in ceph-osd
 3# ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...) in ceph-osd
 4# PeeringState::on_peer_recover(pg_shard_t, hobject_t const&, eversion_t const&) in ceph-osd
 5# PGRecovery::on_peer_recover(pg_shard_t, hobject_t const&, ObjectRecoveryInfo const&) in ceph-osd
 6# ReplicatedRecoveryBackend::handle_recovery_delete_reply(boost::intrusive_ptr<MOSDPGRecoveryDeleteReply>) in ceph-osd
 7# ReplicatedRecoveryBackend::handle_recovery_op(boost::intrusive_ptr<MOSDFastDispatchOp>, crimson::local_shared_foreign_ptr<crimson::local_shared_foreign_ptr<seastar::shared_ptr<crimson::net::Connection> > >) in ceph-osd
 8# crimson::osd::RecoverySubRequest::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>) in ceph-osd
 9# auto crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<crimson::osd::PG> > >::safe_then<crimson::osd::PGShardManager::run_with_pg_maybe_wait<crimson::osd::RecoverySubRequest>(crimson::osd::RecoverySubRequest::IRef, crimson::osd::ShardServices&)::{lambda(boost::intrusive_ptr<crimson::osd::PG>)#1}, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::pass_further>(crimson::osd::PGShardManager::run_with_pg_maybe_wait<crimson::osd::RecoverySubRequest>(crimson::osd::RecoverySubRequest::IRef, crimson::osd::ShardServices&)::{lambda(boost::intrusive_ptr<crimson::osd::PG>)#1}&&, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::pass_further&&)::{lambda(crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::pass_further)#1}::operator()<seastar::future<boost::intrusive_ptr<crimson::osd::PG> > >(crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::pass_further) in ceph-osd
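
The failing assertion is easier to read with a minimal, self-contained model of pg_missing_set::got(). This is a paraphrase with stand-in types (std::string for hobject_t, a plain integer for eversion_t), not the actual code from src/osd/osd_types.h:

#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Stand-in for pg_missing_set: got() records that a missing object has
// been recovered and removes it from the set. The real implementation
// is in src/osd/osd_types.h (osd_types.h:5060 in this build).
struct MissingSet {
  std::map<std::string, uint64_t> missing;  // object -> version still needed

  void got(const std::string& oid, uint64_t /*v*/) {
    auto p = missing.find(oid);
    // The assert that fires in this report: the object must still be
    // tracked as missing when the recovery completion is applied.
    assert(p != missing.end());
    missing.erase(p);
  }
};

int main() {
  MissingSet ms;
  ms.missing["obj30329"] = 5268;
  ms.got("obj30329", 5268);   // normal completion: object was tracked
  // A second, stale completion for the same object would abort here,
  // just as the stale MOSDPGRecoveryDeleteReply does in the log above:
  // ms.got("obj30329", 5268);
  return 0;
}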
Updated by Matan Breizman 21 days ago (#1)

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56848
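
A fix along these lines would presumably have the recovery path reject ops that predate the current peering interval before they touch the missing set. A rough, self-contained sketch of that idea (illustrative names only; not necessarily how PR 56848 implements it):

#include <cstdint>
#include <iostream>

// Illustrative message/PG shapes; these are not the crimson-osd types.
struct RecoveryMessage {
  uint64_t epoch_sent;  // map epoch the sender used (e74 in the log above)
};

struct PG {
  uint64_t same_interval_since;  // epoch the current interval began (75 here)

  // Drop any recovery message that predates the current peering interval:
  // the missing set and peer state it refers to are no longer valid.
  bool can_discard(const RecoveryMessage& m) const {
    return m.epoch_sent < same_interval_since;
  }
};

int main() {
  PG pg{75};
  std::cout << std::boolalpha
            << pg.can_discard({74}) << '\n'   // true: stale, discard
            << pg.can_discard({75}) << '\n';  // false: current interval
}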