Bug #46705

Negative peer_num_objects crashes osd

Added by xie xingguo 4 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport: octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Component(RADOS): OSD
Pull request ID:
Crash signature:

Description

https://pulpito.ceph.com/xxg-2020-07-20_02:56:08-rados:thrash-nautilus-lie-distro-basic-smithi/5240518/

Full stack:

 ceph version 14.2.10-211-g951f3f726d7 (951f3f726d7ae38e9407dc144a36197de856fa80) nautilus (stable)
 1: (()+0xf630) [0x7f3548c4b630]
 2: (gsignal()+0x37) [0x7f3547e6f387]
 3: (abort()+0x148) [0x7f3547e70a78]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5619811261f8]
 5: (()+0x4ce3d0) [0x5619811263d0]
 6: (PG::choose_acting(pg_shard_t&, bool, bool*, bool)+0x148a) [0x5619812bcf5a]
 7: (PG::RecoveryState::Recovered::Recovered(boost::statechart::state<PG::RecoveryState::Recovered, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x193) [0x5619812eecd3]
 8: (PG::RecoveryState::Backfilling::react(PG::Backfilled const&)+0x94) [0x5619812fa794]
 9: (boost::statechart::simple_state<PG::RecoveryState::Backfilling, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xbd) [0x56198133ef7d]
 10: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x56198131989a]
 11: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2ca) [0x5619813058ea]
 12: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x20c) [0x5619812372ac]
 13: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x5619814bb9a1]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x891) [0x56198122b321]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c6) [0x56198182bed6]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56198182f450]
 17: (()+0x7ea5) [0x7f3548c43ea5]
 18: (clone()+0x6d) [0x7f3547f378dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Log:

 -5645> 2020-07-20 04:34:32.067 7f351e329700  5 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] exit Started/Primary/Active/Backfilling 0.090218 5 0.000236
 -5642> 2020-07-20 04:34:32.067 7f351e329700  5 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] enter Started/Primary/Active/Recovered
 -5640> 2020-07-20 04:34:32.067 7f351e329700 10 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] needs_recovery is recovered
 -5637> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats actingset 5,6 upset 3,6 acting_recovery_backfill 3,5,6
 -5636> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats acting [5,6] up [3,6]
 -5633> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats shard 5 primary objects 0 missing 0
 -5632> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats shard 3 objects -1 missing 1
 -5631> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats shard 6 objects 0 missing 0
 -5630> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats object_location_counts {}
 -5629> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats missing shard 6 missing= 0
 -5628> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats missing shard 3 missing= 1
 -5627> 2020-07-20 04:34:32.068 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats acting shard 5 missing= 0
 -5626> 2020-07-20 04:34:32.068 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats degraded 0
 -5625> 2020-07-20 04:34:32.068 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] _update_calc_stats misplaced 1
 -5624> 2020-07-20 04:34:32.068 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] publish_stats_to_osd reporting purged_snaps []
 -5623> 2020-07-20 04:34:32.068 7f351e329700 15 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] publish_stats_to_osd 667:15701
 -5622> 2020-07-20 04:34:32.068 7f351e329700 10 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] choose_acting all_info osd.3 6.1f( v 667'8335 (583'5325,667'8335] local-lis/les=665/666 n=-1 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665)
 -5621> 2020-07-20 04:34:32.068 7f351e329700 10 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] choose_acting all_info osd.5 6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665)
 -5620> 2020-07-20 04:34:32.068 7f351e329700 10 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] choose_acting all_info osd.6 6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665)
 -5619> 2020-07-20 04:34:32.068 7f351e329700 10 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] calc_replicated_acting newest update on osd.5 with 6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) restrict_to_up_acting
 -5618> 2020-07-20 04:34:32.068 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] choose_async_recovery_replicated candidates by cost are: 1,3
 -5617> 2020-07-20 04:34:32.068 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] choose_async_recovery_replicated final candidates by cost are: 1,3
 -5616> 2020-07-20 04:34:32.068 7f351e329700 20 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] choose_async_recovery_replicated result want=[5,6] async_recovery=3
 -5615> 2020-07-20 04:34:32.068 7f351e329700 10 osd.5 pg_epoch: 667 pg[6.1f( v 667'8335 (577'5259,667'8335] local-lis/les=665/666 n=0 ec=546/498 lis/c 665/619 les/c/f 666/620/0 664/665/665) [3,6]/[5,6] backfill=[3] r=0 lpr=665 pi=[619,665)/1 luod=666'8334 crt=667'8335 lcod 666'8333 mlcod 666'8333 active+remapped mbc={255={}}] acting_recovery_backfill is 3,5,6
   -10> 2020-07-20 04:34:33.298 7f351e329700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.10-211-g951f3f726d7/rpm/el7/BUILD/ceph-14.2.10-211-g951f3f726d7/src/osd/PG.cc: In function 'bool PG::choose_acting(pg_shard_t&, bool, bool*, bool)' thread 7f351e329700 time 2020-07-20 04:34:33.175080
    -9> 2020-07-20 04:34:33.325 7f351e329700 -1 *** Caught signal (Aborted) **
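
In the log above, shard 3 reports `objects -1 missing 1` in `_update_calc_stats`, and `choose_acting all_info` shows `n=-1` for osd.3, shortly before the assert in `PG::choose_acting`. To look for the same pattern in other OSD logs, a small filter along these lines may help. It is only a sketch: the helper name `find_negative_object_counts` and the regular expression are hypothetical, and it assumes debug level 20 output in the format shown in this ticket.

```python
import re

# Matches the per-shard lines printed by _update_calc_stats, e.g.
#   "_update_calc_stats shard 3 objects -1 missing 1"
#   "_update_calc_stats shard 5 primary objects 0 missing 0"
# The format is taken from the log in this ticket; other releases may differ.
SHARD_STATS = re.compile(
    r"_update_calc_stats shard (?P<shard>\S+)(?: primary)? "
    r"objects (?P<objects>-?\d+) missing (?P<missing>-?\d+)"
)


def find_negative_object_counts(log_lines):
    """Return (shard, objects, missing) for each shard line with objects < 0."""
    hits = []
    for line in log_lines:
        m = SHARD_STATS.search(line)
        if m and int(m.group("objects")) < 0:
            hits.append(
                (m.group("shard"), int(m.group("objects")), int(m.group("missing")))
            )
    return hits


# Shortened copies of three lines from the log above.
sample = [
    "osd.5 pg_epoch: 667 pg[6.1f(...)] _update_calc_stats shard 5 primary objects 0 missing 0",
    "osd.5 pg_epoch: 667 pg[6.1f(...)] _update_calc_stats shard 3 objects -1 missing 1",
    "osd.5 pg_epoch: 667 pg[6.1f(...)] _update_calc_stats shard 6 objects 0 missing 0",
]
print(find_negative_object_counts(sample))
```

On the sample lines only shard 3 is reported. The `missing shard N missing= M` lines are deliberately not matched, since they carry no object count.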

Related issues

Copied to RADOS - Backport #46709: octopus: Negative peer_num_objects crashes osd Resolved
Copied to RADOS - Backport #46710: nautilus: Negative peer_num_objects crashes osd Resolved

History

#1 Updated by Kefu Chai 4 months ago

  • Status changed from New to Pending Backport
  • Pull request ID set to 36274

#2 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #46709: octopus: Negative peer_num_objects crashes osd added

#3 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #46710: nautilus: Negative peer_num_objects crashes osd added

#4 Updated by Nathan Cutler 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
