Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed - RADOS - Ceph

Bug #57940

Hi, I have this current crash: 

 I had a disk failure in my Ceph cluster. 
 I replaced the disk, but now, during rebalancing/backfilling, one OSD (osd.1) crashes. 

 While the 'nobackfill' flag is set, the OSD does not crash; it crashes right after the flag is removed. 
 The crash from the log looks like https://tracker.ceph.com/issues/56772 

 I've attached the complete log; here is the last part of the crash: 

 <pre> 
  ceph version 17.2.4 (b26dd582fcc41389ea06191f19e88eed6eccea5b) quincy (stable) 
  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7ff3bc315140] 
  2: gsignal() 
  3: abort() 
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x565256aaffca] 
  5: /usr/bin/ceph-osd(+0xc2310e) [0x565256ab010e] 
  6: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x565256df05f3] 
  7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x565256c9a94e] 
  8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x565256d05543] 
  9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x565256d0b42a] 
  10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x565256b7a175] 
  11: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x565256e34879] 
  12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x565256b9abc0] 
  13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x56525727dc1a] 
  14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5652572801f0] 
  15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff3bc309ea7] 
  16: clone() 
 </pre> 

 <pre> 
     -1> 2022-10-25T21:05:48.188+0200 7ff39e1b7700 -1 ./src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7ff39e1b7700 time 2022-10-25T21:05:48.184867+0200 
 ./src/osd/osd_types.cc: 5888: FAILED ceph_assert(clone_overlap.count(clone)) 

  ceph version 17.2.4 (b26dd582fcc41389ea06191f19e88eed6eccea5b) quincy (stable) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x565256aaff70] 
  2: /usr/bin/ceph-osd(+0xc2310e) [0x565256ab010e] 
  3: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x565256df05f3] 
  4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x565256c9a94e] 
  5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x565256d05543] 
  6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x565256d0b42a] 
  7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x565256b7a175] 
  8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x565256e34879] 
  9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x565256b9abc0] 
  10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x56525727dc1a] 
  11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5652572801f0] 
  12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff3bc309ea7] 
  13: clone() 
 </pre>
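
For anyone reading the trace without the source at hand, here is a rough, simplified model of what the failing check in SnapSet::get_clone_bytes appears to guard. This is only a sketch and not the Ceph implementation: the struct name SnapSetModel and the helper try_get_clone_bytes are made up for illustration, plain integers stand in for snapid_t, and the overlap is reduced to a byte count where the real code uses an interval set. The point is that a clone with a size entry but no overlap entry is exactly the case the assert rejects.

```cpp
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-in for Ceph's SnapSet (illustration only).
struct SnapSetModel {
  std::map<uint64_t, uint64_t> clone_size;     // clone id -> clone size in bytes
  std::map<uint64_t, uint64_t> clone_overlap;  // clone id -> overlapping bytes (simplified)

  // Writes the non-overlapping byte count of `clone` to *out and returns true.
  // Returns false in the situations where the real code would hit ceph_assert.
  bool try_get_clone_bytes(uint64_t clone, uint64_t* out) const {
    auto size_it = clone_size.find(clone);
    if (size_it == clone_size.end())
      return false;  // no size recorded for this clone
    auto overlap_it = clone_overlap.find(clone);
    if (overlap_it == clone_overlap.end())
      return false;  // corresponds to FAILED ceph_assert(clone_overlap.count(clone))
    if (size_it->second < overlap_it->second)
      return false;  // overlap larger than the clone itself: inconsistent metadata
    *out = size_it->second - overlap_it->second;
    return true;
  }
};
```

In this model, a snapset whose clone_size has an entry for a clone while clone_overlap does not makes try_get_clone_bytes return false. That would match the state the OSD seems to run into when backfill accounts for the object in add_object_context_to_pg_stat.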
