Bug #56772
opencrash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(clone))
024700b37b3bc297e5ee455cf52e654a74ae210b2b22373907a02cc03ead01dc
06e6bbcc8c71a6aa76e8dab243a863e6c78e9514a984c55ec83e41d61a3992dd
42f9a44e4bf33699555adb6d0b49c1591a9863e12b85c55bbcc9f242cb2a22c0
60725542a1c89bcbbef2c4064b394e12a97f163a26965da4bd3f6bb9c8c6cf61
6498d1dfa7bd41871364749773d98ed09954154af5fb9f2c980d9a113a52cbb2
6c68454e46110849a860c677e714543dae8e2887aa732ee649f4fc377c06569d
8a3841a95de3adc1f0f138d0a8e924e408431da8cbfd0087e4e7e6b27f59e2f7
8a614bf022366c6396ef6fa003d0f34aa3ad200648d22e4ba6927fc0708dedce
ed2559dbe1bf62cdd1e035c0cd0fe992b9b71f38ecb101cbb45d0e2a035a52b2
f6ba2de1146c829d35872b7eeb49196b5bfa59250e5e4d07805b53ce0d5ee7ce
Description
Assert condition: clone_overlap.count(clone)
Assert function: uint64_t SnapSet::get_clone_bytes(snapid_t) const
Sanitized backtrace:
SnapSet::get_clone_bytes(snapid_t) const
PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)
PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)
PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)
OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)
ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
ShardedThreadPool::shardedthreadpool_worker(unsigned int)
ShardedThreadPool::WorkThreadSharded::entry()
Crash dump sample:
{
  "assert_condition": "clone_overlap.count(clone)",
  "assert_file": "osd/osd_types.cc",
  "assert_func": "uint64_t SnapSet::get_clone_bytes(snapid_t) const",
  "assert_line": 5783,
  "assert_msg": "osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fe44f446700 time 2022-07-26T23:38:26.703737+1200\nosd/osd_types.cc: 5783: FAILED ceph_assert(clone_overlap.count(clone))",
  "assert_thread_name": "tp_osd_tp",
  "backtrace": [
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fe46c5c0140]",
    "gsignal()",
    "abort()",
    "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x564bdcba0e78]",
    "/usr/bin/ceph-osd(+0xac0fb9) [0x564bdcba0fb9]",
    "(SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x564bdce988b3]",
    "(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x564bdcd8f16e]",
    "(PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x17d5) [0x564bdcdf56d5]",
    "(PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf35) [0x564bdcdfb3f5]",
    "(OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x564bdcc6be05]",
    "(ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x564bdcedc019]",
    "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x564bdcc89367]",
    "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x564bdd3323da]",
    "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564bdd3349b0]",
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fe46c5b4ea7]",
    "clone()"
  ],
  "ceph_version": "16.2.9",
  "crash_id": "2022-07-26T11:38:26.726934Z_7e80ef3c-e1c0-4957-9ca3-e4fc548200da",
  "entity_name": "osd.5dd74b9e9d0b76a491b6a1cc86dffd15caf7242f",
  "os_id": "11",
  "os_name": "Debian GNU/Linux 11 (bullseye)",
  "os_version": "11 (bullseye)",
  "os_version_id": "11",
  "process_name": "ceph-osd",
  "stack_sig": "024700b37b3bc297e5ee455cf52e654a74ae210b2b22373907a02cc03ead01dc",
  "timestamp": "2022-07-26T11:38:26.726934Z",
  "utsname_machine": "x86_64",
  "utsname_release": "5.15.39-1-pve",
  "utsname_sysname": "Linux",
  "utsname_version": "#1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)"
}
Updated by Telemetry Bot over 1 year ago
Updated by Radoslaw Zarzynski over 1 year ago
- Has duplicate Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed added
Updated by Thomas Le Gentil over 1 year ago
Hi, would it be possible to raise the priority of this bug to High (as well as #57940)? It prevents the incomplete PG from recovering and makes the pool unavailable.
Thanks
Updated by Huy Nguyen about 1 year ago
Hi, the OSD crashes whenever it tries to backfill to the target OSDs. If the situation persists, it may cause data loss.
For now, I have to backfill manually with ceph-objectstore-tool, but I expect the issue to reappear in the future.
So can this case be raised to high priority?
Thanks
Updated by Telemetry Bot 12 months ago
- Affected Versions v16.2.11, v17.2.1, v17.2.4, v17.2.5 added
Updated by Achim Ledermüller 9 months ago
Hi,
we have the same issue with version 14.2.22. Please raise the issue to a higher priority!
@Huy Nguyen
Can you please describe how you did the manual backfill? Was it a plain export/import with ceph-objectstore-tool, like:

```
ceph-objectstore-tool --op export --pgid <id> --data-path <path> --journal-path <path> --file <id>.export
```
Can I import the PG at any OSD in the `up` set of the PG? Or maybe should I use a new OSD with weight 0?
Kind regards,
Achim