Bug #56772
crash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(clone))
Crash signatures:
024700b37b3bc297e5ee455cf52e654a74ae210b2b22373907a02cc03ead01dc
06e6bbcc8c71a6aa76e8dab243a863e6c78e9514a984c55ec83e41d61a3992dd
42f9a44e4bf33699555adb6d0b49c1591a9863e12b85c55bbcc9f242cb2a22c0
60725542a1c89bcbbef2c4064b394e12a97f163a26965da4bd3f6bb9c8c6cf61
6498d1dfa7bd41871364749773d98ed09954154af5fb9f2c980d9a113a52cbb2
6c68454e46110849a860c677e714543dae8e2887aa732ee649f4fc377c06569d
8a3841a95de3adc1f0f138d0a8e924e408431da8cbfd0087e4e7e6b27f59e2f7
8a614bf022366c6396ef6fa003d0f34aa3ad200648d22e4ba6927fc0708dedce
ed2559dbe1bf62cdd1e035c0cd0fe992b9b71f38ecb101cbb45d0e2a035a52b2
f6ba2de1146c829d35872b7eeb49196b5bfa59250e5e4d07805b53ce0d5ee7ce
Description
Assert condition: clone_overlap.count(clone)
Assert function: uint64_t SnapSet::get_clone_bytes(snapid_t) const
Sanitized backtrace:
SnapSet::get_clone_bytes(snapid_t) const
PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)
PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)
PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)
OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)
ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
ShardedThreadPool::shardedthreadpool_worker(unsigned int)
ShardedThreadPool::WorkThreadSharded::entry()
Crash dump sample:
{ "assert_condition": "clone_overlap.count(clone)", "assert_file": "osd/osd_types.cc", "assert_func": "uint64_t SnapSet::get_clone_bytes(snapid_t) const", "assert_line": 5783, "assert_msg": "osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fe44f446700 time 2022-07-26T23:38:26.703737+1200\nosd/osd_types.cc: 5783: FAILED ceph_assert(clone_overlap.count(clone))", "assert_thread_name": "tp_osd_tp", "backtrace": [ "/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fe46c5c0140]", "gsignal()", "abort()", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x564bdcba0e78]", "/usr/bin/ceph-osd(+0xac0fb9) [0x564bdcba0fb9]", "(SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x564bdce988b3]", "(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x564bdcd8f16e]", "(PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x17d5) [0x564bdcdf56d5]", "(PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf35) [0x564bdcdfb3f5]", "(OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x564bdcc6be05]", "(ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x564bdcedc019]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x564bdcc89367]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x564bdd3323da]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564bdd3349b0]", "/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fe46c5b4ea7]", "clone()" ], "ceph_version": "16.2.9", "crash_id": "2022-07-26T11:38:26.726934Z_7e80ef3c-e1c0-4957-9ca3-e4fc548200da", "entity_name": "osd.5dd74b9e9d0b76a491b6a1cc86dffd15caf7242f", "os_id": "11", "os_name": "Debian GNU/Linux 11 (bullseye)", "os_version": "11 (bullseye)", "os_version_id": "11", "process_name": "ceph-osd", "stack_sig": "024700b37b3bc297e5ee455cf52e654a74ae210b2b22373907a02cc03ead01dc", "timestamp": "2022-07-26T11:38:26.726934Z", "utsname_machine": "x86_64", "utsname_release": "5.15.39-1-pve", "utsname_sysname": "Linux", "utsname_version": "#1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)" }
Related issues
- Duplicated by Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed
History
#1 Updated by Telemetry Bot over 1 year ago
#2 Updated by Radoslaw Zarzynski about 1 year ago
- Duplicated by Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed added
#3 Updated by Thomas Le Gentil about 1 year ago
This bug is present in v17.2.5
#4 Updated by Thomas Le Gentil 12 months ago
Hi, would it be possible to raise the priority of this bug to High (as well as #57940), since it prevents the incomplete PG from recovering and makes the pool unavailable?
Thanks
#5 Updated by huy nguyen 10 months ago
Hi, the OSD crashes whenever it tries to backfill to the target OSDs. If the situation persists, it may cause data loss.
For now, I have to backfill it manually with ceph-objectstore-tool, but I think the issue will appear again in the future.
So can this case be raised to high priority?
Thanks
#6 Updated by Telemetry Bot 7 months ago
- Affected Versions v16.2.11, v17.2.1, v17.2.4, v17.2.5 added
#7 Updated by Achim Ledermüller 5 months ago
Hi,
we have the same issue with version 14.2.22. Please raise the issue to a higher priority!
@Huy Nguyen
Can you please describe how you did the manual backfill? Just a plain export/import with ceph-objectstore-tool like:
ceph-objectstore-tool --op export --pgid <id> --data-path <path> --journal-path <path> --file <id>.export
Can I import the PG into any OSD in the `up` set of the PG? Or should I use a new OSD with weight 0?
Kind regards,
Achim
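For anyone following along, here is a rough, untested sketch of the export/import flow discussed above. The OSD ids, the pgid, and the paths are placeholders; ceph-objectstore-tool must only be run against stopped OSDs, and --journal-path is only needed for FileStore OSDs:

# Placeholders throughout: OSD 42 holds a good copy of the PG, OSD 7 is the
# import target, and the PG id is 2.1f. Stop each OSD before touching its store.
systemctl stop ceph-osd@42
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
    --op export --pgid 2.1f --file /tmp/2.1f.export

systemctl stop ceph-osd@7
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
    --op import --file /tmp/2.1f.export

systemctl start ceph-osd@42
systemctl start ceph-osd@7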