Bug #56772

crash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(clone))

Added by Telemetry Bot over 1 year ago. Updated 9 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Telemetry
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):

024700b37b3bc297e5ee455cf52e654a74ae210b2b22373907a02cc03ead01dc
06e6bbcc8c71a6aa76e8dab243a863e6c78e9514a984c55ec83e41d61a3992dd
42f9a44e4bf33699555adb6d0b49c1591a9863e12b85c55bbcc9f242cb2a22c0
60725542a1c89bcbbef2c4064b394e12a97f163a26965da4bd3f6bb9c8c6cf61
6498d1dfa7bd41871364749773d98ed09954154af5fb9f2c980d9a113a52cbb2
6c68454e46110849a860c677e714543dae8e2887aa732ee649f4fc377c06569d
8a3841a95de3adc1f0f138d0a8e924e408431da8cbfd0087e4e7e6b27f59e2f7
8a614bf022366c6396ef6fa003d0f34aa3ad200648d22e4ba6927fc0708dedce
ed2559dbe1bf62cdd1e035c0cd0fe992b9b71f38ecb101cbb45d0e2a035a52b2
f6ba2de1146c829d35872b7eeb49196b5bfa59250e5e4d07805b53ce0d5ee7ce


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=62b8a9e7f0bb7fc1fc81b2dcd9ceba2ba36ab9e25e03f08224b6946f9a1fc9d4

Assert condition: clone_overlap.count(clone)
Assert function: uint64_t SnapSet::get_clone_bytes(snapid_t) const

Sanitized backtrace:

    SnapSet::get_clone_bytes(snapid_t) const
    PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)
    PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)
    PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)
    OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)
    ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
    OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
    ShardedThreadPool::shardedthreadpool_worker(unsigned int)
    ShardedThreadPool::WorkThreadSharded::entry()
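For context, the assert fires because backfill statistics call SnapSet::get_clone_bytes() for a clone that has no entry in clone_overlap. A minimal, self-contained C++ sketch of that invariant (simplified stand-in types, not the actual Ceph implementation) looks like:

```cpp
// Simplified sketch of the invariant behind the failed assert: every clone
// being accounted must have an entry in clone_overlap. When backfill hits a
// clone whose overlap record is missing, the assert aborts the OSD.
// NOTE: snapid/interval_set below are hypothetical stand-ins, not Ceph types.
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using snapid = uint64_t;
// (offset, length) extents shared with the next newer clone
using interval_set = std::vector<std::pair<uint64_t, uint64_t>>;

struct SnapSetSketch {
  std::map<snapid, uint64_t> clone_size;        // total bytes per clone
  std::map<snapid, interval_set> clone_overlap; // shared extents per clone

  // Bytes uniquely owned by 'clone': its size minus the bytes it shares
  // with the next clone. Mirrors the shape of the real accounting.
  uint64_t get_clone_bytes(snapid clone) const {
    assert(clone_size.count(clone));
    uint64_t bytes = clone_size.at(clone);
    assert(clone_overlap.count(clone)); // <-- the assert that fires here
    for (const auto& extent : clone_overlap.at(clone))
      bytes -= extent.second;
    return bytes;
  }
};
```

A missing clone_overlap entry for an existing clone indicates inconsistent snapshot metadata, which is why the code asserts rather than silently treating the overlap as empty.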

Crash dump sample:
{
    "assert_condition": "clone_overlap.count(clone)",
    "assert_file": "osd/osd_types.cc",
    "assert_func": "uint64_t SnapSet::get_clone_bytes(snapid_t) const",
    "assert_line": 5783,
    "assert_msg": "osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fe44f446700 time 2022-07-26T23:38:26.703737+1200\nosd/osd_types.cc: 5783: FAILED ceph_assert(clone_overlap.count(clone))",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fe46c5c0140]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x564bdcba0e78]",
        "/usr/bin/ceph-osd(+0xac0fb9) [0x564bdcba0fb9]",
        "(SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x564bdce988b3]",
        "(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x564bdcd8f16e]",
        "(PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x17d5) [0x564bdcdf56d5]",
        "(PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf35) [0x564bdcdfb3f5]",
        "(OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x564bdcc6be05]",
        "(ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x564bdcedc019]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x564bdcc89367]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x564bdd3323da]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564bdd3349b0]",
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fe46c5b4ea7]",
        "clone()" 
    ],
    "ceph_version": "16.2.9",
    "crash_id": "2022-07-26T11:38:26.726934Z_7e80ef3c-e1c0-4957-9ca3-e4fc548200da",
    "entity_name": "osd.5dd74b9e9d0b76a491b6a1cc86dffd15caf7242f",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-osd",
    "stack_sig": "024700b37b3bc297e5ee455cf52e654a74ae210b2b22373907a02cc03ead01dc",
    "timestamp": "2022-07-26T11:38:26.726934Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.39-1-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)" 
}


Related issues 1 (0 open, 1 closed)

Has duplicate: RADOS - Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed (Duplicate)

Actions #1

Updated by Telemetry Bot over 1 year ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v15.2.0, v15.2.11, v15.2.12, v15.2.15, v15.2.5, v15.2.8, v15.2.9, v16.2.1, v16.2.7, v16.2.9 added
Actions #2

Updated by Radoslaw Zarzynski over 1 year ago

  • Has duplicate Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed added
Actions #3

Updated by Thomas Le Gentil over 1 year ago

This bug is present in v17.2.5

Actions #4

Updated by Thomas Le Gentil over 1 year ago

Hi, would it be possible to raise the priority of this bug to High (as well as #57940)? It prevents the incomplete PG from recovering and makes the pool unavailable.

Thanks

Actions #5

Updated by Huy Nguyen about 1 year ago

Hi, the OSD crashes whenever it tries to backfill to the target OSDs. If the situation persists, it may cause data loss.
For now I have to backfill manually with ceph-objectstore-tool, but I expect the issue to reappear.

So can this case be raised to high priority?

Thanks

Actions #6

Updated by Telemetry Bot 12 months ago

  • Affected Versions v16.2.11, v17.2.1, v17.2.4, v17.2.5 added
Actions #7

Updated by Achim Ledermüller 9 months ago

Hi,

we have the same issue with version 14.2.22. Please raise the issue to a higher priority!

@Huy Nguyen

Can you please describe how you did the manual backfill? Was it just a plain export/import with ceph-objectstore-tool, like:

ceph-objectstore-tool --op export --pgid <id> --data-path <path> --journal-path <path> --file <id>.export

Can I import the PG into any OSD in the `up` set of the PG, or should I use a new OSD with weight 0?

Kind regards,
Achim

