Bug #56772

crash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(clone))

Added by Telemetry Bot over 1 year ago. Updated 9 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Telemetry
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):

024700b37b3bc297e5ee455cf52e654a74ae210b2b22373907a02cc03ead01dc
06e6bbcc8c71a6aa76e8dab243a863e6c78e9514a984c55ec83e41d61a3992dd
42f9a44e4bf33699555adb6d0b49c1591a9863e12b85c55bbcc9f242cb2a22c0
60725542a1c89bcbbef2c4064b394e12a97f163a26965da4bd3f6bb9c8c6cf61
6498d1dfa7bd41871364749773d98ed09954154af5fb9f2c980d9a113a52cbb2
6c68454e46110849a860c677e714543dae8e2887aa732ee649f4fc377c06569d
8a3841a95de3adc1f0f138d0a8e924e408431da8cbfd0087e4e7e6b27f59e2f7
8a614bf022366c6396ef6fa003d0f34aa3ad200648d22e4ba6927fc0708dedce
ed2559dbe1bf62cdd1e035c0cd0fe992b9b71f38ecb101cbb45d0e2a035a52b2
f6ba2de1146c829d35872b7eeb49196b5bfa59250e5e4d07805b53ce0d5ee7ce


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=62b8a9e7f0bb7fc1fc81b2dcd9ceba2ba36ab9e25e03f08224b6946f9a1fc9d4

Assert condition: clone_overlap.count(clone)
Assert function: uint64_t SnapSet::get_clone_bytes(snapid_t) const

Sanitized backtrace:

    SnapSet::get_clone_bytes(snapid_t) const
    PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)
    PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)
    PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)
    OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)
    ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
    OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
    ShardedThreadPool::shardedthreadpool_worker(unsigned int)
    ShardedThreadPool::WorkThreadSharded::entry()
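For context, the assert fires because backfill statistics call SnapSet::get_clone_bytes() for a clone that has no entry in clone_overlap. A minimal, self-contained C++ sketch of that invariant (simplified stand-in types, not the actual Ceph implementation) looks like:

```cpp
// Simplified sketch of the invariant behind the failed assert: every clone
// being accounted must have an entry in clone_overlap. When backfill hits a
// clone whose overlap record is missing, the assert aborts the OSD.
// NOTE: snapid/interval_set below are hypothetical stand-ins, not Ceph types.
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using snapid = uint64_t;
// (offset, length) extents shared with the next newer clone
using interval_set = std::vector<std::pair<uint64_t, uint64_t>>;

struct SnapSetSketch {
  std::map<snapid, uint64_t> clone_size;        // total bytes per clone
  std::map<snapid, interval_set> clone_overlap; // shared extents per clone

  // Bytes uniquely owned by 'clone': its size minus the bytes it shares
  // with the next clone. Mirrors the shape of the real accounting.
  uint64_t get_clone_bytes(snapid clone) const {
    assert(clone_size.count(clone));
    uint64_t bytes = clone_size.at(clone);
    assert(clone_overlap.count(clone)); // <-- the assert that fires here
    for (const auto& extent : clone_overlap.at(clone))
      bytes -= extent.second;
    return bytes;
  }
};
```

A missing clone_overlap entry for an existing clone indicates inconsistent snapshot metadata, which is why the code asserts rather than silently treating the overlap as empty.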

Crash dump sample:
{
    "assert_condition": "clone_overlap.count(clone)",
    "assert_file": "osd/osd_types.cc",
    "assert_func": "uint64_t SnapSet::get_clone_bytes(snapid_t) const",
    "assert_line": 5783,
    "assert_msg": "osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fe44f446700 time 2022-07-26T23:38:26.703737+1200\nosd/osd_types.cc: 5783: FAILED ceph_assert(clone_overlap.count(clone))",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fe46c5c0140]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x564bdcba0e78]",
        "/usr/bin/ceph-osd(+0xac0fb9) [0x564bdcba0fb9]",
        "(SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x564bdce988b3]",
        "(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x564bdcd8f16e]",
        "(PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x17d5) [0x564bdcdf56d5]",
        "(PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf35) [0x564bdcdfb3f5]",
        "(OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x564bdcc6be05]",
        "(ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x564bdcedc019]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x564bdcc89367]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x564bdd3323da]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564bdd3349b0]",
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fe46c5b4ea7]",
        "clone()" 
    ],
    "ceph_version": "16.2.9",
    "crash_id": "2022-07-26T11:38:26.726934Z_7e80ef3c-e1c0-4957-9ca3-e4fc548200da",
    "entity_name": "osd.5dd74b9e9d0b76a491b6a1cc86dffd15caf7242f",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-osd",
    "stack_sig": "024700b37b3bc297e5ee455cf52e654a74ae210b2b22373907a02cc03ead01dc",
    "timestamp": "2022-07-26T11:38:26.726934Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.39-1-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)" 
}


Related issues 1 (0 open, 1 closed)

Has duplicate: RADOS - Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed (Duplicate)

Actions #1

Updated by Telemetry Bot over 1 year ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v15.2.0, v15.2.11, v15.2.12, v15.2.15, v15.2.5, v15.2.8, v15.2.9, v16.2.1, v16.2.7, v16.2.9 added
Actions #2

Updated by Radoslaw Zarzynski over 1 year ago

  • Has duplicate Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed added
Actions #3

Updated by Thomas Le Gentil over 1 year ago

This bug is present in v17.2.5

Actions #4

Updated by Thomas Le Gentil over 1 year ago

Hi, would it be possible to raise the priority of this bug to High (as well as #57940)? It prevents the incomplete PG from recovering and makes the pool unavailable.

Thanks

Actions #5

Updated by Huy Nguyen about 1 year ago

Hi, the OSD crashes whenever it tries to backfill to the target OSDs. If the situation persists, it may cause data loss.
For now I have to backfill manually with ceph-objectstore-tool, but I expect the issue to reappear.

So can this case be raised to high priority?

Thanks

Actions #6

Updated by Telemetry Bot 12 months ago

  • Affected Versions v16.2.11, v17.2.1, v17.2.4, v17.2.5 added
Actions #7

Updated by Achim Ledermüller 9 months ago

Hi,

we have the same issue with version 14.2.22. Please raise the issue to a higher priority!

@Huy Nguyen

Can you please describe how you did the manual backfill? Was it just a plain export/import with ceph-objectstore-tool, like:

ceph-objectstore-tool --op export --pgid <id> --data-path <path> --journal-path <path> --file <id>.export

Can I import the PG into any OSD in the `up` set of the PG, or should I use a new OSD with weight 0?

Kind regards,
Achim

