Bug #47299

open

Assertion in pg_missing_set: p->second.need <= v || p->second.is_delete()

Added by Denis Krienbühl over 3 years ago. Updated almost 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):

97fd8dff92c029abb7aa00e77a6af85ca1b3d963876c082a4f5a579418498427
ea5d5202cbfd479dad466304faaf11b74609c65a810668ae789b64ec19b8be0d
f582692869a94580abf07e6695f97d0a75b4174a2d1727abee7acfb06a234e32
ffcf6fed9c89d0cc0b574d879c8d8d999500d074a062d6e1de03b68fd9eedfbe


Description

Some of our OSDs sometimes crash with the following message:

{
    "crash_id": "2020-09-04T04:16:06.718363Z_78670637-4fad-494f-abea-2afd6e64a970",
    "timestamp": "2020-09-04T04:16:06.718363Z",
    "process_name": "ceph-osd",
    "entity_name": "osd.166",
    "ceph_version": "15.2.4",
    "utsname_hostname": "prod-nvme1-c-rma1",
    "utsname_sysname": "Linux",
    "utsname_release": "4.15.0-72-generic",
    "utsname_version": "#81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019",
    "utsname_machine": "x86_64",
    "os_name": "Ubuntu",
    "os_id": "ubuntu",
    "os_version_id": "18.04",
    "os_version": "18.04.3 LTS (Bionic Beaver)",
    "assert_condition": "p->second.need <= v || p->second.is_delete()",
    "assert_func": "void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]",
    "assert_file": "/build/ceph-15.2.4/src/osd/osd_types.h",
    "assert_line": 4774,
    "assert_thread_name": "tp_osd_tp",
    "assert_msg": "/build/ceph-15.2.4/src/osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7fd8cfcd8700 time 2020-09-04T06:16:06.699081+0200\n/build/ceph-15.2.4/src/osd/osd_types.h: 4774: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())\n",
    "backtrace": [
        "(()+0x12890) [0x7fd8f7255890]",
        "(gsignal()+0xc7) [0x7fd8f5f07e97]",
        "(abort()+0x141) [0x7fd8f5f09801]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x560dc2d9a7b5]",
        "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x560dc2d9a93f]",
        "(()+0xc3d9c8) [0x560dc307b9c8]",
        "(ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x591) [0x560dc3159f31]",
        "(ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0xfa) [0x560dc315a67a]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x177) [0x560dc315c737]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x560dc2fe2867]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6fd) [0x560dc2f853fd]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x560dc2e09aab]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x67) [0x560dc3066d07]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90c) [0x560dc2e2751c]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x560dc34713ec]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560dc3474640]",
        "(()+0x76db) [0x7fd8f724a6db]",
        "(clone()+0x3f) [0x7fd8f5fea88f]" 
    ]
}

Before and after the crash, we see a lot of errors like this being spammed into the logs:

1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd8d6ce6700' had timed out after 15

At this point we recreate the OSD, which solves the problem.
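
For readers unfamiliar with the failed condition, here is a minimal sketch of what the assertion `p->second.need <= v || p->second.is_delete()` appears to check. It is a toy model, not Ceph's code: the classes `Version`, `MissingItem`, and `MissingSet` are hypothetical stand-ins, and it assumes that `v` is the version at which the object was recovered and that versions are ordered by epoch first, then by version number.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """Toy stand-in for eversion_t (assumed ordering: epoch, then version)."""
    epoch: int
    version: int


@dataclass
class MissingItem:
    """Toy stand-in for one entry of the missing set."""
    need: Version            # version that still has to be recovered
    is_delete: bool = False  # True if the pending change is a deletion


class MissingSet:
    """Hypothetical, simplified model of a missing set (not Ceph's class)."""

    def __init__(self):
        self.missing = {}

    def add(self, oid, need, is_delete=False):
        self.missing[oid] = MissingItem(need, is_delete)

    def got(self, oid, v):
        """Record that object `oid` was recovered at version `v`."""
        item = self.missing[oid]
        # Same condition as the failed assertion in the crash report:
        # the recovered version must not be older than the needed one,
        # unless the missing entry is a delete.
        if not (item.need <= v or item.is_delete):
            raise AssertionError(
                f"{oid}: recovered {v} is older than needed {item.need}"
            )
        del self.missing[oid]


if __name__ == "__main__":
    ms = MissingSet()
    ms.add("obj_a", need=Version(10, 5))
    ms.got("obj_a", Version(10, 5))      # fine: recovered version == need

    ms.add("obj_b", need=Version(10, 5))
    try:
        ms.got("obj_b", Version(10, 4))  # older than need: condition fails
    except AssertionError as exc:
        print("assertion failed:", exc)
```

In this model, the crash corresponds to the second case: a recovery acknowledgement arrives for a version older than the one the missing set still needs, and the entry is not a delete. Whether that is what happens on the affected OSDs is not established by the crash report alone.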


Files

crash.log.gz (169 KB), Denis Krienbühl, 09/04/2020 06:57 AM
crash2.txt (152 KB), Tobias Urdin, 05/03/2021 12:07 PM

Related issues 1 (0 open, 1 closed)

Has duplicate RADOS - Bug #52180: crash: void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]: assert(p->second.need <= v || p->second.is_delete()) (Duplicate)
