
Bug #47299

Assertion in pg_missing_set: p->second.need <= v || p->second.is_delete()

Added by Denis Krienbühl 16 days ago. Updated 14 days ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature:

Description

Some of our OSDs will sometimes crash with the following message:

{
    "crash_id": "2020-09-04T04:16:06.718363Z_78670637-4fad-494f-abea-2afd6e64a970",
    "timestamp": "2020-09-04T04:16:06.718363Z",
    "process_name": "ceph-osd",
    "entity_name": "osd.166",
    "ceph_version": "15.2.4",
    "utsname_hostname": "prod-nvme1-c-rma1",
    "utsname_sysname": "Linux",
    "utsname_release": "4.15.0-72-generic",
    "utsname_version": "#81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019",
    "utsname_machine": "x86_64",
    "os_name": "Ubuntu",
    "os_id": "ubuntu",
    "os_version_id": "18.04",
    "os_version": "18.04.3 LTS (Bionic Beaver)",
    "assert_condition": "p->second.need <= v || p->second.is_delete()",
    "assert_func": "void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]",
    "assert_file": "/build/ceph-15.2.4/src/osd/osd_types.h",
    "assert_line": 4774,
    "assert_thread_name": "tp_osd_tp",
    "assert_msg": "/build/ceph-15.2.4/src/osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7fd8cfcd8700 time 2020-09-04T06:16:06.699081+0200\n/build/ceph-15.2.4/src/osd/osd_types.h: 4774: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())\n",
    "backtrace": [
        "(()+0x12890) [0x7fd8f7255890]",
        "(gsignal()+0xc7) [0x7fd8f5f07e97]",
        "(abort()+0x141) [0x7fd8f5f09801]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x560dc2d9a7b5]",
        "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x560dc2d9a93f]",
        "(()+0xc3d9c8) [0x560dc307b9c8]",
        "(ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x591) [0x560dc3159f31]",
        "(ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0xfa) [0x560dc315a67a]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x177) [0x560dc315c737]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x560dc2fe2867]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6fd) [0x560dc2f853fd]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x560dc2e09aab]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x67) [0x560dc3066d07]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90c) [0x560dc2e2751c]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x560dc34713ec]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560dc3474640]",
        "(()+0x76db) [0x7fd8f724a6db]",
        "(clone()+0x3f) [0x7fd8f5fea88f]" 
    ]
}

Before and after the crash, we'll see a stream of errors like this being spammed into the logs:

1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd8d6ce6700' had timed out after 15

At this point we recreate the OSD, which solves the problem.
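For context, the failed assertion enforces that when recovery reports an object as "got" at version v, the version the missing set still records as needed must not exceed v (unless the entry is a delete). The following is a minimal, hypothetical model of that invariant — simplified names and types, not the actual Ceph `pg_missing_set` source:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of the invariant behind the failed assertion:
// p->second.need <= v || p->second.is_delete()
struct item {
    int need = 0;          // version still needed for this object
    bool deleted = false;  // whether the missing entry is a delete
    bool is_delete() const { return deleted; }
};

struct missing_set {
    std::map<std::string, item> missing;

    // Marks an object as recovered at version v; mirrors the shape of
    // pg_missing_set::got(), where the crash's assertion fires. Here we
    // return false instead of aborting when the invariant is violated.
    bool got(const std::string& oid, int v) {
        auto p = missing.find(oid);
        if (p == missing.end())
            return false;                  // object was not missing
        if (!(p->second.need <= v || p->second.is_delete()))
            return false;                  // would trip the real assert
        missing.erase(p);                  // object is fully recovered
        return true;
    }
};
```

In the crash above, `ReplicatedBackend::handle_push_reply` apparently reached this check with a recovered version older than the recorded `need`, on a non-delete entry, which aborts the OSD.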

crash.log.gz (169 KB) Denis Krienbühl, 09/04/2020 06:57 AM

History

#1 Updated by Neha Ojha 15 days ago

  • Status changed from New to Need More Info

Is it possible for you to capture osd logs with debug_osd=30? We'll also try to reproduce this at our end.
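For anyone hitting the same crash: one way to raise that log level at runtime is via the standard Ceph CLI (using osd.166 from the crash report above as an example target):

```shell
# Raise the OSD debug level via the cluster config database:
ceph config set osd.166 debug_osd 30

# Or tell the running daemon directly:
ceph tell osd.166 config set debug_osd 30

# Revert once the logs have been captured:
ceph config rm osd.166 debug_osd
```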

#2 Updated by Denis Krienbühl 14 days ago

Neha Ojha wrote:

Is it possible for you to capture osd logs with debug_osd=30? We'll also try to reproduce this at our end.

Unfortunately we have already reset the OSD, as we've been suffering from a number of problems with our cluster and didn't have a chance to keep it in a broken state for investigation. We expect that we may run into further issues like this, though, and if circumstances permit we'll happily provide more output.
