Bug #52535

open

monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)

Added by Guillaume Abrioux over 2 years ago. Updated 3 months ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seeing failures in ceph-volume CI because the monitor crashes after an OSD gets destroyed.

Sep 08 08:59:55 mon0 ceph-mon[10306]: *** Caught signal (Aborted) **
Sep 08 08:59:55 mon0 ceph-mon[10306]:  in thread 7ff579e98700 thread_name:ms_dispatch
Sep 08 08:59:55 mon0 ceph-mon[10306]: 2021-09-08T08:59:55.903+0000 7ff579e98700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7ff579e98700 time 2021-09-08T08:59:55.899203+0000
Sep 08 08:59:55 mon0 ceph-mon[10306]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Sep 08 08:59:55 mon0 ceph-mon[10306]:  ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)
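
For readers unfamiliar with the failed check, here is a toy model of the invariant named in the assertion. It is not the Ceph implementation: the `OsdState` class, its fields, and `count_in_and_down_in` are hypothetical names made up for illustration. It only shows that when both counters come from a single pass over one consistent state table, the "down and in" OSDs are a subset of the "in" OSDs, so `num_down_in_osds <= num_in_osds` should always hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OsdState:
    """Hypothetical, simplified per-OSD state (not Ceph's data structure)."""
    exists: bool
    up: bool
    in_cluster: bool


def count_in_and_down_in(osds):
    """Return (num_in_osds, num_down_in_osds) from one pass over one state table."""
    num_in = 0
    num_down_in = 0
    for state in osds.values():
        # Only OSDs that exist and are marked "in" are counted at all.
        if not state.exists or not state.in_cluster:
            continue
        num_in += 1
        # A "down and in" OSD is by construction also counted as "in".
        if not state.up:
            num_down_in += 1
    return num_in, num_down_in


# Toy cluster: one healthy OSD, one down-but-in OSD, one id that no longer counts.
osds = {
    0: OsdState(exists=True, up=True, in_cluster=True),
    1: OsdState(exists=True, up=False, in_cluster=True),
    2: OsdState(exists=False, up=False, in_cluster=False),
}

num_in_osds, num_down_in_osds = count_in_and_down_in(osds)
assert num_down_in_osds <= num_in_osds  # the condition named in the crash
```

In this model the assertion cannot fail, which suggests (as an assumption, not a diagnosis) that the crash involves the two counters being derived from sources that disagree about the destroyed OSD.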

ceph version:

# ceph --version
ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)

ceph crash details:

{
    "crash_id": "2021-09-08T08:15:01.060487Z_b1737f83-054e-45f0-a339-0a5a5b6e835b",
    "timestamp": "2021-09-08T08:15:01.060487Z",
    "process_name": "ceph-mon",
    "entity_name": "mon.mon0",
    "ceph_version": "17.0.0-7502-gca906d0d",
    "utsname_hostname": "mon0",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-240.1.1.el8_3.x86_64",
    "utsname_version": "#1 SMP Thu Nov 19 17:20:08 UTC 2020",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8",
    "assert_condition": "num_down_in_osds <= num_in_osds",
    "assert_func": "void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc",
    "assert_line": 5686,
    "assert_thread_name": "ms_dispatch",
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f5b2c1b5700 time 2021-09-08T08:15:01.033797+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)\n",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f5b39846c8c]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x284e4f) [0x7f5b39846e4f]",
        "(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]",
        "(OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x3d03) [0x55f87315b5f3]",
        "(PaxosService::propose_pending()+0x21a) [0x55f8730d625a]",
        "(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbd3) [0x55f8730d7273]",
        "(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x27ac) [0x55f872f9df1c]",
        "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xb22) [0x55f872fa2e82]",
        "(Monitor::_ms_dispatch(Message*)+0x457) [0x55f872fa4117]",
        "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55f872fd459c]",
        "(DispatchQueue::entry()+0x14fa) [0x7f5b39ac4e5a]",
        "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f5b39b7a161]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f5b3756d14a]",
        "clone()" 
    ]
}
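
When comparing several crash reports of this shape, it can help to reduce each one to the assertion and the resolved call chain. The following is a minimal sketch using only the Python standard library; `summarize_crash` and `frame_symbol` are hypothetical helper names, and the sample input is a trimmed copy of a few fields from the report above.

```python
import json


def frame_symbol(frame):
    """Return the function name of a backtrace frame, or None if unresolved."""
    # Frames such as "/lib64/libpthread.so.0(+0x12b20) [...]" carry no symbol.
    if frame.startswith("/"):
        return None
    # Resolved frames look like "(Name(args)+0xOFF) [0xADDR]" or "gsignal()".
    name = frame.lstrip("(").split("(", 1)[0]
    return name or None


def summarize_crash(report_text):
    """Extract the assertion details and resolved symbols from a crash report."""
    report = json.loads(report_text)
    symbols = [frame_symbol(f) for f in report.get("backtrace", [])]
    return {
        "condition": report.get("assert_condition"),
        "line": report.get("assert_line"),
        "thread": report.get("assert_thread_name"),
        "symbols": [s for s in symbols if s is not None],
    }


# Trimmed sample built from fields that appear in the report above.
sample = json.dumps({
    "assert_condition": "num_down_in_osds <= num_in_osds",
    "assert_line": 5686,
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]",
        "gsignal()",
        "(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]",
        "(PaxosService::propose_pending()+0x21a) [0x55f8730d625a]",
    ],
})

summary = summarize_crash(sample)
print(summary["condition"], "at line", summary["line"])
print(" <- ".join(summary["symbols"]))
```

The symbol extraction is a plain string split rather than a real demangler, so it is only suitable for frames formatted like the ones in this report.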

The monitor daemon can't restart after that; it keeps crashing like this in a loop.

See attachments for the full log.

https://jenkins.ceph.com/job/ceph-volume-prs-lvm-centos8-filestore-create/240/consoleFull


Files

log (60.6 KB), Guillaume Abrioux, 09/08/2021 08:56 AM
meta (2.93 KB), Guillaume Abrioux, 09/08/2021 08:57 AM

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #19989: "OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run (Resolved)
