Bug #52535

open

monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)

Added by Guillaume Abrioux over 2 years ago. Updated 3 months ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seeing failures in the ceph-volume CI because the monitor crashes after an OSD gets destroyed.

Sep 08 08:59:55 mon0 ceph-mon[10306]: *** Caught signal (Aborted) **
Sep 08 08:59:55 mon0 ceph-mon[10306]:  in thread 7ff579e98700 thread_name:ms_dispatch
Sep 08 08:59:55 mon0 ceph-mon[10306]: 2021-09-08T08:59:55.903+0000 7ff579e98700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7ff579e98700 time 2021-09-08T08:59:55.899203+0000
Sep 08 08:59:55 mon0 ceph-mon[10306]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Sep 08 08:59:55 mon0 ceph-mon[10306]:  ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)

ceph version:

# ceph --version
ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)
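
For context, "destroying" an OSD here means the usual out/stop/destroy sequence; the exact steps are driven by the ceph-volume tests in the linked Jenkins job, but a typical sequence (osd.1 is only a placeholder id, and the systemctl call runs on the OSD host) looks roughly like this:

# ceph osd out 1
# systemctl stop ceph-osd@1
# ceph osd destroy 1 --yes-i-really-mean-it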

ceph crash details:

{
    "crash_id": "2021-09-08T08:15:01.060487Z_b1737f83-054e-45f0-a339-0a5a5b6e835b",
    "timestamp": "2021-09-08T08:15:01.060487Z",
    "process_name": "ceph-mon",
    "entity_name": "mon.mon0",
    "ceph_version": "17.0.0-7502-gca906d0d",
    "utsname_hostname": "mon0",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-240.1.1.el8_3.x86_64",
    "utsname_version": "#1 SMP Thu Nov 19 17:20:08 UTC 2020",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8",
    "assert_condition": "num_down_in_osds <= num_in_osds",
    "assert_func": "void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc",
    "assert_line": 5686,
    "assert_thread_name": "ms_dispatch",
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f5b2c1b5700 time 2021-09-08T08:15:01.033797+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)\n",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f5b39846c8c]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x284e4f) [0x7f5b39846e4f]",
        "(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]",
        "(OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x3d03) [0x55f87315b5f3]",
        "(PaxosService::propose_pending()+0x21a) [0x55f8730d625a]",
        "(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbd3) [0x55f8730d7273]",
        "(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x27ac) [0x55f872f9df1c]",
        "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xb22) [0x55f872fa2e82]",
        "(Monitor::_ms_dispatch(Message*)+0x457) [0x55f872fa4117]",
        "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55f872fd459c]",
        "(DispatchQueue::entry()+0x14fa) [0x7f5b39ac4e5a]",
        "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f5b39b7a161]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f5b3756d14a]",
        "clone()" 
    ]
}
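
For reference, output like the above can be retrieved from the cluster with the crash module, e.g.:

# ceph crash ls
# ceph crash info 2021-09-08T08:15:01.060487Z_b1737f83-054e-45f0-a339-0a5a5b6e835b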

The monitor daemon can't restart after that; it keeps crashing like this in a loop.

See attachments for the full log.

https://jenkins.ceph.com/job/ceph-volume-prs-lvm-centos8-filestore-create/240/consoleFull


Files

log (60.6 KB) - Guillaume Abrioux, 09/08/2021 08:56 AM
meta (2.93 KB) - Guillaume Abrioux, 09/08/2021 08:57 AM

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #19989: "OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run (Resolved)

Actions #1

Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #19989: "OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run added
Actions #2

Updated by Sebastian Wagner over 2 years ago

  • Description updated (diff)
Actions #3

Updated by Sebastian Wagner over 2 years ago

  • Priority changed from Normal to High

Increasing priority, as this has been happening pretty often in the ceph-volume Jenkins jobs recently.

Actions #4

Updated by Sebastian Wagner over 2 years ago

  • Subject changed from monitor crashes after an OSD got destroyed to monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Actions #5

Updated by Neha Ojha over 2 years ago

The attached log has sha1 ca906d0d7a65c8a598d397b764dd262cce645fe3; is this the first time you encountered this issue?
Has this test always been there? Just trying to understand whether a new Ceph change broke it.
Is it possible to raise debug levels for the tests that reproduce this issue?
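
For reference, monitor debug levels can be raised either through the centralized config database or at runtime; mon.mon0 is the monitor from this report, and the same options can also be set under [mon] in ceph.conf:

# ceph config set mon debug_mon 20
# ceph config set mon debug_ms 1
# ceph tell mon.mon0 injectargs '--debug_mon 20 --debug_ms 1'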

Actions #6

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info
  • Priority changed from High to Normal
Actions #7

Updated by Sebastian Mazza about 2 years ago

I faced the same problem with ceph version 16.2.6. It occurred after shutting down all 3 physical servers of the cluster. After booting the physical servers again, 2 out of 3 monitors crashed every few seconds. Every server has 3 OSDs, one NVMe-based and two HDD-based. A reboot of all 3 physical servers did not solve the problem; 2 of the 3 monitors kept crashing after that reboot as well. However, stopping every HDD-based OSD and one of the NVMe-based OSDs "solved" the problem, and it did not come back after starting the OSDs again one after the other.
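
A sketch of that workaround, assuming systemd-managed OSDs (the OSD ids are placeholders; setting noout is not mentioned above and is only a suggested precaution against data movement while the OSDs are stopped). The stop is repeated for every HDD-based OSD plus one NVMe-based OSD, and the start is repeated one OSD at a time:

# ceph osd set noout
# systemctl stop ceph-osd@<id>
# systemctl start ceph-osd@<id>
# ceph osd unset noout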

Example from Syslog:


Mar  9 00:44:31 dionysos ceph-mon[11308]:     -1> 2022-03-09T00:44:31.173+0100 7f35ad9e1700 -1 ./src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f35ad9e1700 time 2022-03-09T00:44:31.174227+0100
Mar  9 00:44:31 dionysos ceph-mon[11308]: ./src/osd/OSDMap.cc: 5696: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Mar  9 00:44:31 dionysos ceph-mon[11308]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Mar  9 00:44:31 dionysos ceph-mon[11308]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7f35b43921b4]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  2: /usr/lib/ceph/libceph-common.so.2(+0x24f33f) [0x7f35b439233f]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  3: (OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x40b4) [0x7f35b47e1924]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  4: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x2cb4) [0x56204cf9b8e4]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  5: (PaxosService::propose_pending()+0x15f) [0x56204cf0dc9f]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  6: (Context::complete(int)+0x9) [0x56204cdf3c29]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  7: (SafeTimer::timer_thread()+0x17a) [0x7f35b44733ca]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  8: (SafeTimerThread::entry()+0xd) [0x7f35b44748ad]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f35b3e71ea7]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  10: clone()
Mar  9 00:44:31 dionysos ceph-mon[11308]:      0> 2022-03-09T00:44:31.177+0100 7f35ad9e1700 -1 *** Caught signal (Aborted) **
Mar  9 00:44:31 dionysos ceph-mon[11308]:  in thread 7f35ad9e1700 thread_name:safe_timer
Mar  9 00:44:31 dionysos ceph-mon[11308]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Mar  9 00:44:31 dionysos ceph-mon[11308]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f35b3e7d140]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  2: gsignal()
Mar  9 00:44:31 dionysos ceph-mon[11308]:  3: abort()
Mar  9 00:44:31 dionysos ceph-mon[11308]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x7f35b43921fe]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  5: /usr/lib/ceph/libceph-common.so.2(+0x24f33f) [0x7f35b439233f]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  6: (OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x40b4) [0x7f35b47e1924]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  7: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x2cb4) [0x56204cf9b8e4]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  8: (PaxosService::propose_pending()+0x15f) [0x56204cf0dc9f]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  9: (Context::complete(int)+0x9) [0x56204cdf3c29]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  10: (SafeTimer::timer_thread()+0x17a) [0x7f35b44733ca]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  11: (SafeTimerThread::entry()+0xd) [0x7f35b44748ad]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f35b3e71ea7]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  13: clone()
Mar  9 00:44:31 dionysos ceph-mon[11308]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar  9 00:44:31 dionysos systemd[1]: ceph-mon@dionysos.service: Main process exited, code=killed, status=6/ABRT
Mar  9 00:44:31 dionysos systemd[1]: ceph-mon@dionysos.service: Failed with result 'signal'.

Actions #8

Updated by Radoslaw Zarzynski about 2 years ago

Hello Sebastian!
Was there any change to the OSD count? I am thinking particularly of OSD removal.

Actions #10

Updated by Radoslaw Zarzynski about 2 years ago

  • Priority changed from Normal to High
Actions #11

Updated by Sebastian Mazza about 2 years ago

Hello Radoslaw,
thank you for your response!

About two weeks ago I first removed and then added 6 OSDs. I have not added or removed an OSD within the last 10 days. After I added the last OSD 10 days ago, I did at least 15 reboots of all physical servers without a problem.

Actions #12

Updated by Jan Horacek 3 months ago

We had a similar issue and noticed that we had lots of OSDs marked with status "new".
After a couple of tests, getting rid of the "new" status (by marking every OSD out and then in again, with norebalance/nobackfill set during the operation; see the sketch below) seems to have gotten rid of the problem.

Could I ask you guys to check ceph osd status for the OSDs?
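
A sketch of the out/in cycle described above (the osd id is a placeholder and the out/in pair is repeated for every affected OSD; whether an OSD carries the "new" state should be visible in the per-OSD state flags of ceph osd dump):

# ceph osd dump | grep new
# ceph osd set norebalance
# ceph osd set nobackfill
# ceph osd out <id>
# ceph osd in <id>
# ceph osd unset nobackfill
# ceph osd unset norebalance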
