Bug #52535

monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)

Added by Guillaume Abrioux over 1 year ago. Updated 11 months ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seeing failures in the ceph-volume CI because the monitor crashes after an OSD gets destroyed.

Sep 08 08:59:55 mon0 ceph-mon[10306]: *** Caught signal (Aborted) **
Sep 08 08:59:55 mon0 ceph-mon[10306]:  in thread 7ff579e98700 thread_name:ms_dispatch
Sep 08 08:59:55 mon0 ceph-mon[10306]: 2021-09-08T08:59:55.903+0000 7ff579e98700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7ff579e98700 time 2021-09-08T08:59:55.899203+0000
Sep 08 08:59:55 mon0 ceph-mon[10306]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Sep 08 08:59:55 mon0 ceph-mon[10306]:  ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)

ceph version:

# ceph --version
ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)

ceph crash details:

{
    "crash_id": "2021-09-08T08:15:01.060487Z_b1737f83-054e-45f0-a339-0a5a5b6e835b",
    "timestamp": "2021-09-08T08:15:01.060487Z",
    "process_name": "ceph-mon",
    "entity_name": "mon.mon0",
    "ceph_version": "17.0.0-7502-gca906d0d",
    "utsname_hostname": "mon0",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-240.1.1.el8_3.x86_64",
    "utsname_version": "#1 SMP Thu Nov 19 17:20:08 UTC 2020",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8",
    "assert_condition": "num_down_in_osds <= num_in_osds",
    "assert_func": "void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc",
    "assert_line": 5686,
    "assert_thread_name": "ms_dispatch",
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f5b2c1b5700 time 2021-09-08T08:15:01.033797+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)\n",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f5b39846c8c]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x284e4f) [0x7f5b39846e4f]",
        "(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]",
        "(OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x3d03) [0x55f87315b5f3]",
        "(PaxosService::propose_pending()+0x21a) [0x55f8730d625a]",
        "(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbd3) [0x55f8730d7273]",
        "(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x27ac) [0x55f872f9df1c]",
        "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xb22) [0x55f872fa2e82]",
        "(Monitor::_ms_dispatch(Message*)+0x457) [0x55f872fa4117]",
        "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55f872fd459c]",
        "(DispatchQueue::entry()+0x14fa) [0x7f5b39ac4e5a]",
        "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f5b39b7a161]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f5b3756d14a]",
        "clone()" 
    ]
}

The monitor daemon can't restart after that; it keeps crashing in the same loop.

See attachments for the full log.

https://jenkins.ceph.com/job/ceph-volume-prs-lvm-centos8-filestore-create/240/consoleFull
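
For context, the assertion that fails here encodes a simple invariant of OSDMap::check_health(): every OSD counted as "down and in" must also be counted as "in", so num_down_in_osds can never legitimately exceed num_in_osds. The sketch below is a minimal, self-contained C++ illustration of that relationship only; the OsdState struct, its field names, and check_health_sketch() are hypothetical and are not Ceph's actual data structures or code.

#include <cassert>
#include <vector>

// Hypothetical per-OSD state; Ceph's real OSDMap represents this differently.
struct OsdState {
  bool exists = false;      // slot is allocated (OSD not destroyed/purged)
  bool up = false;          // daemon is running and reachable
  bool in_cluster = false;  // OSD participates in data placement
};

// Sketch of the counting relationship behind the failed assert.
void check_health_sketch(const std::vector<OsdState>& osds) {
  int num_in_osds = 0;
  int num_down_in_osds = 0;

  for (const auto& osd : osds) {
    if (!osd.exists || !osd.in_cluster)
      continue;               // destroyed or "out" OSDs are not counted at all
    ++num_in_osds;
    if (!osd.up)
      ++num_down_in_osds;     // "in" but not "up" => down and in
  }

  // Counted in a single pass like this, the invariant holds by construction.
  assert(num_down_in_osds <= num_in_osds);
}

Counted this way the assert cannot fire, which is why the crash suggests the monitor derived these counts from OSDMap state that became inconsistent after the OSD was destroyed (the same assertion as in the related Bug #19989).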

log (60.6 KB) Guillaume Abrioux, 09/08/2021 08:56 AM

meta (2.93 KB) Guillaume Abrioux, 09/08/2021 08:57 AM


Related issues

Related to Ceph - Bug #19989: "OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run (Resolved, 05/19/2017)

History

#1 Updated by Sebastian Wagner over 1 year ago

  • Related to Bug #19989: "OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run added

#2 Updated by Sebastian Wagner over 1 year ago

  • Description updated (diff)

#3 Updated by Sebastian Wagner over 1 year ago

  • Priority changed from Normal to High

Increasing priority, as this has been happening pretty often in the ceph-volume Jenkins jobs recently.

#4 Updated by Sebastian Wagner over 1 year ago

  • Subject changed from monitor crashes after an OSD got destroyed to monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)

#5 Updated by Neha Ojha over 1 year ago

The attached log has sha1 ca906d0d7a65c8a598d397b764dd262cce645fe3. Is this the first time you encountered this issue?
Has this test always been there? I'm just trying to understand whether a new Ceph change broke it.
Is it possible to raise the debug levels for the tests that reproduce this issue?

#6 Updated by Neha Ojha about 1 year ago

  • Status changed from New to Need More Info
  • Priority changed from High to Normal

#7 Updated by Sebastian Mazza 11 months ago

I faced the same problem with ceph version 16.2.6. It occurred after shutting down all 3 physical servers of the cluster. After booting the physical servers again, 2 out of 3 monitors crashed every few seconds. Every server has 3 OSDs: one NVMe-based and two HDD-based. A reboot of all 3 physical servers did not solve the problem; 2 of the 3 monitors were still crashing permanently after the reboot. However, stopping every HDD-based OSD and one of the NVMe-based OSDs "solved" the problem. The problem did not come back after starting the OSDs again, one after the other.

Example from Syslog:


Mar  9 00:44:31 dionysos ceph-mon[11308]:     -1> 2022-03-09T00:44:31.173+0100 7f35ad9e1700 -1 ./src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f35ad9e1700 time 2022-03-09T00:44:31.174227+0100
Mar  9 00:44:31 dionysos ceph-mon[11308]: ./src/osd/OSDMap.cc: 5696: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Mar  9 00:44:31 dionysos ceph-mon[11308]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Mar  9 00:44:31 dionysos ceph-mon[11308]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7f35b43921b4]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  2: /usr/lib/ceph/libceph-common.so.2(+0x24f33f) [0x7f35b439233f]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  3: (OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x40b4) [0x7f35b47e1924]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  4: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x2cb4) [0x56204cf9b8e4]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  5: (PaxosService::propose_pending()+0x15f) [0x56204cf0dc9f]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  6: (Context::complete(int)+0x9) [0x56204cdf3c29]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  7: (SafeTimer::timer_thread()+0x17a) [0x7f35b44733ca]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  8: (SafeTimerThread::entry()+0xd) [0x7f35b44748ad]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f35b3e71ea7]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  10: clone()
Mar  9 00:44:31 dionysos ceph-mon[11308]:      0> 2022-03-09T00:44:31.177+0100 7f35ad9e1700 -1 *** Caught signal (Aborted) **
Mar  9 00:44:31 dionysos ceph-mon[11308]:  in thread 7f35ad9e1700 thread_name:safe_timer
Mar  9 00:44:31 dionysos ceph-mon[11308]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Mar  9 00:44:31 dionysos ceph-mon[11308]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f35b3e7d140]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  2: gsignal()
Mar  9 00:44:31 dionysos ceph-mon[11308]:  3: abort()
Mar  9 00:44:31 dionysos ceph-mon[11308]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x7f35b43921fe]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  5: /usr/lib/ceph/libceph-common.so.2(+0x24f33f) [0x7f35b439233f]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  6: (OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x40b4) [0x7f35b47e1924]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  7: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x2cb4) [0x56204cf9b8e4]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  8: (PaxosService::propose_pending()+0x15f) [0x56204cf0dc9f]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  9: (Context::complete(int)+0x9) [0x56204cdf3c29]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  10: (SafeTimer::timer_thread()+0x17a) [0x7f35b44733ca]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  11: (SafeTimerThread::entry()+0xd) [0x7f35b44748ad]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f35b3e71ea7]
Mar  9 00:44:31 dionysos ceph-mon[11308]:  13: clone()
Mar  9 00:44:31 dionysos ceph-mon[11308]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar  9 00:44:31 dionysos systemd[1]: ceph-mon@dionysos.service: Main process exited, code=killed, status=6/ABRT
Mar  9 00:44:31 dionysos systemd[1]: ceph-mon@dionysos.service: Failed with result 'signal'.

#8 Updated by Radoslaw Zarzynski 11 months ago

Hello Sebastian!
Was there any change to the OSD count? I mean OSD removal in particular.

#10 Updated by Radoslaw Zarzynski 11 months ago

  • Priority changed from Normal to High

#11 Updated by Sebastian Mazza 11 months ago

Hello Radoslaw,
thank you for your response!

About two weeks ago I first removed and then added 6 OSDs. I have not added or removed an OSD within the last 10 days. Since I added the last OSD 10 days ago, I have done at least 15 reboots of all physical servers without a problem.
