Bug #52535
monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Description
Seeing failures in ceph-volume CI because the monitor crashes after an OSD gets destroyed.
Sep 08 08:59:55 mon0 ceph-mon[10306]: *** Caught signal (Aborted) **
Sep 08 08:59:55 mon0 ceph-mon[10306]: in thread 7ff579e98700 thread_name:ms_dispatch
Sep 08 08:59:55 mon0 ceph-mon[10306]: 2021-09-08T08:59:55.903+0000 7ff579e98700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7ff579e98700 time 2021-09-08T08:59:55.899203+0000
Sep 08 08:59:55 mon0 ceph-mon[10306]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Sep 08 08:59:55 mon0 ceph-mon[10306]: ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)
ceph version:
# ceph --version
ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)
ceph crash details:
{
"crash_id": "2021-09-08T08:15:01.060487Z_b1737f83-054e-45f0-a339-0a5a5b6e835b",
"timestamp": "2021-09-08T08:15:01.060487Z",
"process_name": "ceph-mon",
"entity_name": "mon.mon0",
"ceph_version": "17.0.0-7502-gca906d0d",
"utsname_hostname": "mon0",
"utsname_sysname": "Linux",
"utsname_release": "4.18.0-240.1.1.el8_3.x86_64",
"utsname_version": "#1 SMP Thu Nov 19 17:20:08 UTC 2020",
"utsname_machine": "x86_64",
"os_name": "CentOS Linux",
"os_id": "centos",
"os_version_id": "8",
"os_version": "8",
"assert_condition": "num_down_in_osds <= num_in_osds",
"assert_func": "void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const",
"assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc",
"assert_line": 5686,
"assert_thread_name": "ms_dispatch",
"assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f5b2c1b5700 time 2021-09-08T08:15:01.033797+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)\n",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f5b39846c8c]",
"/usr/lib64/ceph/libceph-common.so.2(+0x284e4f) [0x7f5b39846e4f]",
"(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]",
"(OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x3d03) [0x55f87315b5f3]",
"(PaxosService::propose_pending()+0x21a) [0x55f8730d625a]",
"(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbd3) [0x55f8730d7273]",
"(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x27ac) [0x55f872f9df1c]",
"(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xb22) [0x55f872fa2e82]",
"(Monitor::_ms_dispatch(Message*)+0x457) [0x55f872fa4117]",
"(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55f872fd459c]",
"(DispatchQueue::entry()+0x14fa) [0x7f5b39ac4e5a]",
"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f5b39b7a161]",
"/lib64/libpthread.so.0(+0x814a) [0x7f5b3756d14a]",
"clone()"
]
}
The monitor daemon can't restart after that; it keeps crashing like this in a loop.
see attachments for full log
https://jenkins.ceph.com/job/ceph-volume-prs-lvm-centos8-filestore-create/240/consoleFull
Updated by Sebastian Wagner over 2 years ago
- Related to Bug #19989: "OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run added
Updated by Sebastian Wagner over 2 years ago
- Priority changed from Normal to High
Increasing priority, as this happens pretty often in the ceph-volume jenkins jobs recently
Updated by Sebastian Wagner over 2 years ago
- Subject changed from monitor crashes after an OSD got destroyed to monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Updated by Neha Ojha over 2 years ago
The attached log has sha1 ca906d0d7a65c8a598d397b764dd262cce645fe3. Is this the first time you encountered this issue?
Has this test always been there? Just trying to understand if there was a new ceph change that broke it.
Is it possible to raise debug levels for the tests that reproduce this issue?
Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
- Priority changed from High to Normal
Updated by Sebastian Mazza about 2 years ago
I faced the same problem with ceph version 16.2.6. It occurred after shutting down all 3 physical servers of the cluster. After booting the physical servers again, 2 out of 3 monitors crashed every few seconds. Every server has 3 OSDs: one NVMe-based and two HDD-based. A reboot of all 3 physical servers did not solve the problem, so 2 of the 3 monitors kept crashing even after that reboot. However, stopping every HDD-based OSD and one of the NVMe-based OSDs "solved" the problem. The problem did not come back after starting the OSDs one after the other.
Example from Syslog:
Mar 9 00:44:31 dionysos ceph-mon[11308]: -1> 2022-03-09T00:44:31.173+0100 7f35ad9e1700 -1 ./src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f35ad9e1700 time 2022-03-09T00:44:31.174227+0100
Mar 9 00:44:31 dionysos ceph-mon[11308]: ./src/osd/OSDMap.cc: 5696: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Mar 9 00:44:31 dionysos ceph-mon[11308]: ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Mar 9 00:44:31 dionysos ceph-mon[11308]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7f35b43921b4]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 2: /usr/lib/ceph/libceph-common.so.2(+0x24f33f) [0x7f35b439233f]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 3: (OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x40b4) [0x7f35b47e1924]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 4: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x2cb4) [0x56204cf9b8e4]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 5: (PaxosService::propose_pending()+0x15f) [0x56204cf0dc9f]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 6: (Context::complete(int)+0x9) [0x56204cdf3c29]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 7: (SafeTimer::timer_thread()+0x17a) [0x7f35b44733ca]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 8: (SafeTimerThread::entry()+0xd) [0x7f35b44748ad]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f35b3e71ea7]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 10: clone()
Mar 9 00:44:31 dionysos ceph-mon[11308]: 0> 2022-03-09T00:44:31.177+0100 7f35ad9e1700 -1 *** Caught signal (Aborted) **
Mar 9 00:44:31 dionysos ceph-mon[11308]: in thread 7f35ad9e1700 thread_name:safe_timer
Mar 9 00:44:31 dionysos ceph-mon[11308]: ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Mar 9 00:44:31 dionysos ceph-mon[11308]: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f35b3e7d140]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 2: gsignal()
Mar 9 00:44:31 dionysos ceph-mon[11308]: 3: abort()
Mar 9 00:44:31 dionysos ceph-mon[11308]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x7f35b43921fe]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 5: /usr/lib/ceph/libceph-common.so.2(+0x24f33f) [0x7f35b439233f]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 6: (OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x40b4) [0x7f35b47e1924]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 7: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x2cb4) [0x56204cf9b8e4]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 8: (PaxosService::propose_pending()+0x15f) [0x56204cf0dc9f]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 9: (Context::complete(int)+0x9) [0x56204cdf3c29]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 10: (SafeTimer::timer_thread()+0x17a) [0x7f35b44733ca]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 11: (SafeTimerThread::entry()+0xd) [0x7f35b44748ad]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f35b3e71ea7]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 13: clone()
Mar 9 00:44:31 dionysos ceph-mon[11308]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 9 00:44:31 dionysos systemd[1]: ceph-mon@dionysos.service: Main process exited, code=killed, status=6/ABRT
Mar 9 00:44:31 dionysos systemd[1]: ceph-mon@dionysos.service: Failed with result 'signal'.
Updated by Radoslaw Zarzynski about 2 years ago
Hello Sebastian!
Was there any change about the OSD count? I mean particularly OSD removal.
Updated by Radoslaw Zarzynski about 2 years ago
Neha has made an interesting observation about the occurrences among different versions.
Perhaps this issue got introduced to pacific around 16.2.5?
Updated by Radoslaw Zarzynski about 2 years ago
- Priority changed from Normal to High
Updated by Sebastian Mazza about 2 years ago
Hello Radoslaw,
thank you for your response!
About two weeks ago I first removed and then added 6 OSDs. I did not add or remove an OSD within the last 10 days. After I added the last OSD 10 days ago, I did at least 15 reboots of all physical servers without a problem.
Updated by Jan Horacek 3 months ago
We had a similar issue and noticed that we had lots of OSDs marked with status "new".
After a couple of tests, once we got rid of the "new" status (by marking every OSD out and back in, with norebalance/nobackfill set during this operation), it looks like the problem went away.
Could I ask you guys to check `ceph osd status` for the OSDs?