Bug #52535
monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Description
Seeing failures in ceph-volume CI because the monitor crashes after an OSD gets destroyed.
Sep 08 08:59:55 mon0 ceph-mon[10306]: *** Caught signal (Aborted) **
Sep 08 08:59:55 mon0 ceph-mon[10306]: in thread 7ff579e98700 thread_name:ms_dispatch
Sep 08 08:59:55 mon0 ceph-mon[10306]: 2021-09-08T08:59:55.903+0000 7ff579e98700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7ff579e98700 time 2021-09-08T08:59:55.899203+0000
Sep 08 08:59:55 mon0 ceph-mon[10306]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Sep 08 08:59:55 mon0 ceph-mon[10306]: ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)
ceph version:
# ceph --version
ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)
ceph crash details:
{
"crash_id": "2021-09-08T08:15:01.060487Z_b1737f83-054e-45f0-a339-0a5a5b6e835b",
"timestamp": "2021-09-08T08:15:01.060487Z",
"process_name": "ceph-mon",
"entity_name": "mon.mon0",
"ceph_version": "17.0.0-7502-gca906d0d",
"utsname_hostname": "mon0",
"utsname_sysname": "Linux",
"utsname_release": "4.18.0-240.1.1.el8_3.x86_64",
"utsname_version": "#1 SMP Thu Nov 19 17:20:08 UTC 2020",
"utsname_machine": "x86_64",
"os_name": "CentOS Linux",
"os_id": "centos",
"os_version_id": "8",
"os_version": "8",
"assert_condition": "num_down_in_osds <= num_in_osds",
"assert_func": "void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const",
"assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc",
"assert_line": 5686,
"assert_thread_name": "ms_dispatch",
"assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f5b2c1b5700 time 2021-09-08T08:15:01.033797+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)\n",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f5b39846c8c]",
"/usr/lib64/ceph/libceph-common.so.2(+0x284e4f) [0x7f5b39846e4f]",
"(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]",
"(OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x3d03) [0x55f87315b5f3]",
"(PaxosService::propose_pending()+0x21a) [0x55f8730d625a]",
"(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbd3) [0x55f8730d7273]",
"(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x27ac) [0x55f872f9df1c]",
"(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xb22) [0x55f872fa2e82]",
"(Monitor::_ms_dispatch(Message*)+0x457) [0x55f872fa4117]",
"(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55f872fd459c]",
"(DispatchQueue::entry()+0x14fa) [0x7f5b39ac4e5a]",
"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f5b39b7a161]",
"/lib64/libpthread.so.0(+0x814a) [0x7f5b3756d14a]",
"clone()"
]
}
The monitor daemon can't restart after that; it keeps crashing like this in a loop.
see attachments for full log
https://jenkins.ceph.com/job/ceph-volume-prs-lvm-centos8-filestore-create/240/consoleFull
Updated by Sebastian Wagner over 2 years ago
- Related to Bug #19989: "OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run added
Updated by Sebastian Wagner over 2 years ago
- Priority changed from Normal to High
Increasing priority, as this happens pretty often in the ceph-volume jenkins jobs recently
Updated by Sebastian Wagner over 2 years ago
- Subject changed from monitor crashes after an OSD got destroyed to monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Updated by Neha Ojha over 2 years ago
The attached log has sha1 ca906d0d7a65c8a598d397b764dd262cce645fe3. Is this the first time you encountered this issue?
Has this test always been there? Just trying to understand if there was a new ceph change that broke it.
Is it possible to raise debug levels for the tests that reproduce this issue?
Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
- Priority changed from High to Normal
Updated by Sebastian Mazza about 2 years ago
I faced the same problem with ceph version 16.2.6. It occurred after shutting down all 3 physical servers of the cluster. After booting the physical servers again, 2 out of 3 monitors crashed every few seconds. Every server has 3 OSDs: one NVMe-based and two HDD-based. A reboot of all 3 physical servers did not solve the problem, so 2 of the 3 monitors kept crashing even after that reboot. However, stopping every HDD-based OSD and one of the NVMe-based OSDs "solved" the problem. The problem did not come back after starting the OSDs one after the other.
Example from Syslog:
Mar 9 00:44:31 dionysos ceph-mon[11308]: -1> 2022-03-09T00:44:31.173+0100 7f35ad9e1700 -1 ./src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f35ad9e1700 time 2022-03-09T00:44:31.174227+0100
Mar 9 00:44:31 dionysos ceph-mon[11308]: ./src/osd/OSDMap.cc: 5696: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Mar 9 00:44:31 dionysos ceph-mon[11308]: ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Mar 9 00:44:31 dionysos ceph-mon[11308]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7f35b43921b4]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 2: /usr/lib/ceph/libceph-common.so.2(+0x24f33f) [0x7f35b439233f]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 3: (OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x40b4) [0x7f35b47e1924]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 4: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x2cb4) [0x56204cf9b8e4]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 5: (PaxosService::propose_pending()+0x15f) [0x56204cf0dc9f]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 6: (Context::complete(int)+0x9) [0x56204cdf3c29]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 7: (SafeTimer::timer_thread()+0x17a) [0x7f35b44733ca]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 8: (SafeTimerThread::entry()+0xd) [0x7f35b44748ad]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f35b3e71ea7]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 10: clone()
Mar 9 00:44:31 dionysos ceph-mon[11308]: 0> 2022-03-09T00:44:31.177+0100 7f35ad9e1700 -1 *** Caught signal (Aborted) **
Mar 9 00:44:31 dionysos ceph-mon[11308]: in thread 7f35ad9e1700 thread_name:safe_timer
Mar 9 00:44:31 dionysos ceph-mon[11308]: ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Mar 9 00:44:31 dionysos ceph-mon[11308]: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f35b3e7d140]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 2: gsignal()
Mar 9 00:44:31 dionysos ceph-mon[11308]: 3: abort()
Mar 9 00:44:31 dionysos ceph-mon[11308]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x7f35b43921fe]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 5: /usr/lib/ceph/libceph-common.so.2(+0x24f33f) [0x7f35b439233f]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 6: (OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x40b4) [0x7f35b47e1924]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 7: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x2cb4) [0x56204cf9b8e4]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 8: (PaxosService::propose_pending()+0x15f) [0x56204cf0dc9f]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 9: (Context::complete(int)+0x9) [0x56204cdf3c29]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 10: (SafeTimer::timer_thread()+0x17a) [0x7f35b44733ca]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 11: (SafeTimerThread::entry()+0xd) [0x7f35b44748ad]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f35b3e71ea7]
Mar 9 00:44:31 dionysos ceph-mon[11308]: 13: clone()
Mar 9 00:44:31 dionysos ceph-mon[11308]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 9 00:44:31 dionysos systemd[1]: ceph-mon@dionysos.service: Main process exited, code=killed, status=6/ABRT
Mar 9 00:44:31 dionysos systemd[1]: ceph-mon@dionysos.service: Failed with result 'signal'.
Updated by Radoslaw Zarzynski about 2 years ago
Hello Sebastian!
Was there any change about the OSD count? I mean particularly OSD removal.
Updated by Radoslaw Zarzynski about 2 years ago
Neha has made an interesting observation about the occurrences among different versions.
Perhaps this issue got introduced to pacific around 16.2.5?
Updated by Radoslaw Zarzynski about 2 years ago
- Priority changed from Normal to High
Updated by Sebastian Mazza about 2 years ago
Hello Radoslaw,
thank you for your response!
About two weeks ago I first removed and then added 6 OSDs. I did not add or remove an OSD within the last 10 days. After I added the last OSD 10 days ago, I did at least 15 reboots of all physical servers without a problem.
Updated by Jan Horacek 3 months ago
We had a similar issue and noticed that we had lots of OSDs marked with status "new".
After a couple of tests, once we got rid of the "new" status (by marking every OSD out and back in, with norebalance/nobackfill set during this operation), it looks like the problem went away.
Could I ask you guys to check `ceph osd status` for the OSDs?