Bug #52535
Updated by Sebastian Wagner over 2 years ago
We are seeing failures in the ceph-volume CI because the monitor crashes after an OSD gets destroyed:
<pre><code class="text">
Sep 08 08:59:55 mon0 ceph-mon[10306]: *** Caught signal (Aborted) **
Sep 08 08:59:55 mon0 ceph-mon[10306]: in thread 7ff579e98700 thread_name:ms_dispatch
Sep 08 08:59:55 mon0 ceph-mon[10306]: 2021-09-08T08:59:55.903+0000 7ff579e98700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7ff579e98700 time 2021-09-08T08:59:55.899203+0000
Sep 08 08:59:55 mon0 ceph-mon[10306]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)
Sep 08 08:59:55 mon0 ceph-mon[10306]: ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)
</code></pre>
Ceph version:
<pre><code class="text">
# ceph --version
ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev)
</code></pre>
Ceph crash details:
<pre><code class="text">
{
"crash_id": "2021-09-08T08:15:01.060487Z_b1737f83-054e-45f0-a339-0a5a5b6e835b",
"timestamp": "2021-09-08T08:15:01.060487Z",
"process_name": "ceph-mon",
"entity_name": "mon.mon0",
"ceph_version": "17.0.0-7502-gca906d0d",
"utsname_hostname": "mon0",
"utsname_sysname": "Linux",
"utsname_release": "4.18.0-240.1.1.el8_3.x86_64",
"utsname_version": "#1 SMP Thu Nov 19 17:20:08 UTC 2020",
"utsname_machine": "x86_64",
"os_name": "CentOS Linux",
"os_id": "centos",
"os_version_id": "8",
"os_version": "8",
"assert_condition": "num_down_in_osds <= num_in_osds",
"assert_func": "void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const",
"assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc",
"assert_line": 5686,
"assert_thread_name": "ms_dispatch",
"assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f5b2c1b5700 time 2021-09-08T08:15:01.033797+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)\n",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f5b39846c8c]",
"/usr/lib64/ceph/libceph-common.so.2(+0x284e4f) [0x7f5b39846e4f]",
"(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]",
"(OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x3d03) [0x55f87315b5f3]",
"(PaxosService::propose_pending()+0x21a) [0x55f8730d625a]",
"(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbd3) [0x55f8730d7273]",
"(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x27ac) [0x55f872f9df1c]",
"(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xb22) [0x55f872fa2e82]",
"(Monitor::_ms_dispatch(Message*)+0x457) [0x55f872fa4117]",
"(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55f872fd459c]",
"(DispatchQueue::entry()+0x14fa) [0x7f5b39ac4e5a]",
"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f5b39b7a161]",
"/lib64/libpthread.so.0(+0x814a) [0x7f5b3756d14a]",
"clone()"
]
}
</code></pre>
The monitor daemon cannot restart after that; it keeps hitting the same assert and crashing in a loop.
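For illustration, the invariant that the failed assert enforces can be sketched as follows. This is not the Ceph source, just a minimal Python model under the assumption that each OSD carries "up" and "in" flags: every OSD counted as "down and in" must also be counted as "in", so <code>num_down_in_osds</code> can never legitimately exceed <code>num_in_osds</code>.

```python
def count_osd_states(osds):
    """Hypothetical model of the counters behind the failed assert.

    osds: list of dicts with boolean 'up' and 'in' flags (illustrative,
    not the real OSDMap representation).
    """
    num_in_osds = sum(1 for o in osds if o["in"])
    num_down_in_osds = sum(1 for o in osds if o["in"] and not o["up"])
    # Mirrors ceph_assert(num_down_in_osds <= num_in_osds): this can only
    # fail if the two counters are derived from inconsistent map state.
    assert num_down_in_osds <= num_in_osds
    return num_in_osds, num_down_in_osds


# Example: two OSDs in (one of them down), one OSD out.
print(count_osd_states([
    {"up": True, "in": True},
    {"up": False, "in": True},
    {"up": False, "in": False},
]))  # (2, 1)
```

If the assert trips, the two counters must have been computed from inconsistent state, which suggests the <code>osd destroy</code> path leaves the pending OSDMap in a state where an OSD is counted as down+in without being counted as in.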
See the attachments for the full log.
https://jenkins.ceph.com/job/ceph-volume-prs-lvm-centos8-filestore-create/240/consoleFull