Bug #52535

Updated by Sebastian Wagner over 2 years ago

The ceph-volume CI is failing because the monitor crashes after an OSD is destroyed.

 <pre><code class="text"> 
 Sep 08 08:59:55 mon0 ceph-mon[10306]: *** Caught signal (Aborted) ** 
 Sep 08 08:59:55 mon0 ceph-mon[10306]:    in thread 7ff579e98700 thread_name:ms_dispatch 
 Sep 08 08:59:55 mon0 ceph-mon[10306]: 2021-09-08T08:59:55.903+0000 7ff579e98700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7ff579e98700 time 2021-09-08T08:59:55.899203+0000 
 Sep 08 08:59:55 mon0 ceph-mon[10306]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds) 
 Sep 08 08:59:55 mon0 ceph-mon[10306]:    ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev) 
 </code></pre> 

 Ceph version: 
 <pre><code class="text"> 
 # ceph --version 
 ceph version 17.0.0-7502-gca906d0d (ca906d0d7a65c8a598d397b764dd262cce645fe3) quincy (dev) 
 </code></pre> 

 Ceph crash details: 
 <pre><code class="text"> 
 { 
     "crash_id": "2021-09-08T08:15:01.060487Z_b1737f83-054e-45f0-a339-0a5a5b6e835b", 
     "timestamp": "2021-09-08T08:15:01.060487Z", 
     "process_name": "ceph-mon", 
     "entity_name": "mon.mon0", 
     "ceph_version": "17.0.0-7502-gca906d0d", 
     "utsname_hostname": "mon0", 
     "utsname_sysname": "Linux", 
     "utsname_release": "4.18.0-240.1.1.el8_3.x86_64", 
     "utsname_version": "#1 SMP Thu Nov 19 17:20:08 UTC 2020", 
     "utsname_machine": "x86_64", 
     "os_name": "CentOS Linux", 
     "os_id": "centos", 
     "os_version_id": "8", 
     "os_version": "8", 
     "assert_condition": "num_down_in_osds <= num_in_osds", 
     "assert_func": "void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const", 
     "assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc", 
     "assert_line": 5686, 
     "assert_thread_name": "ms_dispatch", 
     "assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: In function 'void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const' thread 7f5b2c1b5700 time 2021-09-08T08:15:01.033797+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7502-gca906d0d/rpm/el8/BUILD/ceph-17.0.0-7502-gca906d0d/src/osd/OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds)\n", 
     "backtrace": [ 
         "/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]", 
         "gsignal()", 
         "abort()", 
         "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f5b39846c8c]", 
         "/usr/lib64/ceph/libceph-common.so.2(+0x284e4f) [0x7f5b39846e4f]", 
         "(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]", 
         "(OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x3d03) [0x55f87315b5f3]", 
         "(PaxosService::propose_pending()+0x21a) [0x55f8730d625a]", 
         "(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbd3) [0x55f8730d7273]", 
         "(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x27ac) [0x55f872f9df1c]", 
         "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xb22) [0x55f872fa2e82]", 
         "(Monitor::_ms_dispatch(Message*)+0x457) [0x55f872fa4117]", 
         "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55f872fd459c]", 
         "(DispatchQueue::entry()+0x14fa) [0x7f5b39ac4e5a]", 
         "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f5b39b7a161]", 
         "/lib64/libpthread.so.0(+0x814a) [0x7f5b3756d14a]", 
         "clone()" 
     ] 
 } 
 </code></pre> 
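
 For triage, the fields that matter in a report like the one above are the assertion, its location, and the first symbolized frames. A minimal sketch of pulling those out follows. It assumes the JSON was saved from something like `ceph crash info <crash-id>`. The helper name `summarize_crash` is hypothetical and the sample is a trimmed copy of the report above, not new data.

```python
import json


def summarize_crash(report_json: str, max_frames: int = 5) -> dict:
    """Extract the assertion details and the first symbolized frames.

    Hypothetical helper for illustration; not part of Ceph.
    """
    report = json.loads(report_json)
    # Keep only frames that carry a symbol, e.g. "(OSDMap::check_health(...)+0x3e5e) [0x...]".
    frames = [f for f in report.get("backtrace", []) if f.startswith("(")]
    return {
        "entity": report.get("entity_name"),
        "version": report.get("ceph_version"),
        "assert": report.get("assert_condition"),
        # Reduce the long build path to "<file>:<line>".
        "where": "{}:{}".format(
            report.get("assert_file", "").rsplit("/", 1)[-1],
            report.get("assert_line"),
        ),
        # Drop the leading "(" and everything from the "+0x<offset>" onward.
        "frames": [f.split("+0x", 1)[0].lstrip("(") for f in frames[:max_frames]],
    }


# Trimmed copy of the crash report above (build path shortened for the sample).
sample = json.dumps({
    "entity_name": "mon.mon0",
    "ceph_version": "17.0.0-7502-gca906d0d",
    "assert_condition": "num_down_in_osds <= num_in_osds",
    "assert_file": "src/osd/OSDMap.cc",
    "assert_line": 5686,
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f5b37577b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f5b39846c8c]",
        "(OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const+0x3e5e) [0x7f5b39cfd11e]",
        "(OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x3d03) [0x55f87315b5f3]",
    ],
})

summary = summarize_crash(sample)
for key, value in summary.items():
    print(key, "=", value)
```

 Unsymbolized frames (`gsignal()`, `abort()`, raw library offsets) are skipped on purpose, so the summary starts at `ceph::__ceph_assert_fail` and then shows the `OSDMap::check_health` call reached from `OSDMonitor::encode_pending`.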


 The monitor daemon cannot restart after that; it keeps crashing with the same assertion in a loop. 

 See the attachments for the full log. 

 https://jenkins.ceph.com/job/ceph-volume-prs-lvm-centos8-filestore-create/240/consoleFull 
