Bug #19989
closed
"OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run
Description
Run: http://qa-proxy.ceph.com/teuthology/yuriw-2017-05-19_03:47:57-rados-wip-yuri-testing_2017_5_19---basic-smithi
Job: 1194864
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2017-05-19_03:47:57-rados-wip-yuri-testing_2017_5_19---basic-smithi/1194864/teuthology.log
2017-05-19T03:57:24.271 INFO:tasks.ceph.mon.a.smithi077.stderr:/build/ceph-12.0.2-1341-g406a26a/src/mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::get_health(std::list<std::pair<health_status_t, std::basic_string<char> > >&, std::list<std::pair<health_status_t, std::basic_string<char> > >*, CephContext*) const' thread 7fcb8da81700 time 2017-05-19 03:57:23.973763
2017-05-19T03:57:24.271 INFO:tasks.ceph.mon.a.smithi077.stderr:/build/ceph-12.0.2-1341-g406a26a/src/mon/OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: ceph version 12.0.2-1341-g406a26a (406a26a1c327a13df48890994379a5ebe7ccda97)
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x56528de789fe]
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: 2: (OSDMonitor::get_health(std::list<std::pair<health_status_t, std::string>, std::allocator<std::pair<health_status_t, std::string> > >&, std::list<std::pair<health_status_t, std::string>, std::allocator<std::pair<health_status_t, std::string> > >*, CephContext*) const+0x2831) [0x56528dcfd471]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 3: (Monitor::get_health(std::list<std::string, std::allocator<std::string> >&, ceph::buffer::list*, ceph::Formatter*)+0xca) [0x56528dc924da]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 4: (MgrMonitor::send_digests()+0x324) [0x56528dd9ab04]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 5: (C_MonContext::finish(int)+0x27) [0x56528dc7b9f7]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 6: (Context::complete(int)+0x9) [0x56528dcb46a9]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 7: (SafeTimer::timer_thread()+0xec) [0x56528de754bc]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 8: (SafeTimerThread::entry()+0xd) [0x56528de76e4d]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: 9: (()+0x8184) [0x7fcb92ee1184]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: 10: (clone()+0x6d) [0x7fcb917a2bed]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Sage Weil almost 7 years ago
This appears to be related to the code below, which assumes the OSD is in, which may not be true. The fix might be something like the following, but I didn't look at this very carefully:
diff --git a/src/osd/OSDMap.cc b/src/osd/OSDMap.cc
index 5503fb4..a89e82e 100644
--- a/src/osd/OSDMap.cc
+++ b/src/osd/OSDMap.cc
@@ -298,10 +298,12 @@ bool OSDMap::subtree_type_is_down(CephContext *cct, int id, int subtree_type, se
 {
   if (id >= 0) {
     bool is_down_ret = is_down(id);
-    if (is_down_ret) {
-      down_in_osds->insert(id);
-    } else {
-      up_in_osds->insert(id);
+    if (!is_out(id)) {
+      if (is_down_ret) {
+        down_in_osds->insert(id);
+      } else {
+        up_in_osds->insert(id);
+      }
     }
     return is_down_ret;
   }
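To illustrate why the guard in the diff above matters, here is a minimal standalone sketch. It is a toy model, not Ceph code: ToyOsd, ToyCounts, count_unguarded, count_guarded and assert_health_invariant are all hypothetical names invented for this example, and the real OSDMap and OSDMonitor logic is considerably more involved. The sketch only shows that recording every down OSD as "down and in" can push the down-in count past the in count, while skipping out OSDs keeps the asserted relation intact.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for per-OSD state; not Ceph's real OSDMap.
struct ToyOsd {
  bool up;  // false means the OSD is down
  bool in;  // false means the OSD is out
};

// Hypothetical container for the two counters named in the failed assert.
struct ToyCounts {
  std::size_t num_in_osds = 0;
  std::size_t num_down_in_osds = 0;
};

// Mirrors the unpatched behaviour: a down OSD is recorded as
// "down and in" without checking whether it is actually in.
ToyCounts count_unguarded(const std::vector<ToyOsd>& osds) {
  ToyCounts counts;
  std::set<int> down_in_osds;
  for (std::size_t id = 0; id < osds.size(); ++id) {
    if (osds[id].in) {
      ++counts.num_in_osds;
    }
    if (!osds[id].up) {
      down_in_osds.insert(static_cast<int>(id));
    }
  }
  counts.num_down_in_osds = down_in_osds.size();
  return counts;
}

// Mirrors the proposed guard: out OSDs are skipped before the
// up/down classification, so only in OSDs can be counted as down-in.
ToyCounts count_guarded(const std::vector<ToyOsd>& osds) {
  ToyCounts counts;
  std::set<int> down_in_osds;
  for (std::size_t id = 0; id < osds.size(); ++id) {
    if (!osds[id].in) {
      continue;  // corresponds to the added `if (!is_out(id))` check
    }
    ++counts.num_in_osds;
    if (!osds[id].up) {
      down_in_osds.insert(static_cast<int>(id));
    }
  }
  counts.num_down_in_osds = down_in_osds.size();
  return counts;
}

// The relation that the failed assert checks.
void assert_health_invariant(const ToyCounts& counts) {
  assert(counts.num_down_in_osds <= counts.num_in_osds);
}
```

With one up+in OSD, one down+in OSD and two down+out OSDs, the unguarded count reports three down-in OSDs against two in OSDs, which would trip the assert. The guarded count reports one against two.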
Updated by Sebastian Wagner over 2 years ago
- Related to Bug #52535: monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds) added
Updated by Jan Horacek 2 months ago
We had a similar issue and noticed that many of our OSDs were marked with the status "new".
After a couple of tests, once we had cleared the "new" status (by marking every OSD out and then in again, with norebalance/nobackfill set during the operation), the problem appears to be gone.
Could I ask you to check ceph osd status for the OSDs?
Updated by Radoslaw Zarzynski 2 months ago
- Tags set to medium-hanging-fruit
There is a starting point for the fix: https://tracker.ceph.com/issues/19989#note-1.
Tagging as medium-hanging-fruit.