Bug #19989

"OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run

Added by Yuri Weinstein almost 7 years ago. Updated 2 months ago.

Status: Resolved
Priority: Immediate
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags: medium-hanging-fruit
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://qa-proxy.ceph.com/teuthology/yuriw-2017-05-19_03:47:57-rados-wip-yuri-testing_2017_5_19---basic-smithi
Job: 1194864
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2017-05-19_03:47:57-rados-wip-yuri-testing_2017_5_19---basic-smithi/1194864/teuthology.log

2017-05-19T03:57:24.271 INFO:tasks.ceph.mon.a.smithi077.stderr:/build/ceph-12.0.2-1341-g406a26a/src/mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::get_health(std::list<std::pair<health_status_t, std::basic_string<char> > >&, std::list<std::pair<health_status_t, std::basic_string<char> > >*, CephContext*) const' thread 7fcb8da81700 time 2017-05-19 03:57:23.973763
2017-05-19T03:57:24.271 INFO:tasks.ceph.mon.a.smithi077.stderr:/build/ceph-12.0.2-1341-g406a26a/src/mon/OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: ceph version 12.0.2-1341-g406a26a (406a26a1c327a13df48890994379a5ebe7ccda97)
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x56528de789fe]
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: 2: (OSDMonitor::get_health(std::list<std::pair<health_status_t, std::string>, std::allocator<std::pair<health_status_t, std::string> > >&, std::list<std::pair<health_status_t, std::string>, std::allocator<std::pair<health_status_t, std::string> > >*, CephContext*) const+0x2831) [0x56528dcfd471]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 3: (Monitor::get_health(std::list<std::string, std::allocator<std::string> >&, ceph::buffer::list*, ceph::Formatter*)+0xca) [0x56528dc924da]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 4: (MgrMonitor::send_digests()+0x324) [0x56528dd9ab04]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 5: (C_MonContext::finish(int)+0x27) [0x56528dc7b9f7]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 6: (Context::complete(int)+0x9) [0x56528dcb46a9]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 7: (SafeTimer::timer_thread()+0xec) [0x56528de754bc]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 8: (SafeTimerThread::entry()+0xd) [0x56528de76e4d]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: 9: (()+0x8184) [0x7fcb92ee1184]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: 10: (clone()+0x6d) [0x7fcb917a2bed]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
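
For context on the assert itself: it compares the number of OSDs that are down and in against the number of OSDs that are in. A minimal sketch of how the two counts can diverge, assuming (consistent with the fix proposed in note-1 below) that num_in_osds counts only in OSDs while the pre-fix walk also collected down+out OSDs into the down set. The FakeOsd type and the loop are hypothetical stand-ins for illustration, not the actual OSDMonitor code:

#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for an OSDMap entry; not the real Ceph type.
struct FakeOsd { bool up; bool in; };

int main() {
  // A single OSD that is both down and out.
  std::vector<FakeOsd> osds = {{/*up=*/false, /*in=*/false}};

  int num_in_osds = 0;
  std::set<int> down_in_osds;
  for (int id = 0; id < (int)osds.size(); ++id) {
    if (osds[id].in)
      ++num_in_osds;            // an out OSD is not counted as "in"
    if (!osds[id].up)           // pre-fix logic: no is_out() check...
      down_in_osds.insert(id);  // ...so a down+out OSD still lands here
  }

  // Mirrors the failed assert: fires, because 1 <= 0 is false.
  assert((int)down_in_osds.size() <= num_in_osds);
  return 0;
}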


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #52535: monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds) (Need More Info)

Actions #1

Updated by Sage Weil almost 7 years ago

Appears to be related to this code, which assumes the OSD is in (i.e., not out), but that may not be true. The fix might be something like the below, but I didn't look at this very carefully:

diff --git a/src/osd/OSDMap.cc b/src/osd/OSDMap.cc
index 5503fb4..a89e82e 100644
--- a/src/osd/OSDMap.cc
+++ b/src/osd/OSDMap.cc
@@ -298,10 +298,12 @@ bool OSDMap::subtree_type_is_down(CephContext *cct, int id, int subtree_type, se
 {
   if (id >= 0) {
     bool is_down_ret = is_down(id);
-    if (is_down_ret) {
-      down_in_osds->insert(id);
-    } else {
-      up_in_osds->insert(id);
+    if (!is_out(id)) {
+      if (is_down_ret) {
+        down_in_osds->insert(id);
+      } else {
+        up_in_osds->insert(id);
+      }
     }
     return is_down_ret;
   }
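
If that reading is right, the guard restores the invariant because down_in_osds then only holds OSDs that are both down and in, i.e. a subset of the population that num_in_osds counts, so num_down_in_osds <= num_in_osds holds by construction. In terms of the hypothetical sketch in the description above, the fixed loop body would be:

  if (osds[id].in) {            // plays the role of the !is_out(id) guard
    if (!osds[id].up)
      down_in_osds.insert(id);  // only OSDs that are down *and* in
  }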

Actions #2

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Resolved
Actions #3

Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #52535: monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds) added
Actions #4

Updated by Jan Horacek 2 months ago

We had a similar issue and noticed that we had lots of OSDs marked with status "new".
After a couple of tests, once we got rid of the "new" status (by marking every OSD out and back in, with norebalance/nobackfill set during the operation), it looks like the problem is gone.

Could I ask you to check ceph osd status for these OSDs?

Actions #5

Updated by Radoslaw Zarzynski 2 months ago

  • Tags set to medium-hanging-fruit

There is a starting point for the fix: https://tracker.ceph.com/issues/19989#note-1.
Tagging as medium-hanging-fruit.
