Bug #19989

"OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)" in rados run

Added by Yuri Weinstein almost 7 years ago. Updated 2 months ago.

Status: Resolved
Priority: Immediate
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags: medium-hanging-fruit
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://qa-proxy.ceph.com/teuthology/yuriw-2017-05-19_03:47:57-rados-wip-yuri-testing_2017_5_19---basic-smithi
Job: 1194864
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2017-05-19_03:47:57-rados-wip-yuri-testing_2017_5_19---basic-smithi/1194864/teuthology.log

2017-05-19T03:57:24.271 INFO:tasks.ceph.mon.a.smithi077.stderr:/build/ceph-12.0.2-1341-g406a26a/src/mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::get_health(std::list<std::pair<health_status_t, std::basic_string<char> > >&, std::list<std::pair<health_status_t, std::basic_string<char> > >*, CephContext*) const' thread 7fcb8da81700 time 2017-05-19 03:57:23.973763
2017-05-19T03:57:24.271 INFO:tasks.ceph.mon.a.smithi077.stderr:/build/ceph-12.0.2-1341-g406a26a/src/mon/OSDMonitor.cc: 3545: FAILED assert(num_down_in_osds <= num_in_osds)
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: ceph version 12.0.2-1341-g406a26a (406a26a1c327a13df48890994379a5ebe7ccda97)
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x56528de789fe]
2017-05-19T03:57:24.272 INFO:tasks.ceph.mon.a.smithi077.stderr: 2: (OSDMonitor::get_health(std::list<std::pair<health_status_t, std::string>, std::allocator<std::pair<health_status_t, std::string> > >&, std::list<std::pair<health_status_t, std::string>, std::allocator<std::pair<health_status_t, std::string> > >*, CephContext*) const+0x2831) [0x56528dcfd471]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 3: (Monitor::get_health(std::list<std::string, std::allocator<std::string> >&, ceph::buffer::list*, ceph::Formatter*)+0xca) [0x56528dc924da]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 4: (MgrMonitor::send_digests()+0x324) [0x56528dd9ab04]
2017-05-19T03:57:24.273 INFO:tasks.ceph.mon.a.smithi077.stderr: 5: (C_MonContext::finish(int)+0x27) [0x56528dc7b9f7]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 6: (Context::complete(int)+0x9) [0x56528dcb46a9]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 7: (SafeTimer::timer_thread()+0xec) [0x56528de754bc]
2017-05-19T03:57:24.274 INFO:tasks.ceph.mon.a.smithi077.stderr: 8: (SafeTimerThread::entry()+0xd) [0x56528de76e4d]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: 9: (()+0x8184) [0x7fcb92ee1184]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: 10: (clone()+0x6d) [0x7fcb917a2bed]
2017-05-19T03:57:24.277 INFO:tasks.ceph.mon.a.smithi077.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
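
For context on the assert itself: it compares the number of OSDs that are down and in against the number of OSDs that are in. A minimal sketch of how the two counts can diverge, assuming (consistent with the fix proposed in note-1 below) that num_in_osds counts only in OSDs while the pre-fix walk also collected down+out OSDs into the down set. The FakeOsd type and the loop are hypothetical stand-ins for illustration, not the actual OSDMonitor code:

#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for an OSDMap entry; not the real Ceph type.
struct FakeOsd { bool up; bool in; };

int main() {
  // A single OSD that is both down and out.
  std::vector<FakeOsd> osds = {{/*up=*/false, /*in=*/false}};

  int num_in_osds = 0;
  std::set<int> down_in_osds;
  for (int id = 0; id < (int)osds.size(); ++id) {
    if (osds[id].in)
      ++num_in_osds;            // an out OSD is not counted as "in"
    if (!osds[id].up)           // pre-fix logic: no is_out() check...
      down_in_osds.insert(id);  // ...so a down+out OSD still lands here
  }

  // Mirrors the failed assert: fires, because 1 <= 0 is false.
  assert((int)down_in_osds.size() <= num_in_osds);
  return 0;
}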


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #52535: monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds) (Need More Info)

Actions #1

Updated by Sage Weil almost 7 years ago

Appears to be related to this code, which assumes the OSD is in (i.e., not out), but that may not be true. The fix might be something like the below, but I didn't look at this very carefully:

diff --git a/src/osd/OSDMap.cc b/src/osd/OSDMap.cc
index 5503fb4..a89e82e 100644
--- a/src/osd/OSDMap.cc
+++ b/src/osd/OSDMap.cc
@@ -298,10 +298,12 @@ bool OSDMap::subtree_type_is_down(CephContext *cct, int id, int subtree_type, se
 {
   if (id >= 0) {
     bool is_down_ret = is_down(id);
-    if (is_down_ret) {
-      down_in_osds->insert(id);
-    } else {
-      up_in_osds->insert(id);
+    if (!is_out(id)) {
+      if (is_down_ret) {
+        down_in_osds->insert(id);
+      } else {
+        up_in_osds->insert(id);
+      }
     }
     return is_down_ret;
   }
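
If that reading is right, the guard restores the invariant because down_in_osds then only holds OSDs that are both down and in, i.e. a subset of the population that num_in_osds counts, so num_down_in_osds <= num_in_osds holds by construction. In terms of the hypothetical sketch in the description above, the fixed loop body would be:

  if (osds[id].in) {            // plays the role of the !is_out(id) guard
    if (!osds[id].up)
      down_in_osds.insert(id);  // only OSDs that are down *and* in
  }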

Actions #2

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Resolved
Actions #3

Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #52535: monitor crashes after an OSD got destroyed: OSDMap.cc: 5686: FAILED ceph_assert(num_down_in_osds <= num_in_osds) added
Actions #4

Updated by Jan Horacek 2 months ago

We had a similar issue and noticed that we had lots of OSDs marked with status "new".
After a couple of tests, once we got rid of the "new" status (by marking every OSD out and back in, with norebalance/nobackfill set during the operation), it looks like the problem is gone.

Could I ask you to check ceph osd status for these OSDs?

Actions #5

Updated by Radoslaw Zarzynski 2 months ago

  • Tags set to medium-hanging-fruit

There is a starting point for the fix: https://tracker.ceph.com/issues/19989#note-1.
Tagging as medium-hanging-fruit.
