Bug #21687

closed

mgr: mark_down of osd without metadata is broken

Added by Sage Weil over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-10-05 16:47:46.720413 7fd687b4f700 10 mgr.server ms_verify_authorizer registering osd.1 session 0x55a1d08d0ca0 con 0x55a1d09346c0
2017-10-05 16:47:46.720423 7fd687b4f700 10 In get_auth_session_handler for protocol 2
2017-10-05 16:47:46.720492 7fd687b4f700  1 -- 172.21.15.113:6800/260182 >> 172.21.15.159:6800/2561 pipe(0x55a1cf82c000 sd=20 :6800 s=2 pgs=1589 cs=1 l=1 c=0x55a1d09346c0).setting up a delay queue on Pipe 0x55a1cf82c000
2017-10-05 16:47:46.722833 7fd687b4f700 10 _calc_signature seq 1 front_crc_ = 1637758479 middle_crc = 0 data_crc = 0 sig = 3732551276752750330
2017-10-05 16:47:46.722925 7fd68d30f700  1 -- 172.21.15.113:6800/260182 <== osd.1 172.21.15.159:6800/2561 1 ==== mgropen(osd.1) v2 ==== 10+0+0 (1637758479 0 0) 0x55a1d08ccec0 con 0x55a1d09346c0
2017-10-05 16:47:46.722954 7fd68d30f700  4 mgr.server handle_open from 0x55a1d09346c0  osd,1
2017-10-05 16:47:46.722958 7fd68d30f700  1 -- 172.21.15.113:6800/260182 --> 172.21.15.159:6800/2561 -- mgrconfigure() v1 -- ?+0 0x55a1ce8b2400 con 0x55a1d09346c0
2017-10-05 16:47:46.723023 7fd68794d700 10 _calc_signature seq 1 front_crc_ = 2791807819 middle_crc = 0 data_crc = 0 sig = 5971460742716778938
2017-10-05 16:47:46.723039 7fd68794d700 20 Putting signature in client message(seq # 1): sig = 5971460742716778938
2017-10-05 16:47:46.724939 7fd687b4f700 10 _calc_signature seq 2 front_crc_ = 4184223999 middle_crc = 0 data_crc = 0 sig = 14943916633820690389
2017-10-05 16:47:46.725008 7fd687b4f700 10 _calc_signature seq 3 front_crc_ = 296065787 middle_crc = 0 data_crc = 0 sig = 4128270110835625172
2017-10-05 16:47:46.725052 7fd68d30f700  1 -- 172.21.15.113:6800/260182 <== osd.1 172.21.15.159:6800/2561 2 ==== mgrreport(osd.1 +446-0 packed 5398) v4 ==== 39454+0+0 (4184223999 0 0) 0x55a1cf799500 con 0x55a1d09346c0
2017-10-05 16:47:46.725083 7fd68d30f700  4 mgr.server handle_report from 0x55a1d09346c0 osd,1
2017-10-05 16:47:46.725087 7fd68d30f700  1 mgr.server handle_report rejecting report from osd,1, since we do not have its metadata now.
2017-10-05 16:47:46.725090 7fd68d30f700  1 -- 172.21.15.113:6800/260182 mark_down 0x55a1d09346c0 -- 0x55a1cf82c000
2017-10-05 16:47:46.725126 7fd68d30f700  0 ms_deliver_dispatch: unhandled message 0x55a1cf799500 mgrreport(osd.1 +446-0 packed 5398) v4 from osd.1 172.21.15.159:6800/2561
2017-10-05 16:47:46.725423 7fd68d30f700  1 -- 172.21.15.113:6800/260182 <== osd.1 172.21.15.159:6800/2561 3 ==== pg_stats(2 pgs tid 0 v 0) v1 ==== 1312+0+0 (296065787 0 0) 0x55a1cf798300 con 0x55a1d09346c0
2017-10-05 16:47:46.806561 7fd687b4f700  1 -- 172.21.15.113:6800/260182 >> - pipe(0x55a1cf82e800 sd=20 :6800 s=0 pgs=0 cs=0 l=0 c=0x55a1d0934ea0).accept sd=20 -

This happens every 1-2 seconds, for each message we get from the osd.

/a/sage-2017-10-05_16:11:00-rados-wip-sage-testing-2017-10-05-0846-distro-basic-smithi/1706300

An ugly side-effect of this mark_down call is that we aren't removing the con from osd_cons. This fixes that:

diff --git a/src/mgr/DaemonServer.cc b/src/mgr/DaemonServer.cc
index bdb3cc96a7..d9d811998a 100644
--- a/src/mgr/DaemonServer.cc
+++ b/src/mgr/DaemonServer.cc
@@ -426,6 +426,7 @@ bool DaemonServer::handle_report(MMgrReport *m)
       return false;
     }
     m->get_connection()->mark_down();
+    osd_cons[m->get_source().num].erase(session->con);
     session->put();

     return false;

...but that doesn't help with the fact that we kill all osd connections. The result is that we can't send scrub messages from mgr -> osd. Almost all of the failures in this run were affected by this:

http://pulpito.ceph.com/sage-2017-10-05_16:11:00-rados-wip-sage-testing-2017-10-05-0846-distro-basic-smithi/
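
To illustrate the osd_cons bookkeeping problem described above, here is a minimal standalone sketch. It is not the mgr code: FakeConnection, FakeRegistry, and all of their member functions are hypothetical stand-ins, and the container is only assumed to behave like a per-osd set of connection references. It shows why calling mark_down() without erasing the entry leaves a dead connection registered, and how the one-line erase from the diff avoids that.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <set>

// Hypothetical stand-in for a messenger connection (not Ceph's Connection).
struct FakeConnection {
  bool down = false;
  void mark_down() { down = true; }
};
using FakeConnectionRef = std::shared_ptr<FakeConnection>;

// Hypothetical model of the per-osd connection registry. The member name
// osd_cons mirrors the diff; its exact type here is an assumption.
struct FakeRegistry {
  std::map<int, std::set<FakeConnectionRef>> osd_cons;

  void register_con(int osd, const FakeConnectionRef& con) {
    osd_cons[osd].insert(con);
  }

  // Behaviour before the patch: the connection is closed, but its
  // registry entry is left behind.
  void reject_without_erase(int /*osd*/, const FakeConnectionRef& con) {
    con->mark_down();
  }

  // Behaviour with the patch: close the connection and drop it from
  // the registry as well.
  void reject_with_erase(int osd, const FakeConnectionRef& con) {
    con->mark_down();
    auto it = osd_cons.find(osd);
    if (it != osd_cons.end()) {
      it->second.erase(con);
    }
  }

  // Number of registered connections for this osd that are already down.
  std::size_t stale_count(int osd) const {
    auto it = osd_cons.find(osd);
    if (it == osd_cons.end()) {
      return 0;
    }
    std::size_t n = 0;
    for (const auto& c : it->second) {
      if (c->down) {
        ++n;
      }
    }
    return n;
  }
};
```

In this model, rejecting a registered connection with reject_without_erase leaves stale_count at 1 for that osd, while reject_with_erase leaves it at 0. As the description notes, cleaning up the registry this way does not address the larger problem that the connections are being killed in the first place.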


Related issues 2 (0 open, 2 closed)

Related to mgr - Bug #20887: Services reported with blank hostname by mgr (Resolved, 08/02/2017)

Copied to mgr - Backport #22197: luminous: mgr: mark_down of osd without metadata is broken (Resolved, Nathan Cutler)
