Bug #65401

msg: connection between mgr and osd goes down periodically, which leads to heavy load on mgr

Added by Xinying Song about 1 month ago. Updated 26 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The connection between an OSD and the mgr is periodically marked down (mark_down) because of the ms_connection_idle_timeout config option.
In a large-scale cluster, these periodic mark-downs lead to high message-processing latency in the mgr.
The mgrc in each OSD periodically sends pg_stats to the mgr, so when the connection is down, the mgrc first has to re-establish it.
Since all message processing in the mgr is handled by a single dispatch queue, that queue easily becomes a bottleneck.
In one of my environments with 1200 OSDs, processing a pg_stats message in the mgr can take up to 9 seconds,
mainly because the mgr is busy processing the many mgrreport messages that are sent whenever an mgrc connects to the mgr,
while other messages have to wait in the queue.
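
To illustrate why a single dispatch queue turns a burst of reconnects into latency for unrelated messages, here is a minimal toy model in Python. It is not Ceph code: the function name `queue_waits` and the per-message costs (5 ms per report, 1 ms for the stats message) are made-up assumptions chosen only to show the effect, not measurements from the cluster described above.

```python
from collections import deque


def queue_waits(service_times_ms):
    """Return how long each message waits before a single worker starts it.

    Assumes every message is already queued at time 0 and is handled
    strictly in FIFO order by one worker (a toy single dispatch queue).
    """
    waits = []
    clock_ms = 0
    pending = deque(service_times_ms)
    while pending:
        waits.append(clock_ms)          # time spent waiting in the queue
        clock_ms += pending.popleft()   # worker is busy for this long
    return waits


# Hypothetical numbers: 1200 daemons reconnect at once and each sends a
# report costing 5 ms; one stats message (1 ms) is queued behind them.
burst = [5] * 1200
waits = queue_waits(burst + [1])
stats_wait_ms = waits[-1]
print(f"stats message waited {stats_wait_ms} ms before processing started")
```

In this model the last message waits for the sum of all service times ahead of it, so the delay grows linearly with the number of daemons reconnecting at the same time.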

The mark-down logic depends on the AsyncConnection::last_active counter. It is updated every time data comes in,
but not always when data is sent; this is the root cause of the problem.
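
The effect of refreshing the activity timestamp only on receive can be sketched with a small simulation. This is a toy model, not the AsyncConnection implementation: the class `ToyConnection`, the `touch_on_send` flag, the 5-second report interval, and the 900-second idle timeout are all assumptions made for illustration, and the model assumes the peer never sends anything back.

```python
from dataclasses import dataclass


@dataclass
class ToyConnection:
    """Hypothetical stand-in for a messenger connection (not Ceph code)."""
    idle_timeout: float
    touch_on_send: bool
    last_active: float = 0.0
    is_up: bool = True
    reconnects: int = 0

    def send(self, now: float) -> None:
        if not self.is_up:
            # The sender has to re-establish the session before reporting.
            self.is_up = True
            self.reconnects += 1
            self.last_active = now
        if self.touch_on_send:
            # Candidate behaviour: outgoing traffic also counts as activity.
            self.last_active = now

    def tick(self, now: float) -> None:
        # Periodic idle check: mark down when no activity was recorded.
        if self.is_up and now - self.last_active > self.idle_timeout:
            self.is_up = False


def simulate(touch_on_send: bool, duration: int = 3600,
             report_interval: int = 5, idle_timeout: float = 900.0) -> int:
    """Count reconnects for a sender that reports at a fixed interval."""
    conn = ToyConnection(idle_timeout=idle_timeout, touch_on_send=touch_on_send)
    for now in range(1, duration + 1):
        if now % report_interval == 0:
            conn.send(float(now))
        conn.tick(float(now))
    return conn.reconnects


recv_only = simulate(touch_on_send=False)
send_and_recv = simulate(touch_on_send=True)
print(f"reconnects, receive-only updates: {recv_only}")
print(f"reconnects, send and receive updates: {send_and_recv}")
```

Under these assumptions, a connection that is busy sending but never receives still looks idle and is repeatedly marked down, while counting sends as activity keeps it up for the whole run.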
