Bug #65401

open

msg: connection between mgr and osd is periodically down, which leads to heavy load on mgr

Added by Xinying Song 24 days ago. Updated 11 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I found that the connections between the OSDs and the mgr are periodically marked down (mark_down) because of the ms_connection_idle_timeout config.
In a large-scale cluster, these periodic mark-downs lead to high message-processing latency in the mgr.
The mgrc in each OSD has to send pg_stats to the mgr periodically, so when the connection is down, the mgrc must first re-establish it.
Since all message processing in the mgr is handled by a single dispatch queue, it easily becomes a bottleneck.
In one of my environments with 1200 OSDs, processing a pg_stats message in the mgr can take up to 9 seconds,
mainly because the mgr is busy processing the many mgrreport messages that are sent whenever an mgrc connects to the mgr,
while other messages have to wait in the queue.
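
To illustrate why a single dispatch queue turns a burst of reconnect reports into latency for everything behind it, here is a minimal sketch of a FIFO worker. It is a toy model, not Ceph code: the function name `queue_waits`, the assumption that all messages arrive at the same moment, and the service times used in the example are all hypothetical and were not measured on any cluster.

```python
def queue_waits(service_times_ms):
    """Return how long each message waits before a single FIFO worker
    starts processing it, assuming all messages arrive at time 0."""
    waits = []
    elapsed_ms = 0
    for service_ms in service_times_ms:
        waits.append(elapsed_ms)   # this message waits for everything ahead of it
        elapsed_ms += service_ms   # the worker is then busy for service_ms
    return waits


# Hypothetical burst: 1000 reconnect reports at 5 ms each,
# followed by one pg_stats message at the back of the queue.
burst = [5] * 1000 + [1]
pg_stats_wait_ms = queue_waits(burst)[-1]
```

In this model the last message waits for the sum of all service times ahead of it, so the delay grows linearly with the number of sessions that reconnect at about the same time.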

The mark-down logic depends on the AsyncConnection::last_active counter. It is updated every time data comes in,
but not always when data is sent; this is the root cause of the problem.
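
The effect can be shown with a small simulation. This is a simplified sketch and not the AsyncConnection implementation: the class `ToyConnection`, the flag `refresh_on_send`, and the 5-second send period are assumptions made for illustration, and the model treats the send path as never refreshing the timestamp (the real code refreshes it in some cases). The sender writes periodically and never receives anything back.

```python
from dataclasses import dataclass


@dataclass
class ToyConnection:
    idle_timeout: float      # plays the role of ms_connection_idle_timeout
    refresh_on_send: bool    # hypothetical switch: does sending count as activity?
    last_active: float = 0.0
    reconnects: int = 0

    def send(self, now: float) -> None:
        # If the idle check would already have fired, model a mark_down
        # followed by a reconnect before the data can go out.
        if now - self.last_active >= self.idle_timeout:
            self.reconnects += 1
            self.last_active = now   # a fresh session starts out active
        if self.refresh_on_send:
            self.last_active = now

    def receive(self, now: float) -> None:
        # Incoming data always refreshes the timestamp.
        self.last_active = now


def simulate(refresh_on_send, duration=600, send_period=5, idle_timeout=60):
    """Count reconnects for a sender that only writes and never reads."""
    conn = ToyConnection(idle_timeout, refresh_on_send)
    for now in range(send_period, duration + 1, send_period):
        conn.send(float(now))
    return conn.reconnects


without_refresh = simulate(refresh_on_send=False)
with_refresh = simulate(refresh_on_send=True)
```

With these parameters the send-only connection is torn down once per idle timeout when sends do not refresh the timestamp, and never when they do.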

Actions #1

Updated by Xinying Song 24 days ago

I'm not sure whether this is by design or a mistake, so I pushed a PR for discussion: https://github.com/ceph/ceph/pull/56811

Actions #2

Updated by Xinying Song 18 days ago

Could anyone give a review on this? Thanks very much!

Actions #3

Updated by Xinying Song 11 days ago

The periodic connection fault can be found in the log with the following steps:
1. Set ms_connection_idle_timeout=60 and debug_mgrc=4 in ceph.conf.
2. Restart one OSD, say osd.0.
3. `tail -f osd.0.log | grep ':6804'` will show "Terminating session" and "Starting new session" every 60 or 120 seconds. 6804 is the port that the mgr listens on and that the mgrc in the OSD connects to.
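
To check the interval between reconnects without watching the log by eye, a short script along these lines could be used. It is only a sketch: it assumes each log line begins with a timestamp of the form `2024-04-10T12:00:00.000+0000`, and the sample lines below are made up for illustration rather than copied from a real OSD log. The function name `session_event_gaps` is hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout of the first whitespace-separated field.
STAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def session_event_gaps(lines, marker="Starting new session"):
    """Return the gaps in seconds between consecutive lines containing marker."""
    times = []
    for line in lines:
        if marker in line:
            stamp = line.split(maxsplit=1)[0]
            times.append(datetime.strptime(stamp, STAMP_FORMAT))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Made-up sample lines; only the timestamps and the marker text matter here.
sample = [
    "2024-04-10T12:00:00.000+0000 mgrc Starting new session with 10.0.0.1:6804",
    "2024-04-10T12:01:00.000+0000 mgrc Terminating session with 10.0.0.1:6804",
    "2024-04-10T12:01:00.000+0000 mgrc Starting new session with 10.0.0.1:6804",
    "2024-04-10T12:03:00.000+0000 mgrc Terminating session with 10.0.0.1:6804",
    "2024-04-10T12:03:00.000+0000 mgrc Starting new session with 10.0.0.1:6804",
]
gaps = session_event_gaps(sample)
```

Gaps that cluster around the configured ms_connection_idle_timeout, or around twice that value, would match the behavior described in the steps above.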
