Bug #65401
openmsg: connection between mgr and osd is periodically marked down, which puts heavy load on mgr
Description
I found that the connections between the OSDs and the mgr are periodically marked down (mark_down) because of the ms_connection_idle_timeout config option.
These periodic mark-downs lead to high message-processing latency in the mgr on large-scale clusters.
The mgrc in each OSD periodically sends pg_stats to the mgr, so when the connection is down, the mgrc has to re-establish it first.
Since all message processing in the mgr is handled by a single dispatch queue, that queue easily becomes a bottleneck.
In one of my environments with 1200 OSDs, processing a pg_stats message in the mgr can take up to 9 seconds,
mainly because the mgr is busy processing the many mgrreport messages that are sent whenever a mgrc is trying to connect to the mgr,
and other messages have to wait in the queue.
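
To illustrate the queueing effect described above, here is a minimal sketch of a single FIFO worker. It is not Ceph code: ToyMessage, fifo_latencies and latency_behind_burst are hypothetical names, and the per-message costs are made-up values chosen only to show how a burst of reports delays the message queued behind it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical message record for illustration only; not a Ceph type.
struct ToyMessage {
  long arrival_ms;  // time at which the message was enqueued
  long cost_ms;     // assumed handling time once it reaches the worker
};

// Model a single dispatch worker that handles messages strictly in FIFO
// order. Returns, for each message, the time from enqueue to completion.
std::vector<long> fifo_latencies(const std::vector<ToyMessage>& msgs) {
  std::vector<long> latencies;
  long clock_ms = 0;  // time at which the worker becomes free
  for (const auto& m : msgs) {
    if (clock_ms < m.arrival_ms) {
      clock_ms = m.arrival_ms;  // worker was idle until this message arrived
    }
    clock_ms += m.cost_ms;      // worker is busy handling this message
    latencies.push_back(clock_ms - m.arrival_ms);
  }
  return latencies;
}

// Latency of one stats message queued behind `burst` report messages that
// all arrived at the same instant (for example after a wave of reconnects).
long latency_behind_burst(std::size_t burst, long report_cost_ms,
                          long stats_cost_ms) {
  std::vector<ToyMessage> msgs(burst, ToyMessage{0, report_cost_ms});
  msgs.push_back(ToyMessage{0, stats_cost_ms});
  return fifo_latencies(msgs).back();
}
```

In this toy model the latency of the last message grows linearly with the size of the burst ahead of it, which is the shape of the problem when many mgrc instances reconnect at around the same time.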
The mark-down logic depends on the AsyncConnection::last_active counter. It is updated every time data comes in,
but it is not always updated when data is sent; this is the root cause of the problem.
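
The effect can be reproduced with a small toy model. This is only a sketch and not the actual AsyncConnection implementation: ToyConnection, count_reconnects and the refresh_on_send flag are hypothetical, time is counted in whole seconds, and the peer is assumed to send only and never receive anything, which is the worst case for the behavior described above.

```cpp
// Hypothetical toy model of an idle-timeout check; not the real
// AsyncConnection code. Time is modeled in whole seconds.
struct ToyConnection {
  long last_active = 0;   // time of the last activity that was recorded
  long idle_timeout;      // plays the role of ms_connection_idle_timeout
  bool refresh_on_send;   // false models "sending does not count as activity"
  bool down = false;

  ToyConnection(long timeout, bool refresh)
      : idle_timeout(timeout), refresh_on_send(refresh) {}

  void on_receive(long now) { last_active = now; }

  void on_send(long now) {
    if (refresh_on_send) last_active = now;
  }

  // Periodic check: mark the connection down once it looks idle.
  void tick(long now) {
    if (now - last_active > idle_timeout) down = true;
  }
};

// Simulate a peer that sends a report every `period` seconds and never
// receives anything. Returns how often it had to reconnect in `duration`.
int count_reconnects(bool refresh_on_send, long idle_timeout, long period,
                     long duration) {
  ToyConnection conn(idle_timeout, refresh_on_send);
  int reconnects = 0;
  for (long now = 0; now <= duration; ++now) {
    conn.tick(now);
    if (now % period == 0) {
      if (conn.down) {
        // The sender has to establish a fresh connection before sending.
        conn = ToyConnection(idle_timeout, refresh_on_send);
        conn.last_active = now;
        ++reconnects;
      }
      conn.on_send(now);
    }
  }
  return reconnects;
}
```

With an illustrative timeout of 900 seconds and a report every 5 seconds, count_reconnects(false, 900, 5, 3600) returns 3 while count_reconnects(true, 900, 5, 3600) returns 0: in this model a busy sender is still treated as idle unless sends also refresh the counter.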