Feature #52459

open

mds: add failed connections warning

Added by xinyu wang over 2 years ago. Updated over 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

The MDS gets stuck and keeps retrying the connection if it cannot connect to another daemon, for example some OSDs (both the MDS and the OSDs can communicate with the mon, but they cannot communicate with each other).
Add a new MDSHealthMetric to report failed connections.
The Messenger records the failed connections.
The MDS gets the failed connections from the Messenger and reports them to the mon in its beacon.
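
A rough sketch of the proposed flow, in Python rather than Ceph's C++ and only to illustrate the idea: a log of failed connections that a messenger could maintain, and a function that turns it into a health metric for the beacon. Every name here (`FailedConnectionLog`, `build_health_metric`, the `MDS_HEALTH_FAILED_CONNECTIONS` label, the `min_attempts` threshold) is hypothetical and does not correspond to existing Ceph code.

```python
import time
from dataclasses import dataclass


@dataclass
class FailedConnection:
    peer_type: str        # e.g. "osd"
    peer_addr: str        # "ip:port" of the peer that could not be reached
    first_failure: float  # timestamp of the first failed attempt
    attempts: int = 1     # number of failed attempts so far


class FailedConnectionLog:
    """Hypothetical bookkeeping a messenger could keep per peer address."""

    def __init__(self):
        self._failed = {}

    def record_failure(self, peer_type, peer_addr, now=None):
        now = time.time() if now is None else now
        entry = self._failed.get(peer_addr)
        if entry is None:
            self._failed[peer_addr] = FailedConnection(peer_type, peer_addr, now)
        else:
            entry.attempts += 1

    def record_success(self, peer_addr):
        # A successful connect clears the failure record for that peer.
        self._failed.pop(peer_addr, None)

    def snapshot(self):
        return sorted(self._failed.values(), key=lambda c: c.peer_addr)


def build_health_metric(log, min_attempts=3):
    """Return a metric dict for the beacon, or None if nothing to report.

    min_attempts is an assumed threshold so that a single transient
    failure does not raise a warning.
    """
    peers = [c for c in log.snapshot() if c.attempts >= min_attempts]
    if not peers:
        return None
    return {
        "type": "MDS_HEALTH_FAILED_CONNECTIONS",  # hypothetical metric name
        "message": f"{len(peers)} failed connection(s)",
        "peers": [f"{c.peer_type} {c.peer_addr}" for c in peers],
    }
```

The daemon would call `build_health_metric` when it assembles a beacon and attach the result only when it is not `None`.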

Actions #1

Updated by Greg Farnum over 2 years ago

  • Project changed from Ceph to CephFS
Actions #2

Updated by Greg Farnum over 2 years ago

What specific scenario are you trying to avoid here? Messenger-level failed connection warnings are probably not appropriate, but we may be able to come up with a heuristic for when we can't communicate with OSDs the monitor says are up.

Actions #3

Updated by xinyu wang over 2 years ago

Greg Farnum wrote:

What specific scenario are you trying to avoid here? Messenger-level failed connection warnings are probably not appropriate, but we may be able to come up with a heuristic for when we can't communicate with OSDs the monitor says are up.

As you mentioned, this MDSHealthMetric is meant to catch the situation where the mon tells the MDS that these OSDs are up but the MDS cannot connect to them due to network problems. The peer IP and port of the failed connections can be reported in the HealthMetric, which makes it easier to locate specific network problems.
We want to identify and record failed connections in the Messenger because the recorded connections may be useful in other similar situations, such as for the OSD and mgr.
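
One possible shape for the heuristic discussed here, as a hedged Python sketch: report only peers that the monitor's map marks up but that have been unreachable for longer than a grace period. The function name, the input dictionaries, and the `grace` value are all assumptions for illustration and are not taken from Ceph.

```python
def unreachable_up_osds(osd_up_addrs, failed_since, now, grace=30.0):
    """List (osd_id, "ip:port") for OSDs marked up but unreachable.

    osd_up_addrs: dict of osd_id -> "ip:port" for OSDs the map marks up.
    failed_since: dict of "ip:port" -> timestamp of the first failed connect.
    grace: assumed number of seconds to wait before reporting a peer.
    """
    result = []
    for osd_id, addr in sorted(osd_up_addrs.items()):
        since = failed_since.get(addr)
        if since is not None and now - since >= grace:
            result.append((osd_id, addr))
    return result
```

Failures against peers that the map already marks down are ignored, so the warning is limited to the case where the monitor's view and the MDS's connectivity disagree.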

Actions #4

Updated by Patrick Donnelly over 2 years ago

  • Status changed from New to Need More Info

xinyu wang wrote:

Greg Farnum wrote:

What specific scenario are you trying to avoid here? Messenger-level failed connection warnings are probably not appropriate, but we may be able to come up with a heuristic for when we can't communicate with OSDs the monitor says are up.

As you mentioned, this MDSHealthMetric is meant to catch the situation where the mon tells the MDS that these OSDs are up but the MDS cannot connect to them due to network problems. The peer IP and port of the failed connections can be reported in the HealthMetric, which makes it easier to locate specific network problems.
We want to identify and record failed connections in the Messenger because the recorded connections may be useful in other similar situations, such as for the OSD and mgr.

Would the "slow metadata i/o" warnings not be generated in this situation? That would seem sufficient I think?
