Feature #52459
open
mds: add failed connections warning
Added by xinyu wang over 2 years ago.
Updated over 2 years ago.
Description
The MDS gets stuck and keeps retrying the connection if it cannot connect to another daemon, for example some OSDs
(both the MDS and the OSDs can communicate with the mon, but they cannot communicate with each other).
Add a new MDSHealthMetric to report failed connections.
Messenger records the failed connections.
MDS gets failed connections from messenger and reports them to mon by beacon.
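The proposed flow can be illustrated with a minimal Python sketch. This is not Ceph code: the class FailedConnectionLog, the function build_health_metric, the metric name MDS_HEALTH_FAILED_CONNECTIONS and the min_failures threshold are all hypothetical names chosen for illustration. It only shows the idea of a messenger-side record of failed peers that is summarized into a health metric for the next beacon.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FailedConnectionLog:
    """Hypothetical stand-in for the Messenger-side record of failed connections."""
    # Maps (ip, port) -> (consecutive failure count, timestamp of first failure).
    failures: dict = field(default_factory=dict)

    def record_failure(self, ip, port, now=None):
        now = time.time() if now is None else now
        count, first = self.failures.get((ip, port), (0, now))
        self.failures[(ip, port)] = (count + 1, first)

    def record_success(self, ip, port):
        # A successful connect clears the record for that peer.
        self.failures.pop((ip, port), None)


def build_health_metric(log, min_failures=3):
    """Build a hypothetical health-metric payload to attach to the next beacon."""
    peers = sorted(
        f"{ip}:{port}"
        for (ip, port), (count, _first) in log.failures.items()
        if count >= min_failures
    )
    if not peers:
        return None
    return {
        "type": "MDS_HEALTH_FAILED_CONNECTIONS",  # hypothetical metric name
        "message": f"{len(peers)} failed connection(s): {', '.join(peers)}",
        "peers": peers,
    }
```

Requiring several consecutive failures before reporting is an assumption of this sketch, meant to keep a single transient connect error from raising a warning.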
- Project changed from Ceph to CephFS
What specific scenario are you trying to avoid here? Messenger-level failed connection warnings are probably not appropriate, but we may be able to come up with a heuristic for when we can't communicate with OSDs the monitor says are up.
Greg Farnum wrote:
What specific scenario are you trying to avoid here? Messenger-level failed connection warnings are probably not appropriate, but we may be able to come up with a heuristic for when we can't communicate with OSDs the monitor says are up.
As you mentioned, this MDSHealthMetric tries to avoid the situation where the mon tells the MDS that these OSDs are up but the MDS cannot connect to them due to network problems. The peer IP and port of each failed connection can be reported in the HealthMetric, which makes it easier to locate specific network problems.
We want to identify and record failed connections in the Messenger because these records could also be used by other daemons in similar situations, such as the OSD and mgr.
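One way to picture the heuristic discussed here (warn only about OSDs the monitor says are up but that cannot be reached) is the following Python sketch. The function name unreachable_up_osds, the shape of its inputs and the grace period are assumptions made for illustration, not an existing Ceph interface.

```python
def unreachable_up_osds(up_osds, failed_since, now, grace=30.0):
    """Return (osd_id, "ip:port") pairs for OSDs marked up but unreachable.

    up_osds:      dict osd_id -> (ip, port), as reported up by the monitor
                  (assumed shape, for illustration only)
    failed_since: dict (ip, port) -> timestamp of the first consecutive failure
    grace:        seconds a peer must stay unreachable before it is reported
                  (hypothetical threshold)
    """
    result = []
    for osd_id, addr in sorted(up_osds.items()):
        first_failure = failed_since.get(addr)
        if first_failure is not None and now - first_failure >= grace:
            result.append((osd_id, f"{addr[0]}:{addr[1]}"))
    return result
```

Failed peers that are not in the monitor's up set are ignored, so the result contains only the case described above, along with the peer IP and port needed to locate the network problem.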
- Status changed from New to Need More Info
xinyu wang wrote:
Greg Farnum wrote:
What specific scenario are you trying to avoid here? Messenger-level failed connection warnings are probably not appropriate, but we may be able to come up with a heuristic for when we can't communicate with OSDs the monitor says are up.
As you mentioned, this MDSHealthMetric tries to avoid the situation where the mon tells the MDS that these OSDs are up but the MDS cannot connect to them due to network problems. The peer IP and port of each failed connection can be reported in the HealthMetric, which makes it easier to locate specific network problems.
We want to identify and record failed connections in the Messenger because these records could also be used by other daemons in similar situations, such as the OSD and mgr.
Would the "slow metadata i/o" warnings not be generated in this situation? That would seem sufficient I think?