Feature #7344
osd: add additional heartbeat on cluster interface
Status: Closed
% Done: 0%
Description
A user had a switch configuration problem (jumbo frames were not enabled) that prevented progress on the cluster interface but allowed heartbeats to go through. The cluster was unaware of the networking issue, and all PGs got stuck in various stages of peering.
Add another layer of heartbeat on the cluster interface that has a higher timeout so that if things are stalled out we can detect it. Possibly indicate in the failure report what the nature of the failure is so that it is easier for an admin to resolve the problem.
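A minimal sketch of how the two heartbeat layers could be combined into a failure report that names the nature of the failure. This is not Ceph code: `PeerHealth`, `classify`, the timeout constants, and the message strings are all hypothetical, and the 2-second and 10-second values are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical timeouts in seconds; real values would come from configuration.
FAST_TIMEOUT = 2.0      # existing low-latency heartbeat
CLUSTER_TIMEOUT = 10.0  # proposed slower heartbeat on the cluster connection


@dataclass
class PeerHealth:
    """Time (in seconds) at which each heartbeat layer last got a reply."""
    last_fast_ack: float     # reply seen on the dedicated heartbeat connection
    last_cluster_ack: float  # reply seen on the cluster data connection


def classify(peer: PeerHealth, now: float) -> Optional[str]:
    """Return a failure description for the peer, or None if it looks healthy."""
    fast_stale = now - peer.last_fast_ack > FAST_TIMEOUT
    cluster_stale = now - peer.last_cluster_ack > CLUSTER_TIMEOUT

    if fast_stale:
        # The fast heartbeat is the stronger signal: the peer is not answering at all.
        return "peer unreachable: heartbeat timed out"
    if cluster_stale:
        # Small heartbeats still pass, but the cluster connection is not making progress.
        return "cluster interface stalled: heartbeats pass but cluster traffic does not"
    return None
```

With this split, a peer whose heartbeats are answered but whose cluster connection has been silent for longer than `CLUSTER_TIMEOUT` is reported as a stalled cluster interface rather than as healthy, which is the case described in this ticket.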
Updated by Greg Farnum about 10 years ago
If we have to do heartbeating over the exact same connection we send our other traffic on, is there any advantage to having the separate heartbeat messenger at all? Since it has its own dispatch loop it can obviously be lower-latency, but it's not clear to me that we actually gain that much advantage from multiple low-latency connections if we can't count on it to mean we're functioning.
(i.e., if the cluster network fails, we might take 10 seconds to detect it instead of 2, but does that kind of detection speed actually help us in that failure scenario?)