Feature #7344

closed

osd: add additional heartbeat on cluster interface

Added by Sage Weil about 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

A user had a switch configuration problem (no jumbo frames) that prevented progress on the cluster interface but still allowed heartbeats to go through. The cluster was unaware that there was a networking issue, and all PGs got stuck in various stages of peering.

Add another layer of heartbeat on the cluster interface, with a higher timeout, so that a stall on that interface can be detected. Possibly indicate the nature of the failure in the failure report so that it is easier for an admin to resolve the problem.
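To make the proposal concrete, here is a minimal sketch (not Ceph code) of how two heartbeat layers with different timeouts could be combined to tell a dead peer apart from a stalled cluster interface. All names (PeerState, PeerTimestamps, classify_peer) are hypothetical, and the 2 s and 10 s timeouts are illustrative values only, not actual Ceph defaults.

```python
from dataclasses import dataclass
from enum import Enum


class PeerState(Enum):
    """Hypothetical failure categories for a failure report."""
    HEALTHY = "healthy"
    PEER_DOWN = "no reply on the dedicated heartbeat channel"
    CLUSTER_STALLED = "heartbeats pass but the cluster interface is stalled"


@dataclass
class PeerTimestamps:
    last_fast_ack: float     # last reply seen on the dedicated heartbeat channel
    last_cluster_ack: float  # last reply seen on the cluster interface itself


def classify_peer(ts: PeerTimestamps, now: float,
                  fast_timeout: float = 2.0,       # assumed value
                  cluster_timeout: float = 10.0    # assumed, deliberately higher
                  ) -> PeerState:
    """Return the most specific failure category for one peer."""
    # The short-timeout heartbeat is checked first: if it fails, the peer
    # is treated as unreachable regardless of the cluster interface.
    if now - ts.last_fast_ack > fast_timeout:
        return PeerState.PEER_DOWN
    # Heartbeats are fine, but nothing has come back on the cluster
    # interface within the longer timeout: report a stall.
    if now - ts.last_cluster_ack > cluster_timeout:
        return PeerState.CLUSTER_STALLED
    return PeerState.HEALTHY


if __name__ == "__main__":
    # Fast heartbeat replied 1 s ago, cluster interface silent for 15 s.
    stalled = PeerTimestamps(last_fast_ack=99.0, last_cluster_ack=85.0)
    print(classify_peer(stalled, now=100.0).value)
```

Returning a distinct category for the stalled case is what would let a failure report name the kind of problem, rather than only marking the peer as failed.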

Actions #1

Updated by Greg Farnum about 10 years ago

If we have to do heartbeating over the exact same connection we send our other traffic on, is there any advantage to having the separate heartbeat messenger at all? Since it has its own dispatch loop it can obviously be lower-latency, but it's not clear to me that we actually gain that much advantage from multiple low-latency connections if we can't count on it to mean we're functioning.
(i.e., if the cluster network fails we might take 10 seconds to detect it instead of 2, but does that kind of detection speed actually help us in that failure scenario?)

Actions #2

Updated by Sage Weil over 9 years ago

  • Status changed from New to Resolved