Feature #7344: osd: add additional heartbeat on cluster interface - Ceph

closed

Added by Sage Weil about 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: High
Assignee:
Category: OSD
Target version:
% Done:
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

A user had a switch configuration problem (no jumbo frames) that prevented progress on the cluster interface but still allowed heartbeats to go through. The cluster was unaware that there was a networking issue, and all PGs got stuck in various stages of peering.

Add another layer of heartbeat on the cluster interface, with a higher timeout, so that we can detect when things are stalled out. Possibly indicate the nature of the failure in the failure report so that it is easier for an admin to resolve the problem.
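
A minimal sketch of the two-tier check proposed above, in Python for brevity rather than the OSD's C++. Every name here (`PeerState`, `check_peer`, `FAST_GRACE`, `SLOW_GRACE`) is hypothetical, and the timeout values are illustrative only; none of this is taken from the Ceph code base or its configuration options.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical grace periods in seconds; not Ceph defaults or option names.
FAST_GRACE = 2.0    # existing heartbeat path
SLOW_GRACE = 10.0   # additional heartbeat on the cluster interface


@dataclass
class PeerState:
    last_fast_reply: float  # time of last reply on the existing heartbeat path
    last_slow_reply: float  # time of last reply on the cluster interface


def check_peer(peer: PeerState, now: float) -> Optional[str]:
    """Return a reason string for the failure report, or None if healthy."""
    fast_ok = now - peer.last_fast_reply <= FAST_GRACE
    slow_ok = now - peer.last_slow_reply <= SLOW_GRACE
    if not fast_ok:
        # The ordinary heartbeat already catches this case.
        return "no heartbeat reply: peer down or unreachable"
    if not slow_ok:
        # Heartbeats still arrive, but the cluster interface has gone quiet.
        return "heartbeats pass but cluster interface is stalled"
    return None
```

In the scenario from the description, the first check would keep passing while the second eventually lapses, so the report could name the cluster interface as the likely culprit instead of leaving the PGs silently stuck in peering.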

Updated by Greg Farnum about 10 years ago

If we have to do heartbeating over the exact same connection we send our other traffic on, is there any advantage to having the separate heartbeat messenger at all? Since it has its own dispatch loop it can obviously be lower-latency, but it's not clear to me that we gain much from multiple low-latency connections if we can't count on them to mean we're functioning.
(i.e., if the cluster network fails we might take 10 seconds to detect it instead of 2, but does that kind of detection speed actually help us in that failure scenario?)

Updated by Sage Weil over 9 years ago

Status changed from New to Resolved
