Feature #3848: osd: gracefully handle cluster network heartbeat failure - Ceph - Ceph

Feature #3848

closed

osd: gracefully handle cluster network heartbeat failure

Added by Ian Colle over 11 years ago. Updated almost 11 years ago.

Status:

Resolved

Priority:

High

Assignee:

Sage Weil

Category:

OSD

Target version:

v0.65

% Done:

Source:

Support

Tags:

Backport:

Reviewed:

Affected Versions:

Pull request ID:

Description

From Kyle Bader

"Back in October we had a switch failure on our cluster (backend) network.
This was not noticed because this network is only utilized when Ceph needs
to do recovery operations. At some point we marked several OSDs out, the
cluster initiated recovery, and the cluster started to error at the RADOS
gateway because the cephstores could not gossip or backfill. Heartbeats sent
to the monitor can detect network partitions on the public network and the
cluster can automatically recover; it would be valuable if Ceph were also
fault tolerant with regard to the cluster network.
It seems like it would be a rather simple implementation: heartbeat
monitoring on the cluster network."

Sage Comment:
We ping using the cluster interface, but communicate with the mon via the
'public' interface, so there is a disconnect. Most recently, we observed a
cluster network error causing failure reports, the mon marking OSDs down, and
the OSDs marking themselves back up (because they could still communicate on
the front side).
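
The disconnect described above could be narrowed by tracking heartbeat replies on both networks and treating a peer as healthy only when both are fresh. The following is a minimal sketch of that idea, not Ceph's actual implementation: `PeerHeartbeat`, `failed_networks`, `is_peer_healthy`, and the `grace` parameter are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeerHeartbeat:
    """Hypothetical record of the last ping reply time (seconds) per network."""
    last_rx_front: float  # last reply seen on the public ("front") network
    last_rx_back: float   # last reply seen on the cluster ("back") network


def failed_networks(hb: PeerHeartbeat, now: float, grace: float) -> List[str]:
    """Return the networks on which the peer has been silent longer than grace."""
    failed = []
    if now - hb.last_rx_front > grace:
        failed.append("front")
    if now - hb.last_rx_back > grace:
        failed.append("back")
    return failed


def is_peer_healthy(hb: PeerHeartbeat, now: float, grace: float) -> bool:
    """A peer counts as healthy only if both networks replied within grace."""
    return not failed_networks(hb, now, grace)


if __name__ == "__main__":
    # The peer still answers on the public network but not on the cluster network.
    hb = PeerHeartbeat(last_rx_front=100.0, last_rx_back=70.0)
    print(failed_networks(hb, now=105.0, grace=20.0))
    print(is_peer_healthy(hb, now=105.0, grace=20.0))
```

With a check like this, an OSD whose cluster network is down would not look healthy just because it can still reach the mon on the front side, which is the up/down flapping described in the comment.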



Updated by Sage Weil over 11 years ago

Updated by Sage Weil over 11 years ago

Updated by Sage Weil over 11 years ago

Updated by Sage Weil about 11 years ago

Updated by Neil Levine about 11 years ago

Updated by Sage Weil about 11 years ago

Updated by Sage Weil about 11 years ago

Updated by Sage Weil almost 11 years ago

Updated by Ian Colle almost 11 years ago

Updated by Sage Weil almost 11 years ago

Updated by Ian Colle almost 11 years ago

Updated by Sage Weil almost 11 years ago

Updated by Sage Weil almost 11 years ago