Feature #3848

closed

osd: gracefully handle cluster network heartbeat failure

Added by Ian Colle over 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: OSD
Target version:
% Done: 0%
Source: Support
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

From Kyle Bader

"Back in October we had a switch failure on our cluster (backend) network.
This was not noticed because this network is only utilized when Ceph needs
to do recovery operations. At some point we marked several OSDs out, the
cluster initiated recovery and the cluster started to error at the RADOS
gateway because the cephstores could not gossip or backfill. Heartbeats sent
to the monitor can detect network partitions on the public network and the
cluster can automatically recover; it would be valuable if Ceph were also
fault tolerant with regard to the cluster network.
It seems like it would be a rather simple implementation: heartbeat
monitoring on the cluster network."

Sage Comment:
We ping using the cluster interface, but communicate with the mon via the
'public' interface, so there is a disconnect. Most recently, we observed a
cluster network error causing failure reports, the mon marking OSDs down, and
the OSDs marking themselves back up (because they could communicate on the
front side).
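
The disconnect described in Sage's comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ceph code: the `Osd` class, the `simulate` function, and the `check_cluster` flag are hypothetical names invented for illustration. It assumes peers report a failure whenever the cluster network is broken, and that a down OSD marks itself back up whenever it can reach the monitor over the public network.

```python
from dataclasses import dataclass


@dataclass
class Osd:
    """Toy OSD (hypothetical model, not a Ceph data structure)."""
    name: str
    public_ok: bool = True    # can reach the monitor on the public network
    cluster_ok: bool = True   # can reach peer OSDs on the cluster network


def simulate(osd: Osd, rounds: int, check_cluster: bool = False) -> list:
    """Return the up/down state the monitor records after each round.

    check_cluster=False models the behaviour in the report: the OSD marks
    itself up based on public-network connectivity alone.
    check_cluster=True models one possible mitigation (an assumption, not
    the actual fix): the OSD stays down until both networks work.
    """
    history = []
    up = True
    for _ in range(rounds):
        if up and not osd.cluster_ok:
            # Peers ping over the cluster network and report the failure,
            # so the monitor marks the OSD down.
            up = False
        elif not up:
            # The OSD decides whether to mark itself back up.
            can_boot = osd.public_ok and (osd.cluster_ok or not check_cluster)
            if can_boot:
                up = True
        history.append("up" if up else "down")
    return history


if __name__ == "__main__":
    broken = Osd("osd.0", public_ok=True, cluster_ok=False)
    print(simulate(broken, 4))
    print(simulate(broken, 4, check_cluster=True))
```

With the cluster network broken and the public network healthy, the first call alternates between "down" and "up", which is the flapping described above. With `check_cluster=True` the OSD stays "down" until the cluster network recovers.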
