Feature #3848

osd: gracefully handle cluster network heartbeat failure

Added by Ian Colle over 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: OSD
Target version:
% Done: 0%
Source: Support
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

From Kyle Bader

"Back in October we had a switch failure on our cluster (backend) network.
This was not noticed because this network is only utilized when Ceph needs
to do recovery operations. At some point we marked several OSDs out, the
cluster initiated recovery and the cluster started to error at the RADOS
gateway because the cephstores could not gossip or backfill. Heartbeats sent
to the monitor can detect network partitions on the public network and the
cluster can automatically recover, it would be valuable if Ceph was also
fault tolerant with regard to the cluster network."
It seems like it would be a rather simple implementation, heartbeat
monitoring on the cluster network."

Sage's comment:
We ping using the cluster interface, but communicate with the mon via the
'public' interface, so there is a disconnect. Most recently, we observed a
cluster network error causing failure reports, the mon marking OSDs down,
and the OSDs marking themselves back up (because they could still
communicate on the front side).
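
For context, the front/back split described above comes from configuring separate public and cluster networks in ceph.conf. A minimal illustration (the option names are real, the addresses are purely illustrative):

    [global]
        public network  = 192.168.1.0/24   # monitor and client traffic
        cluster network = 10.0.0.0/24      # replication, backfill, and OSD-to-OSD heartbeats

With this layout the mon only ever talks to an OSD over the public network, so a dead cluster network does not stop the OSD from reporting itself up.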

Actions #1

Updated by Sage Weil over 11 years ago

  • Category set to OSD
Actions #2

Updated by Sage Weil about 11 years ago

One option: do not mark ourselves back up (after being wrongly marked down) unless we are able to successfully ping at least one of our peers.

If the cluster net is completely broken, this will avoid flapping.

However, if, say, our top-of-rack switch lost its uplink, we may be able to ping peers in our rack but not outside of it, in which case that won't work. Maybe we don't mark ourselves up unless we can ping >50% of our peers? Or some configurable threshold?

If the peers are in fact down, others will fail them. The weakness here is that our peer list is by definition stale (from when we were last up). We can supplement by replacing failed peers with other up osds at random.

If we can't ping anyone and so haven't marked ourselves back up after some configurable period, we could suicide, or at least report the problem to the monitor.
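
As a rough illustration of the re-mark-up gating proposed above, here is a minimal, self-contained sketch. The names (HeartbeatPeer, healthy_enough_to_boot, min_healthy_ratio) are hypothetical and this is not the actual OSD code; it only shows the "enough peers reachable before asking the mon to mark us up" idea.

    // Illustrative sketch of comment #2's proposal; not Ceph source code.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct HeartbeatPeer {
      int osd_id;
      bool replied_recently;  // got a heartbeat reply over the cluster network
    };

    // An OSD that was wrongly marked down should only ask the monitor to
    // mark it back up if enough of its heartbeat peers are reachable,
    // e.g. min_healthy_ratio = 0.5 for the ">50% of peers" variant.
    bool healthy_enough_to_boot(const std::vector<HeartbeatPeer>& peers,
                                double min_healthy_ratio) {
      if (peers.empty())
        return false;  // no evidence the cluster network works at all
      std::size_t healthy = 0;
      for (const auto& p : peers)
        if (p.replied_recently)
          ++healthy;
      return static_cast<double>(healthy) / peers.size() >= min_healthy_ratio;
    }

    int main() {
      // 2 of 3 peers reachable: with a 0.5 threshold we would re-mark up.
      std::vector<HeartbeatPeer> peers = {{1, true}, {2, false}, {3, true}};
      std::cout << std::boolalpha << healthy_enough_to_boot(peers, 0.5) << "\n";
      return 0;
    }

A separate timer would cover the last paragraph: if this check keeps failing for some configurable period, the daemon could suicide or send a warning to the monitor instead of silently staying down.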

Actions #3

Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to High
Actions #4

Updated by Sage Weil about 11 years ago

  • Tracker changed from Bug to Feature
Actions #5

Updated by Neil Levine about 11 years ago

  • Status changed from New to 12
Actions #6

Updated by Sage Weil about 11 years ago

  • Subject changed from Ping using Cluster Interface, but comms with mon via public interface to osd: gracefully handle cluster network heartbeat failure
Actions #7

Updated by Sage Weil about 11 years ago

  • Story points set to 5.00
Actions #8

Updated by Sage Weil almost 11 years ago

  • Target version set to v0.64
Actions #9

Updated by Ian Colle almost 11 years ago

  • Assignee set to Sage Weil
Actions #10

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to In Progress
Actions #11

Updated by Ian Colle almost 11 years ago

  • Target version changed from v0.64 to v0.65
Actions #12

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Fix Under Review
Actions #13

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved