Feature #3848

osd: gracefully handle cluster network heartbeat failure

Added by Ian Colle almost 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category: OSD
Target version:
% Done: 0%
Source: Support
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

From Kyle Bader:

"Back in October we had a switch failure on our cluster (backend) network.
This was not noticed because this network is only utilized when Ceph needs
to do recovery operations. At some point we marked several OSDs out, the
cluster initiated recovery, and the cluster started to error at the RADOS
gateway because the cephstores could not gossip or backfill. Heartbeats sent
to the monitor can detect network partitions on the public network and the
cluster can automatically recover; it would be valuable if Ceph was also
fault tolerant with regard to the cluster network. It seems like it would
be a rather simple implementation: heartbeat monitoring on the cluster
network."

Sage's comment:
We ping using the cluster interface, but communicate with the mon via the
'public' interface, so there is a disconnect. Most recently, we observed a
cluster network error causing failure reports, the mon marking OSDs down,
and the OSDs marking themselves back up (because they could still
communicate on the front side).
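
For context, this split comes from the usual two-network layout in
ceph.conf; the subnets below are illustrative, not taken from this report:

    [global]
        public network  = 10.0.0.0/24    # mon traffic and client I/O
        cluster network = 10.0.1.0/24    # OSD heartbeats, replication, backfill

OSD-to-OSD heartbeats travel on the cluster network, while failure reports
and up/down state go to the mon over the public network, so a failure that
affects only the cluster network leaves the two views inconsistent.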

History

#1 Updated by Sage Weil almost 7 years ago

  • Category set to OSD

#2 Updated by Sage Weil almost 7 years ago

One option: do not mark ourselves back up (after being wrongly marked down) unless we are able to successfully ping at least one of our peers.

If the cluster net is completely broken, this will avoid flapping.

However, if, say, our top-of-rack switch lost its uplink, we may be able to ping peers in our rack but not outside of it, in which case that won't work. Maybe we don't mark ourselves up unless we can ping >50% of our peers? Or some configurable threshold?

If the peers are in fact down, others will fail them. The weakness here is that our peer list is by definition stale (from when we were last up). We can supplement by replacing failed peers with other up osds at random.

If we can't ping anyone and thus can't mark ourselves back up after some configurable period, we could then suicide, or at least log the problem to the monitor.
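
A minimal C++ sketch of the threshold check described above, for
illustration only; the type, field, and function names here (PeerInfo,
back_reply_received, may_mark_up, min_ratio) are hypothetical, not actual
Ceph symbols:

    #include <cstddef>
    #include <map>

    // Hypothetical per-peer heartbeat state (not a Ceph type): records
    // whether we got a ping reply on the cluster (back) network.
    struct PeerInfo {
      bool back_reply_received = false;
    };

    // Decide whether to mark ourselves back up after being wrongly marked
    // down: require replies from more than min_ratio of our (possibly
    // stale) peer list. min_ratio = 0.0 reduces to "at least one peer";
    // 0.5 is the ">50% of peers" variant discussed above.
    bool may_mark_up(const std::map<int, PeerInfo>& peers,
                     double min_ratio = 0.5) {
      if (peers.empty())
        return false;  // nobody to ping; staying down avoids flapping
      std::size_t reachable = 0;
      for (const auto& [osd, info] : peers)
        if (info.back_reply_received)
          ++reachable;
      return reachable > min_ratio * peers.size();
    }

Because the peer list is stale by definition, a real implementation would
also have to refresh it (e.g. by substituting randomly chosen up OSDs for
failed peers) before trusting the ratio, and fall back to suicide or a
report to the monitor once the configurable timeout expires.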

#3 Updated by Sage Weil almost 7 years ago

  • Priority changed from Normal to High

#4 Updated by Sage Weil over 6 years ago

  • Tracker changed from Bug to Feature

#5 Updated by Neil Levine over 6 years ago

  • Status changed from New to 12

#6 Updated by Sage Weil over 6 years ago

  • Subject changed from Ping using Cluster Interface, but comms with mon via public interface to osd: gracefully handle cluster network heartbeat failure

#7 Updated by Sage Weil over 6 years ago

  • Story points set to 5.00

#8 Updated by Sage Weil over 6 years ago

  • Target version set to v0.64

#9 Updated by Ian Colle over 6 years ago

  • Assignee set to Sage Weil

#10 Updated by Sage Weil over 6 years ago

  • Status changed from 12 to In Progress

#11 Updated by Ian Colle over 6 years ago

  • Target version changed from v0.64 to v0.65

#12 Updated by Sage Weil over 6 years ago

  • Status changed from In Progress to Fix Under Review

#13 Updated by Sage Weil over 6 years ago

  • Status changed from Fix Under Review to Resolved
