Feature #3848

osd: gracefully handle cluster network heartbeat failure

Added by Ian Colle over 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: OSD
Target version:
% Done: 0%
Source: Support
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

From Kyle Bader

"Back in October we had a switch failure on our cluster (backend) network.
This was not noticed because this network is only utilized when Ceph needs
to do recovery operations. At some point we marked several OSDs out, the
cluster initiated recovery and the cluster started to error at the RADOS
gateway because the cephstores could not gossip or backfill. Heartbeats sent
to the monitor can detect network partitions on the public network and the
cluster can automatically recover, it would be valuable if Ceph was also
fault tolerant with regard to the cluster network."
It seems like it would be a rather simple implementation, heartbeat
monitoring on the cluster network."

Sage's comment:
We ping using the cluster interface, but communicate with the mon via the
'public' interface, so there is a disconnect. Most recently, we observed a
cluster network error causing failure reports, the mon marking OSDs down,
and the OSDs marking themselves back up (because they could still
communicate on the front side).
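
For context, the front/back split described above comes from configuring separate public and cluster networks in ceph.conf. A minimal illustration (the option names are real, the addresses are purely illustrative):

    [global]
        public network  = 192.168.1.0/24   # monitor and client traffic
        cluster network = 10.0.0.0/24      # replication, backfill, and OSD-to-OSD heartbeats

With this layout the mon only ever talks to an OSD over the public network, so a dead cluster network does not stop the OSD from reporting itself up.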

Actions #1

Updated by Sage Weil over 11 years ago

  • Category set to OSD
Actions #2

Updated by Sage Weil about 11 years ago

One option: do not mark ourselves back up (after being wrongly marked down) unless we are able to successfully ping at least one of our peers.

If the cluster net is completely broken, this will avoid flapping.

However, if, say, our top-of-rack switch lost its uplink, we may be able to ping peers in our rack but not outside of it, in which case that won't work. Maybe we don't mark ourselves up unless we can ping >50% of our peers? Or some configurable threshold?

If the peers are in fact down, others will fail them. The weakness here is that our peer list is by definition stale (from when we were last up). We can supplement by replacing failed peers with other up osds at random.

If we can't ping anyone and so haven't marked ourselves back up after some configurable period, we could suicide, or at least report the problem to the monitor.
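
As a rough illustration of the re-mark-up gating proposed above, here is a minimal, self-contained sketch. The names (HeartbeatPeer, healthy_enough_to_boot, min_healthy_ratio) are hypothetical and this is not the actual OSD code; it only shows the "enough peers reachable before asking the mon to mark us up" idea.

    // Illustrative sketch of comment #2's proposal; not Ceph source code.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct HeartbeatPeer {
      int osd_id;
      bool replied_recently;  // got a heartbeat reply over the cluster network
    };

    // An OSD that was wrongly marked down should only ask the monitor to
    // mark it back up if enough of its heartbeat peers are reachable,
    // e.g. min_healthy_ratio = 0.5 for the ">50% of peers" variant.
    bool healthy_enough_to_boot(const std::vector<HeartbeatPeer>& peers,
                                double min_healthy_ratio) {
      if (peers.empty())
        return false;  // no evidence the cluster network works at all
      std::size_t healthy = 0;
      for (const auto& p : peers)
        if (p.replied_recently)
          ++healthy;
      return static_cast<double>(healthy) / peers.size() >= min_healthy_ratio;
    }

    int main() {
      // 2 of 3 peers reachable: with a 0.5 threshold we would re-mark up.
      std::vector<HeartbeatPeer> peers = {{1, true}, {2, false}, {3, true}};
      std::cout << std::boolalpha << healthy_enough_to_boot(peers, 0.5) << "\n";
      return 0;
    }

A separate timer would cover the last paragraph: if this check keeps failing for some configurable period, the daemon could suicide or send a warning to the monitor instead of silently staying down.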

Actions #3

Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to High
Actions #4

Updated by Sage Weil about 11 years ago

  • Tracker changed from Bug to Feature
Actions #5

Updated by Neil Levine about 11 years ago

  • Status changed from New to 12
Actions #6

Updated by Sage Weil about 11 years ago

  • Subject changed from Ping using Cluster Interface, but comms with mon via public interface to osd: gracefully handle cluster network heartbeat failure
Actions #7

Updated by Sage Weil about 11 years ago

  • Story points set to 5.00
Actions #8

Updated by Sage Weil almost 11 years ago

  • Target version set to v0.64
Actions #9

Updated by Ian Colle almost 11 years ago

  • Assignee set to Sage Weil
Actions #10

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to In Progress
Actions #11

Updated by Ian Colle almost 11 years ago

  • Target version changed from v0.64 to v0.65
Actions #12

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Fix Under Review
Actions #13

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved