osd: gracefully handle cluster network heartbeat failure
From Kyle Bader
"Back in October we had a switch failure on our cluster (backend) network.
This was not noticed because this network is only utilized when Ceph needs
to do recovery operations. At some point we marked several OSDs out, the
cluster initiated recovery and the cluster started to error at the RADOS
gateway because the cephstores could not gossip or backfill. Heartbeats sent
to the monitor can detect network partitions on the public network and the
cluster can automatically recover, it would be valuable if Ceph was also
fault tolerant with regard to the cluster network.
It seems like it would be a rather simple implementation, heartbeat
monitoring on the cluster network."
We ping using the cluster
interface, but communicate with the mon via the 'public' interface, so
there is a disconnect. Most recently, we observed a cluster network error
causing failure reports, the mon marking osds down, and the osds marking
themselves back up (because they could communicate on the front side).
#2 Updated by Sage Weil over 7 years ago
One option: do not mark ourselves back up (after being wrongly marked down) unless we are able to successfully ping at least one of our peers.
If the cluster net is completely broken, this will avoid flapping.
However, if, say, our top-of-rack switch lost its uplink, we may be able to ping peers in our rack but not outside of it, in which case that won't work. Maybe we don't mark ourselves up unless we can ping >50% of our peers? Or some configurable threshold?
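A minimal sketch of the threshold idea, not actual Ceph code. The function name `may_mark_self_up` and the parameter `min_up_ratio` are hypothetical, and the ping results are assumed to have been collected already over the cluster interface. Setting the threshold to 0 reduces to the "at least one peer" option above.

```python
def may_mark_self_up(ping_results, min_up_ratio=0.5):
    """Decide whether an osd that was marked down may mark itself back up.

    ping_results maps a peer osd id to True if that peer answered a
    heartbeat ping on the cluster network. min_up_ratio is a hypothetical
    configurable threshold, not an existing Ceph option.
    """
    if not ping_results:
        # No peers to test against: nothing proves the cluster net works.
        return False
    reachable = sum(1 for ok in ping_results.values() if ok)
    # Strictly greater than the threshold, matching ">50% of our peers".
    return reachable / len(ping_results) > min_up_ratio
```

With the default threshold, an osd that can reach only the peers in its own rack stays down if those peers make up half or less of its peer list.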
If the peers are in fact down, others will fail them. The weakness here is that our peer list is by definition stale (from when we were last up). We can supplement by replacing failed peers with other up osds at random.
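The peer replacement could look roughly like the following sketch. Everything here is illustrative: `refresh_peers` and its arguments are hypothetical names, and the set of up osds is assumed to come from the most recent osdmap the daemon has seen.

```python
import random

def refresh_peers(peers, failed, up_osds, self_id, rng=random):
    """Return a peer list with failed peers swapped for random up osds.

    peers:   current (possibly stale) list of heartbeat peer ids
    failed:  set of peer ids that did not answer pings
    up_osds: ids of osds believed to be up (assumed: from the latest map)
    rng:     random source; pass random.Random(seed) for repeatable picks
    """
    kept = [p for p in peers if p not in failed]
    # Candidates: up osds that are not us, not already peers, not failed.
    candidates = [o for o in sorted(up_osds)
                  if o != self_id and o not in kept and o not in failed]
    # Replace as many failed peers as we have candidates for.
    n_new = min(len(peers) - len(kept), len(candidates))
    return kept + rng.sample(candidates, n_new)
```

If there are fewer candidates than failed peers, the list simply shrinks, which is one possible choice; a real implementation might prefer to retry later.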
If we can't ping anyone, and so can't mark ourselves up, after some configurable period we could then suicide, or at least log the problem to the monitor.
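The timeout behaviour can be sketched as a small decision function. This is an assumption-laden illustration: `Action`, `next_action`, and `grace_seconds` are hypothetical names, and `GIVE_UP` stands in for whichever of "suicide" or "log to the monitor" is chosen.

```python
from enum import Enum

class Action(Enum):
    MARK_UP = "mark_up"   # peers reachable: safe to mark ourselves up
    WAIT = "wait"         # peers unreachable, still within the grace period
    GIVE_UP = "give_up"   # grace period expired: exit or log to the monitor

def next_action(can_reach_peers, down_since, now, grace_seconds=600.0):
    """Pick the next step for an osd that was marked down.

    can_reach_peers: result of the cluster-network ping check
    down_since, now: timestamps in seconds (e.g. from time.monotonic())
    grace_seconds:   hypothetical configurable period, default is arbitrary
    """
    if can_reach_peers:
        return Action.MARK_UP
    if now - down_since >= grace_seconds:
        return Action.GIVE_UP
    return Action.WAIT
```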