Feature #22260: osd: recover after network outages - RADOS - Ceph

Feature #22260

open

osd: recover after network outages

Added by Anonymous over 6 years ago. Updated over 6 years ago.

Status: New

Priority: Normal

Component(RADOS): OSD

Description

We've run into a situation where, after one or more 802.3ad (LACP) enabled switches were rebooted, some OSDs failed to recover connections to their peers and flooded the logs with "heartbeat_check: no reply..." messages. I will update the description with more details, but the ultimate intent of this ticket is to look at ways we might recover OSD connectivity after the many kinds of network failure that can occur in a running cluster.

Updated by Shinobu Kinjo over 6 years ago

Do you want to avoid OSD suicide because of timeout?

Updated by Joao Eduardo Luis over 6 years ago

Subject changed from osd heartbeats not recovering after network outage to osd: recover after network outages
Assignee set to Anonymous
Component(RADOS) OSD added

The behavior Karol looked into would have benefited from a suicide on timeout. Instead, what he saw was OSDs getting stuck when one of the links of the bond went down, and a restart was required to get them working again.

This feature may also turn into a bug chase if the behavior was partly caused by a bug. Still, Karol's objective with this feature is to identify the situations in which OSDs suffer from connectivity issues, and to have the OSDs attempt to recover from them where possible, or at least log enough information for the admin to debug the problem.

The heartbeat suicide timeout is actually a good example. Why would the OSD suicide? If it can't reach the other OSDs or the monitors, the appropriate thing to do would be to stay up and wait for communication to be restored, no? :)
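
To make the "stay up and wait" idea concrete, here is a rough sketch of one possible policy. It is not Ceph code and does not reflect how the OSD heartbeat logic is implemented. The names `HeartbeatTracker`, `PeerState`, `grace` and `max_backoff` are all hypothetical and exist only for illustration. The sketch assumes that a peer which has been silent for longer than a grace period is marked unreachable, and that reconnect attempts are then scheduled with exponential backoff rather than the process exiting.

```python
from dataclasses import dataclass


@dataclass
class PeerState:
    """Hypothetical per-peer bookkeeping; not a Ceph data structure."""
    last_reply: float          # time of the last heartbeat reply
    next_retry: float = 0.0    # earliest time of the next reconnect attempt
    backoff: float = 1.0       # current delay between reconnect attempts
    reachable: bool = True


class HeartbeatTracker:
    """Illustrative policy: never exit on timeout, retry with backoff."""

    def __init__(self, grace: float = 20.0, max_backoff: float = 60.0):
        # Both values are assumptions chosen for the example.
        self.grace = grace
        self.max_backoff = max_backoff
        self.peers: dict[str, PeerState] = {}

    def add_peer(self, name: str, now: float) -> None:
        self.peers[name] = PeerState(last_reply=now)

    def on_reply(self, name: str, now: float) -> None:
        # A reply restores the peer and resets the backoff schedule.
        peer = self.peers[name]
        peer.last_reply = now
        peer.reachable = True
        peer.backoff = 1.0
        peer.next_retry = 0.0

    def tick(self, now: float) -> list[str]:
        """Return the peers that should get a reconnect attempt now."""
        to_retry = []
        for name, peer in self.peers.items():
            if now - peer.last_reply <= self.grace:
                continue  # peer is still within the grace period
            peer.reachable = False
            if now >= peer.next_retry:
                to_retry.append(name)
                peer.next_retry = now + peer.backoff
                # Double the delay up to a cap so a long outage does not
                # produce a flood of attempts (or of log messages).
                peer.backoff = min(peer.backoff * 2, self.max_backoff)
        return to_retry
```

A caller would invoke `tick()` periodically and open a fresh connection to each returned peer. Logging once per state change (reachable to unreachable and back) rather than once per missed heartbeat would also address the log flooding mentioned in the description.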

Updated by Anonymous over 6 years ago

Thanks Joao for fielding Shinobu's question.
