Feature #22260

open

osd: recover after network outages

Added by Anonymous over 6 years ago. Updated over 6 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
OSD
Pull request ID:

Description

We ran into a situation where, after one or more 802.3ad/LACP-enabled switches were rebooted, some OSDs failed to re-establish connections to their peers and flooded the logs with "heartbeat_check: no reply..." messages. I will update the description with more details, but the ultimate intent of this ticket is to look at ways we might recover OSD connectivity after the many kinds of network failure possible in a running cluster.

Actions #1

Updated by Shinobu Kinjo over 6 years ago

Do you want to avoid OSD suicide because of timeout?

Actions #2

Updated by Joao Eduardo Luis over 6 years ago

  • Subject changed from osd heartbeats not recovering after network outage to osd: recover after network outages
  • Assignee set to Anonymous
  • Component(RADOS) OSD added

The behavior Karol looked into would actually have benefited from a suicide on timeout. Instead, what he saw was OSDs getting stuck when one of the links of the bond went down, and a restart was required to get them working again.

This feature may also end up being a bug chase, if this was in part a bug as well, but Karol's objective with this feature is to identify the situations in which OSDs suffer from connectivity issues, and to have them attempt to recover where possible, or otherwise provide information in the logs that allows the admin to debug the problem.

The heartbeat suicide timeout is, actually, a good example. Why would the OSD suicide? If it can't reach the other OSDs, or the monitors, the appropriate thing to do would be to stay up waiting for communication to be restored; no? :)
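
As a rough illustration of the "stay up and wait" alternative, here is a minimal Python sketch of a per-peer check that schedules a reconnect with capped exponential backoff instead of exiting. This is not how the OSD is implemented: `HeartbeatPeer`, `retry_delay`, `check_peer`, and the `grace`, `base`, and `cap` values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatPeer:
    """Hypothetical per-peer heartbeat state; not a Ceph data structure."""
    last_reply: float          # time of the last heartbeat reply, in seconds
    failed_attempts: int = 0   # consecutive checks without a reply


def retry_delay(failed_attempts: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff, capped so a long outage still gets periodic retries."""
    return min(cap, base * (2 ** failed_attempts))


def check_peer(peer: HeartbeatPeer, now: float, grace: float = 20.0) -> str:
    """Return the action to take for this peer instead of exiting the daemon."""
    silent_for = now - peer.last_reply
    if silent_for <= grace:
        # Peer replied recently: reset the backoff counter.
        peer.failed_attempts = 0
        return "healthy"
    # No reply within the (illustrative) grace period: stay up, report how
    # long the peer has been silent, and schedule the next reconnect attempt.
    delay = retry_delay(peer.failed_attempts)
    peer.failed_attempts += 1
    return f"reconnect in {delay:.0f}s (no reply for {silent_for:.0f}s)"
```

The returned string stands in for a log line: the point is that each failed check both retries and records how long the peer has been silent, which covers the "recover if possible, otherwise leave something debuggable in the logs" goal above.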

Actions #3

Updated by Anonymous over 6 years ago

Thanks Joao for fielding Shinobu's question.
