osd: recover after network outages
We've run into a situation where, after one or more 802.3ad/LACP-enabled switches were rebooted, some OSDs failed to re-establish connections to their peers and flooded the logs with "heartbeat_check: no reply..." messages. I will update the description with more details, but the ultimate intent of this ticket is to look at ways we might recover OSD connectivity after the many kinds of network failure that can occur in a running cluster.
#2 Updated by Joao Eduardo Luis about 2 years ago
- Subject changed from osd heartbeats not recovering after network outage to osd: recover after network outages
- Assignee set to Anonymous
- Component(RADOS) OSD added
The behavior Karol looked into would have benefited from a suicide on timeout. Instead, what he saw was OSDs getting stuck when one of the links of the bond went down, and a restart was required to get them working again.
This feature may also end up being a bug chase, if this was in part a bug as well, but Karol's objective with this feature is to identify possible situations in which the OSDs will suffer from connectivity issues, and to have them attempt to recover if possible, or else provide information in the logs that will allow the admin to debug it.
The heartbeat suicide timeout is actually a good example. Why would the OSD suicide? If it can't reach the other OSDs or the monitors, wouldn't the appropriate thing be to stay up and wait for communication to be restored? :)
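To make the "stay up and wait" option a bit more concrete, here is a rough sketch of what it could look like: peers that have gone quiet past a grace period are retried with exponential backoff, rather than the daemon exiting. This is only an illustration and not Ceph code. HeartbeatMonitor, PeerState, the method names, and the grace and backoff values are all hypothetical, and timestamps are passed in explicitly so the logic can be exercised without a real network.

```python
from dataclasses import dataclass, field


@dataclass
class PeerState:
    last_reply: float          # time of the last heartbeat reply from this peer
    next_retry: float = 0.0    # earliest time we may attempt another reconnect
    backoff: float = 1.0       # current delay between reconnect attempts


@dataclass
class HeartbeatMonitor:
    # All values below are illustrative assumptions, not Ceph defaults.
    grace: float = 20.0        # seconds without a reply before a peer is stale
    max_backoff: float = 60.0  # upper bound on the delay between retries
    peers: dict = field(default_factory=dict)

    def on_reply(self, peer, now):
        """Record a heartbeat reply; this also resets the peer's backoff."""
        self.peers[peer] = PeerState(last_reply=now)

    def check(self, now):
        """Return the stale peers that should be reconnected to right now.

        A stale peer is never treated as fatal: it is retried, and each
        attempt doubles the wait before the next one, up to max_backoff.
        """
        to_reconnect = []
        for peer, st in self.peers.items():
            if now - st.last_reply <= self.grace:
                continue  # peer is still healthy
            if now >= st.next_retry:
                to_reconnect.append(peer)
                st.next_retry = now + st.backoff
                st.backoff = min(st.backoff * 2, self.max_backoff)
        return to_reconnect


if __name__ == "__main__":
    mon = HeartbeatMonitor()
    mon.on_reply("osd.1", now=0.0)
    print(mon.check(now=25.0))  # osd.1 is past the grace period
```

The caller would log each returned peer and tear down and reopen its connection, which would also give the admin the debugging trail asked for above. Whether a hard suicide should remain as a last resort after some much longer timeout is left open here.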