Bug #24368

closed

osd: should not restart on permanent failures

Added by Greg Farnum almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Administration/Usability
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Last week at OpenStack I heard a few users report that OSDs were not failing hard and fast on disk issues, as they should. For some of them, there were definitely multiple causes. But one of the easy ones is that systemd (especially as we configure it) tries to keep services running, so when an OSD crashes it gets restarted and tries to rejoin the cluster.

There are two different approaches to take here:
1) Modify how frequently systemd can restart the service, by changing the StartLimitInterval and StartLimitBurst values.
2) Modify the cases in which systemd restarts the service. systemd distinguishes several ways a process can exit ("Clean exit code or signal", "Unclean exit code", "Unclean signal", "Timeout", "Watchdog"), and each of the 6 options for when to restart on exit covers a different combination of them. You can ALSO specify that the service shouldn't restart on specific return values or signals. I'm not sure if our exit statuses are distinct enough for that to be useful right now, but we can definitely get there!
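
As a rough illustration of the two approaches, a systemd drop-in for the OSD unit might look something like the sketch below. The file path, the limit values, and the exit status 70 are all assumptions made up for this example. They are not Ceph's shipped settings, and ceph-osd would first need to return a dedicated status on permanent failures for the last line to be useful.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/ceph-osd@.service.d/override.conf
# All values are illustrative placeholders, not Ceph's shipped defaults.
[Service]
# Approach 1: allow at most 3 starts within 30 minutes, then give up.
# Newer systemd versions spell the interval StartLimitIntervalSec= in [Unit].
StartLimitInterval=30min
StartLimitBurst=3

# Approach 2: restart only on unclean exits (non-zero exit code, unclean
# signal, timeout, watchdog), not on a clean exit ...
Restart=on-failure
# ... and never when the daemon exits with this status.
# 70 is a made-up value standing in for "permanent failure".
RestartPreventExitStatus=70
```

After editing a drop-in like this, `systemctl daemon-reload` is needed before the new settings take effect.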
