Bug #24368
closed
osd: should not restart on permanent failures
Added by Greg Farnum almost 6 years ago.
Updated over 5 years ago.
Category: Administration/Usability
Description
Last week at OpenStack I heard a few users report OSDs were not failing hard and fast as they should be on disk issues. For some of them, there were definitely multiple causes. But one of the easy ones is that systemd (especially as we configure it) tries to keep services running, so when an OSD crashes it gets restarted and tries to rejoin the cluster.
There are two different approaches to take here:
1) Modify how frequently systemd can restart the service (by changing the StartLimitInterval and StartLimitBurst values).
2) Modify in what cases systemd restarts the service. It turns out systemd distinguishes several ways a process can exit ("clean exit code or signal", "unclean exit code", "unclean signal", "timeout", "watchdog"), and its six Restart= policies treat each of those differently. You can also specify that the service shouldn't restart on specific return values or signals. I'm not sure if our exit statuses are distinct enough for that to be useful right now, but we can definitely get there!
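As a rough sketch of how the two approaches above would look, here is a hypothetical systemd drop-in for ceph-osd@.service. The values are illustrative only, not Ceph's shipped defaults, and the directive placement follows newer systemd (older releases used StartLimitInterval under [Service] instead of StartLimitIntervalSec under [Unit]):

```ini
# /etc/systemd/system/ceph-osd@.service.d/restart-policy.conf
# Illustrative values only -- not Ceph's defaults.

[Unit]
# Approach 1: allow at most 3 restarts within a 30-minute window;
# once the limit is hit, systemd leaves the unit in the failed state
# instead of restarting it indefinitely.
StartLimitIntervalSec=30min
StartLimitBurst=3

[Service]
# Approach 2: restart only on unclean exits, and never restart on a
# specific exit status (2 here, chosen arbitrarily) that the daemon
# could reserve for permanent failures such as a bad disk.
Restart=on-failure
RestartPreventExitStatus=2
```

The second approach only helps if the daemon's exit statuses actually distinguish transient crashes from permanent failures, which is exactly the open question raised above.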
- Status changed from New to In Progress
- Assignee set to Greg Farnum
- Backport set to mimic, luminous
Sounds like something that would be useful in our stable releases - Greg, do you agree?
It would, but the previous settings were there for a reason, so I'm not sure whether it's feasible to backport this for ceph-disk users, or whether they'd hit the startup race discussed in that PR commit.
Although maybe hitting a startup race is better than OSDs taking forever to get kicked out of the cluster, anyway?
Planning to investigate it more. :)
I've had a similar problem recently. One OSD crashed and exited, and systemd restarted the OSD process quickly. This caused the OSD to flap up and down, many PGs stayed stuck in the peering state, and client I/O dropped sharply.
I found that this was caused by the systemd service, and that the StartLimitInterval and StartLimitBurst options limit the number of service restarts. From the OSD log I counted the time between service startup and crash exit; in my scenario it restarted about twice a minute, so I set StartLimitInterval to 2 minutes and StartLimitBurst to 2.
I also want to know how the Ceph community solves this issue.
guotao Yao wrote:
> I've had a similar problem recently. One OSD crashed and exited, and systemd restarted the OSD process quickly. This caused the OSD to flap up and down, many PGs stayed stuck in the peering state, and client I/O dropped sharply.
> I found that this was caused by the systemd service, and that the StartLimitInterval and StartLimitBurst options limit the number of service restarts. From the OSD log I counted the time between service startup and crash exit; in my scenario it restarted about twice a minute, so I set StartLimitInterval to 2 minutes and StartLimitBurst to 2.
> I also want to know how the Ceph community solves this issue.
In addition, I didn't change the RestartSec parameter; I kept it at 20 seconds.
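For reference, the settings described in the comment above would look roughly like the following drop-in (the file path is hypothetical, and on the systemd versions of that era the start-limit directives were accepted under [Service]):

```ini
# /etc/systemd/system/ceph-osd@.service.d/override.conf
[Service]
# The OSD was crash-looping about twice a minute,
# so cap restarts at 2 per 2-minute window.
StartLimitInterval=2min
StartLimitBurst=2
# RestartSec left at its existing value of 20 seconds.
RestartSec=20s
```

After adding such a drop-in, a `systemctl daemon-reload` is needed for systemd to pick up the change.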
I don't think the issue has moved beyond the PR linked above to change the systemd settings. I sent this out to one or two large users and was hoping to get some reports back on how it worked before doing any backports.
I looked briefly at whether we could easily tell systemd how to behave by tweaking our exit codes or signals, but it didn't seem like a short project, so I let it fall. :(
From a user:
There is some class of OSDs out there (all FileStore, IIRC) that are ultra slow to start at boot time on Luminous.
So on those machines I've been applying this change to ceph-osd@.service:
- RestartSec=20s
+ RestartSec=1s
And it fixes everything :)
- Status changed from In Progress to Resolved
Okay, after discussing with CERN I've merged the PR to master so this isn't an issue going forward.
But unfortunately I think we're just going to have to live with it on existing installs, as the ceph-disk races remain common and require so many restarts as to render this tuning pretty broken. :/
- Backport deleted (mimic, luminous)
Clearing backport field on the assumption that's what was intended by the previous edit.