Bug #24368: osd: should not restart on permanent failures - RADOS - Ceph

Added by Greg Farnum almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee: Greg Farnum
Category: Administration/Usability
Source: Development
Severity: 3 - minor
Component(RADOS): OSD

Description

Last week at OpenStack I heard a few users report that OSDs were not failing hard and fast, as they should on disk issues. For some of them, there were definitely multiple causes. But one of the easy ones is that systemd (especially as we configure it) tries to keep services running, so when an OSD crashes it gets restarted and tries to rejoin the cluster.

There are two different approaches to take here:
1) Modify how frequently systemd can restart the service (by changing the StartLimitInterval and StartLimitBurst values).
2) Modify in which cases systemd restarts the service. It turns out you can configure different behavior for the different ways a process can exit (in systemd, "Clean exit code or signal", "Unclean exit code", "Unclean signal", "Timeout", and "Watchdog" are handled differently by the 6 options for when to restart on exit), AND you can specify that the service shouldn't restart on specific return values or signals. I'm not sure if our exit statuses are distinct enough for that to be useful right now, but we can definitely get there!
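
As a rough sketch of the two approaches, a systemd drop-in for the OSD unit might look like the following. The drop-in path, the numeric values, and the exit status 78 are all illustrative assumptions, not Ceph's shipped settings, and as noted above it is unclear whether ceph-osd currently returns exit statuses distinct enough for RestartPreventExitStatus= to be useful. Newer systemd versions spell the interval setting StartLimitIntervalSec= and place it in the [Unit] section.

```ini
# Hypothetical drop-in: /etc/systemd/system/ceph-osd@.service.d/restart-policy.conf
# Every value below is an illustrative assumption, not a Ceph default.
[Service]
# Approach 1: cap how often systemd may (re)start the unit.
# Here: at most 3 starts within any 30-minute window; once the limit
# is hit, systemd stops restarting the service.
StartLimitInterval=30min
StartLimitBurst=3

# Approach 2: choose which kinds of exit trigger a restart at all.
# on-failure restarts after unclean exit codes, unclean signals,
# timeouts, and watchdog events, but not after a clean exit.
Restart=on-failure
# Never restart when the daemon exits with this status.
# 78 is a placeholder for a hypothetical "permanent failure" status.
RestartPreventExitStatus=78
```

After editing a drop-in, `systemctl daemon-reload` makes systemd pick up the change.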

Updated by Greg Farnum almost 6 years ago

See https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart= for the details on Restart options.

Updated by Greg Farnum almost 6 years ago

Status changed from New to In Progress
Assignee set to Greg Farnum

https://github.com/ceph/ceph/pull/22349 has the simple restart interval change. Will investigate the options for conditionally limiting restarts.

Updated by Nathan Cutler almost 6 years ago

Backport set to mimic, luminous

Sounds like something that would be useful in our stable releases - Greg, do you agree?

Updated by Greg Farnum almost 6 years ago

It would, but the previous settings were there for a reason so I'm not sure if it's feasible to backport this for ceph-disk users, or if they'd hit the startup race discussed in that PR commit.

Although maybe hitting a startup race is better than OSDs taking forever to get kicked out of the cluster, anyway?
Still planning; I need to investigate it more. :)

Updated by guotao Yao almost 6 years ago

I've had a similar problem recently. One OSD crashed and exited, and systemd quickly restarted the OSD process. This caused the OSD to flap up and down. Many PGs were stuck in the peering state, and client I/O dropped sharply.

I found that this was caused by the systemd service, and that the options StartLimitInterval and StartLimitBurst limit the number of service restarts. I measured the time between service startup and crash exit from the OSD log. The OSD restarted twice a minute in my scenario, so I set StartLimitInterval to 2 minutes and StartLimitBurst to 2.

I also want to know how the Ceph community solves this issue.

Updated by guotao Yao almost 6 years ago

In addition, I didn't change the RestartSec parameter; I kept it at 20 seconds.
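
For reference, the settings described in this comment could be written as a drop-in along these lines. The three values come from the comment itself. The file path is an assumption, and newer systemd versions name the interval setting StartLimitIntervalSec= in the [Unit] section instead.

```ini
# Hypothetical drop-in: /etc/systemd/system/ceph-osd@.service.d/start-limit.conf
[Service]
# Wait 20 seconds before each restart attempt (left unchanged).
RestartSec=20s
# Allow at most 2 starts within any 2-minute window; after that,
# systemd stops restarting the OSD.
StartLimitInterval=2min
StartLimitBurst=2
```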

Updated by Greg Farnum almost 6 years ago

I don't think the issue has moved beyond the PR linked above to change the systemd settings. I sent this out to one or two large users and was hoping to get some reports back on how it worked before doing any backports.

I looked briefly at whether we could somewhat easily tell systemd how it should behave by tweaking our exit codes or signals, but it didn't seem like a short project, so I let it drop. :(

Updated by Greg Farnum over 5 years ago

From a user:

There is some class of OSD out there (all filestore, IIRC) that are ultra slow to start at boot time in Luminous.

So on those machines I've been applying this change to ceph-osd@.service:

- RestartSec=20s
+ RestartSec=1s

And it fixes everything :)

Updated by Greg Farnum over 5 years ago

Status changed from In Progress to Resolved

Okay, after discussing with CERN I've merged the PR to master so this isn't an issue going forward.

But unfortunately I think we're just going to have to live with it on existing installs, as the ceph-disk races remain common and require so many restarts as to render this tuning pretty broken. :/

Updated by Nathan Cutler over 5 years ago

Backport deleted (~~mimic, luminous~~)

Clearing backport field on the assumption that's what was intended by the previous edit.
