Bug #24368

osd: should not restart on permanent failures

Added by Greg Farnum about 1 year ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Administration/Usability
Target version: -
Start date: 05/31/2018
Due date:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:

Description

Last week at OpenStack I heard a few users report OSDs were not failing hard and fast as they should be on disk issues. For some of them, there were definitely multiple causes. But one of the easy ones is that systemd (especially as we configure it) tries to keep services running, so when an OSD crashes it gets restarted and tries to rejoin the cluster.

There are two different approaches to take here:
1) Modify how frequently systemd can restart the service. (Changing the StartLimitInterval and StartLimitBurst values)
2) Modify in what cases systemd restarts the service. It turns out you can configure different kinds of process exit to behave differently (in systemd, "Clean exit code or signal", "Unclean exit code", "Unclean signal", "Timeout", and "Watchdog" are handled differently across the 6 options for when to restart on exit), AND you can specify that the service shouldn't restart on specific return values or signals. I'm not sure if our exit statuses are distinct enough for that to be useful right now, but we can definitely get there!
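
As a rough sketch, the two approaches could be combined in a drop-in override for the OSD unit. The drop-in path, the numeric limits, and the exit status 42 are illustrative assumptions, not values taken from the Ceph packaging; in particular, nothing here implies that ceph-osd currently returns a dedicated exit code for permanent failures.

```ini
# Hypothetical drop-in: /etc/systemd/system/ceph-osd@.service.d/restart-limits.conf
[Service]
# Approach 1: cap how often systemd may restart the OSD.
# Illustrative values: at most 3 starts within a 30-minute window.
StartLimitInterval=30min
StartLimitBurst=3

# Approach 2: restrict which kinds of exit trigger a restart.
# on-failure restarts after unclean exit codes, unclean signals,
# timeouts, and watchdog expiry, but not after a clean exit.
Restart=on-failure
# Never restart when the process exits with this status.
# 42 is a made-up placeholder for a "permanent failure" code.
RestartPreventExitStatus=42
```

Newer systemd releases name the interval setting StartLimitIntervalSec and document it under [Unit]; the older [Service] spelling used in this discussion should still be accepted for compatibility. A `systemctl daemon-reload` is needed after editing the drop-in.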

History

#2 Updated by Greg Farnum about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Greg Farnum

https://github.com/ceph/ceph/pull/22349 has the simple restart interval change. Will investigate the options for conditionally limiting restarts.

#3 Updated by Nathan Cutler about 1 year ago

  • Backport set to mimic, luminous

Sounds like something that would be useful in our stable releases - Greg, do you agree?

#4 Updated by Greg Farnum about 1 year ago

It would, but the previous settings were there for a reason so I'm not sure if it's feasible to backport this for ceph-disk users, or if they'd hit the startup race discussed in that PR commit.

Although maybe hitting a startup race is better than OSDs taking forever to get kicked out of the cluster, anyway?
Still planning, and I need to investigate it more. :)

#5 Updated by guotao Yao about 1 year ago

I've had a similar problem recently. One OSD crashed and exited, and systemd quickly restarted the OSD process. This caused the OSD to flap up and down. Many PGs were stuck in the peering state, and client I/O dropped sharply.

I found that this was caused by the systemd service, and that the StartLimitInterval and StartLimitBurst options limit the number of service restarts. From the OSD log, I measured the time between service startup and crash exit: the OSD restarted twice a minute in my scenario. I therefore set StartLimitInterval to 2 minutes and StartLimitBurst to 2.

I also want to know how the Ceph community solves this issue.

#6 Updated by guotao Yao about 1 year ago

In addition, I didn't change the RestartSec parameter; I kept it at 20 seconds.
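
For reference, the settings described above might look like this as a drop-in override. This is only a sketch: the file path is an assumption, and the values are simply the ones reported in this thread. With an OSD that crashes and restarts about twice a minute, the third start attempt inside the 2-minute window should be refused, leaving the unit in a failed state instead of flapping.

```ini
# Hypothetical drop-in: /etc/systemd/system/ceph-osd@.service.d/flap-limit.conf
[Service]
# Allow at most 2 starts per 2-minute window.
StartLimitInterval=2min
StartLimitBurst=2
# Left at the value reported above: wait 20 seconds before each restart.
RestartSec=20s
```

Once the underlying disk problem is fixed, `systemctl reset-failed` on the affected ceph-osd unit should clear the start counter so the OSD can be started again.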

#7 Updated by Greg Farnum about 1 year ago

I don't think the issue has moved beyond the PR linked above to change the systemd settings. I sent this out to one or two large users and was hoping to get some reports back on how it worked before doing any backports.

I looked briefly at whether we could somewhat easily tell systemd how it should behave by tweaking our exit codes or signals, but it didn't seem like a short project, so I let it drop. :(

#8 Updated by Greg Farnum 9 months ago

From a user:

There is some class of OSD out there (all filestore, IIRC) that are ultra slow to start at boot time in Luminous.

So on those machines I've been applying this change to ceph-osd@.service:

- RestartSec=20s
+ RestartSec=1s

And it fixes everything :)

#9 Updated by Greg Farnum 9 months ago

  • Status changed from In Progress to Resolved

Okay, after discussing with CERN I've merged the PR to master so this isn't an issue going forward.

But unfortunately I think we're just going to have to live with it on existing installs, as the ceph-disk races remain common and require so many restarts as to render this tuning pretty broken. :/

#10 Updated by Nathan Cutler 9 months ago

  • Backport deleted (mimic, luminous)

Clearing backport field on the assumption that's what was intended by the previous edit.
