Fix #15419

ceph-{mds,mon,osd,radosgw} systemd unit files need "wants=time-sync.target"

Added by Nathan Cutler almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 04/07/2016
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport: jewel
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

It sometimes happens, when starting up an entire cluster at once, that a MON or OSD starts before ntpd (or systemd-timesyncd or chrony) has had a chance to synchronize the clock. When this happens to a MON, the cluster comes up in HEALTH_WARN due to clock skew. Joao added some code to the MON in #14175 to make the MON cluster recover from this more quickly, but the quickest fix is to restart the offending MONs.

I have been spinning up clusters in Amazon Web Services (AWS) and I've found that this race between ntpd.service and the Ceph services is not limited to ceph-mon. If an OSD starts before the clock is synced, the cluster starts in HEALTH_WARN and all the PGs the offending OSD participates in get stuck in the "Peering" state. The problem disappears when the OSD is restarted.

The suggested fix is to add:

Wants=time-sync.target
After=time-sync.target

to the ceph-{mds,mon,osd,radosgw} systemd unit files. This will ensure that the ntpd/chrony/systemd-timesyncd service is started before the respective Ceph daemon starts.


Related issues

Copied to devops - Backport #15606: jewel: ceph-{mds,mon,osd,radosgw} systemd unit files need "wants=time-sync.target" Resolved

History

#2 Updated by Nathan Cutler almost 3 years ago

  • Status changed from In Progress to Need Review

#3 Updated by Sage Weil almost 3 years ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel

#4 Updated by Nathan Cutler almost 3 years ago

  • Copied to Backport #15606: jewel: ceph-{mds,mon,osd,radosgw} systemd unit files need "wants=time-sync.target" added

#5 Updated by Fabian Grünbichler almost 3 years ago

The "Wants=time-sync.target" is wrong here according to 'man systemd.special':

"A number of special system targets are defined that can be used to properly order boot-up of optional services. These targets are generally not part of the initial boot transaction, unless they are explicitly pulled in by one of the implementing services. Note specifically that these passive target units are generally not pulled in by the consumer of a service, but by the provider of the service. This means: a consuming service should order itself after these targets (as appropriate), but not pull it in. A providing service should order itself before these targets (as appropriate) and pull it in (via a Wants= type dependency)."

and

"time-sync.target
Services responsible for synchronizing the system clock from a remote source (such as NTP client implementations) should pull in this target and order themselves before it. All services where correct time is essential should be ordered after this unit, but not pull it in. systemd automatically adds dependencies of type After= for this target unit to all SysV init script service units with an LSB header referring to the "$time" facility."

What you want is probably only "After=time-sync.target"

#6 Updated by Nathan Cutler almost 3 years ago

Before this change we had:

Wants=network-online.target local-fs.target
After=network-online.target local-fs.target

After the change we have:

Wants=network-online.target local-fs.target time-sync.target
After=network-online.target local-fs.target time-sync.target

If the "Wants=" is only for targets provided by us, we should not have the "Wants=" line at all. Correct?

#7 Updated by Nathan Cutler almost 3 years ago

Hm, should have read the manpage. Thanks for pointing me to it. I will modify it so it looks like this:

Wants=network-online.target local-fs.target
After=network-online.target local-fs.target time-sync.target

I tested it pretty thoroughly, though, so I wonder if the "Wants=time-sync.target" is actively harmful or just superfluous.
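
For anyone who wants to check unit files for this pattern, here is a minimal sketch in Python, not part of the actual fix. It reads the Wants= and After= lines from unit-file text and reports whether time-sync.target is used for ordering only, as the manpage recommends for consumers. The function names (parse_unit_deps, time_sync_is_order_only) are made up for illustration, and the parser is deliberately simplified: it ignores drop-in directories and line continuations, so it is not a substitute for systemd's own parsing.

```python
def parse_unit_deps(unit_text):
    """Collect the units named on Wants= and After= lines of a unit file."""
    deps = {"Wants": set(), "After": set()}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Unit].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in deps:
            # A line may list several space-separated units, and the same
            # key may appear more than once, so accumulate the names.
            deps[key].update(value.split())
    return deps


def time_sync_is_order_only(unit_text):
    """True if the unit is ordered after time-sync.target without pulling it in."""
    deps = parse_unit_deps(unit_text)
    return ("time-sync.target" in deps["After"]
            and "time-sync.target" not in deps["Wants"])


# The two variants discussed in this ticket.
original = """\
[Unit]
Wants=network-online.target local-fs.target time-sync.target
After=network-online.target local-fs.target time-sync.target
"""

revised = """\
[Unit]
Wants=network-online.target local-fs.target
After=network-online.target local-fs.target time-sync.target
"""

print(time_sync_is_order_only(original))  # False: Wants= pulls the target in
print(time_sync_is_order_only(revised))   # True: ordering dependency only
```

This only checks the shape of the dependency lines. It says nothing about whether the extra Wants= is actually harmful at boot.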

#8 Updated by Fabian Grünbichler almost 3 years ago

Not sure if this is just "cosmetic", or if it might cause problems (dependency cycles?).

#9 Updated by Loic Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved
