Bug #42761

systemd restarts OSD too fast after failure

Added by Wido den Hollander over 4 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags: systemd,ipv6
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While upgrading a cluster from Mimic to Nautilus I noticed that after a reboot the OSDs wouldn't start on almost all servers.

I checked journalctl, which showed the following on all servers:

Nov 12 07:04:51 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:51 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY ceph-osd[2400]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:52 XXYY ceph-osd[2400]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:52 XXYY ceph-osd[2400]: 2019-11-12 07:04:52.146 7fc9a84f8dc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:52 XXYY ceph-osd[2400]: failed to fetch mon config (--no-mon-config to skip)
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service: main process exited, code=exited, status=1/FAILURE
Nov 12 07:04:52 XXYY systemd[1]: Unit ceph-osd@576.service entered failed state.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service failed.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service holdoff time over, scheduling restart.
Nov 12 07:04:52 XXYY systemd[1]: Stopped Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:52 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY ceph-osd[2533]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:52 XXYY ceph-osd[2533]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:52 XXYY ceph-osd[2533]: 2019-11-12 07:04:52.709 7fb510d8cdc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:52 XXYY ceph-osd[2533]: failed to fetch mon config (--no-mon-config to skip)
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service: main process exited, code=exited, status=1/FAILURE
Nov 12 07:04:52 XXYY systemd[1]: Unit ceph-osd@576.service entered failed state.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service failed.
Nov 12 07:04:53 XXYY systemd[1]: ceph-osd@576.service holdoff time over, scheduling restart.
Nov 12 07:04:53 XXYY systemd[1]: Stopped Ceph object storage daemon osd.576.
Nov 12 07:04:53 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:53 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:53 XXYY ceph-osd[2678]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:53 XXYY ceph-osd[2678]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:53 XXYY ceph-osd[2678]: 2019-11-12 07:04:53.323 7f0e1e0b5dc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:53 XXYY ceph-osd[2678]: failed to fetch mon config (--no-mon-config to skip)

The log shows that a start of the OSD was attempted at:

- Nov 12 07:04:51
- Nov 12 07:04:52
- Nov 12 07:04:53

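The failure occurred because the (IPv6) network was not online yet while the system booted, so the monitor hostname could not be resolved. With restarts only about a second apart, systemd ran through its retries before the network came up.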
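To make the restart interval explicit, here is a minimal sketch that pulls the "Starting Ceph object storage daemon" timestamps out of journal lines and computes the gap between consecutive attempts. The function names and the hard-coded year are assumptions for illustration (the short journal timestamp format carries no year); the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Matches the systemd "Starting ..." line for an OSD unit and captures
# the short syslog-style timestamp at the start of the line.
START_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: "
    r"Starting Ceph object storage daemon osd\.\d+\.\.\."
)


def start_attempts(lines, year=2019):
    """Return the timestamps of all OSD start attempts found in `lines`.

    `year` is an assumption: the journal's short format omits it.
    """
    attempts = []
    for line in lines:
        match = START_RE.match(line)
        if match:
            stamp = f"{year} {match.group(1)}"
            attempts.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return attempts


def restart_gaps(attempts):
    """Return the seconds elapsed between consecutive start attempts."""
    return [(b - a).total_seconds() for a, b in zip(attempts, attempts[1:])]


SAMPLE = [
    "Nov 12 07:04:51 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...",
    "Nov 12 07:04:51 XXYY systemd[1]: Started Ceph object storage daemon osd.576.",
    "Nov 12 07:04:52 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...",
    "Nov 12 07:04:53 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...",
]

print(restart_gaps(start_attempts(SAMPLE)))  # [1.0, 1.0]
```

Gaps of about one second confirm that no meaningful delay is applied between restarts.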

History

#1 Updated by Wido den Hollander over 4 years ago

The following setting was removed in this commit: https://github.com/ceph/ceph/commit/92f8ec5c0ef8bf500d9c608b6f372363f95629a4

RestartSec=20s

Its removal causes this problem.

network-online.target isn't 100% reliable as IPv6 might still be busy with Duplicate Address Detection.
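As a possible local workaround until the unit file is fixed, the delay could be restored with a systemd drop-in. The sketch below renders and writes such a drop-in; it is not taken from the commit. The drop-in file name restart-delay.conf and both function names are hypothetical, and the unit name ceph-osd@.service is inferred from the ceph-osd@576.service instances in the log.

```python
from pathlib import Path


def render_restart_dropin(restart_sec=20):
    """Return drop-in text that sets RestartSec for a service unit."""
    return f"[Service]\nRestartSec={restart_sec}s\n"


def write_dropin(systemd_dir, text,
                 unit="ceph-osd@.service", name="restart-delay.conf"):
    """Write `text` to <systemd_dir>/<unit>.d/<name> and return the path.

    `name` is a hypothetical file name; any *.conf name works for a drop-in.
    """
    dropin_dir = Path(systemd_dir) / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / name
    path.write_text(text)
    return path


if __name__ == "__main__":
    # Print the drop-in instead of writing it, so the sketch is safe to run.
    print(render_restart_dropin(20), end="")
```

On a real host the target directory would be /etc/systemd/system, and `systemctl daemon-reload` has to be run afterwards for systemd to pick up the drop-in.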
