Bug #42761
systemd restarts OSD too fast after failure
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags: systemd,ipv6
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Description
While upgrading a cluster from Mimic to Nautilus I noticed that after a reboot the OSDs wouldn't start on almost all servers.
I checked journalctl, which showed the following on all servers:
Nov 12 07:04:51 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:51 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY ceph-osd[2400]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:52 XXYY ceph-osd[2400]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:52 XXYY ceph-osd[2400]: 2019-11-12 07:04:52.146 7fc9a84f8dc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:52 XXYY ceph-osd[2400]: failed to fetch mon config (--no-mon-config to skip)
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service: main process exited, code=exited, status=1/FAILURE
Nov 12 07:04:52 XXYY systemd[1]: Unit ceph-osd@576.service entered failed state.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service failed.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service holdoff time over, scheduling restart.
Nov 12 07:04:52 XXYY systemd[1]: Stopped Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:52 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:52 XXYY ceph-osd[2533]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:52 XXYY ceph-osd[2533]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:52 XXYY ceph-osd[2533]: 2019-11-12 07:04:52.709 7fb510d8cdc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:52 XXYY ceph-osd[2533]: failed to fetch mon config (--no-mon-config to skip)
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service: main process exited, code=exited, status=1/FAILURE
Nov 12 07:04:52 XXYY systemd[1]: Unit ceph-osd@576.service entered failed state.
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service failed.
Nov 12 07:04:53 XXYY systemd[1]: ceph-osd@576.service holdoff time over, scheduling restart.
Nov 12 07:04:53 XXYY systemd[1]: Stopped Ceph object storage daemon osd.576.
Nov 12 07:04:53 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:53 XXYY systemd[1]: Started Ceph object storage daemon osd.576.
Nov 12 07:04:53 XXYY ceph-osd[2678]: server name not found: ceph-monitor.ceph.mydomain (Name or service not known)
Nov 12 07:04:53 XXYY ceph-osd[2678]: unable to parse addrs in 'ceph-monitor.ceph.mydomain'
Nov 12 07:04:53 XXYY ceph-osd[2678]: 2019-11-12 07:04:53.323 7f0e1e0b5dc0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Nov 12 07:04:53 XXYY ceph-osd[2678]: failed to fetch mon config (--no-mon-config to skip)
Here you can see that a start of the OSD was attempted at:
- Nov 12 07:04:51
- Nov 12 07:04:52
- Nov 12 07:04:53
The failures occurred because the (IPv6) network was not yet online while the system was booting, so the monitor hostname could not be resolved.
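The restart cadence can also be checked mechanically rather than by eye. The following is a minimal Python sketch, not part of Ceph or systemd: it scans an abbreviated copy of the journal output quoted above for the "Starting Ceph object storage daemon" entries and prints the gap between consecutive start attempts. The names `JOURNAL`, `START_RE`, `start_attempts` and `to_seconds` are made up for this illustration, and the timestamp handling assumes all entries fall on the same day.

```python
import re

# Abbreviated copy of the journal entries quoted in this report, one per line.
JOURNAL = """\
Nov 12 07:04:51 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:52 XXYY ceph-osd[2400]: failed to fetch mon config (--no-mon-config to skip)
Nov 12 07:04:52 XXYY systemd[1]: ceph-osd@576.service holdoff time over, scheduling restart.
Nov 12 07:04:52 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
Nov 12 07:04:52 XXYY ceph-osd[2533]: failed to fetch mon config (--no-mon-config to skip)
Nov 12 07:04:53 XXYY systemd[1]: ceph-osd@576.service holdoff time over, scheduling restart.
Nov 12 07:04:53 XXYY systemd[1]: Starting Ceph object storage daemon osd.576...
"""

# Matches only the systemd "Starting ..." entries and captures timestamp and OSD name.
START_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd\[\d+\]: "
    r"Starting Ceph object storage daemon (osd\.\d+)\.\.\.$"
)


def start_attempts(journal_text):
    """Return a list of (timestamp, osd) tuples, one per start attempt."""
    attempts = []
    for line in journal_text.splitlines():
        match = START_RE.match(line.strip())
        if match:
            attempts.append((match.group(1), match.group(2)))
    return attempts


def to_seconds(timestamp):
    """Convert 'Nov 12 07:04:51' to seconds since midnight (the date is ignored)."""
    hours, minutes, seconds = timestamp.split()[-1].split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


attempts = start_attempts(JOURNAL)
# Gap in seconds between each start attempt and the next one.
gaps = [to_seconds(b[0]) - to_seconds(a[0]) for a, b in zip(attempts, attempts[1:])]

for timestamp, osd in attempts:
    print(timestamp, osd)
print("seconds between start attempts:", gaps)
```

For the excerpt above the gaps come out at one second each, which is the journal's timestamp resolution, so the real interval between restarts may be shorter still.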
History
#1 Updated by Wido den Hollander over 4 years ago
This commit removed the following setting: https://github.com/ceph/ceph/commit/92f8ec5c0ef8bf500d9c608b6f372363f95629a4
RestartSec=20s
Its removal causes this problem.
network-online.target isn't 100% reliable as IPv6 might still be busy with Duplicate Address Detection.