
Bug #23083

ceph-mgr fails to start after a system reboot on Ubuntu 16.04

Added by Wido den Hollander 11 months ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: ceph-mgr
Target version: -
Start date: 02/22/2018
% Done: 0%
Backport: luminous
Regression: No
Severity: 3 - minor

Description

While upgrading a Luminous v12.2.2 system to 12.2.3, I noticed that the Mgr daemon wouldn't start on any of the three Monitors.

I tried a couple of times, but it consistently failed to start after a reboot. I raised logging to debug_mgr=10 and tried again:

2018-02-22 11:29:43.543272 7fcd70a02700  4 mgr handle_mgr_map active in map: 0 active is 1034134
2018-02-22 11:29:43.544208 7fcd6cff4700 -1 received  signal: Terminated from  PID: 1 task name: /sbin/init  UID: 0
2018-02-22 11:29:43.544231 7fcd6cff4700 -1 mgr handle_signal *** Got signal Terminated ***
2018-02-22 11:29:43.544235 7fcd6cff4700  4 mgr shutdown Shutting down
2018-02-22 11:29:43.546097 7fcd6cff4700 10 mgr shutdown waiting for module dashboard to shutdown
2018-02-22 11:29:43.546112 7fcd6cff4700 20 mgr Gil Switched to new thread state 0x564791f54260
2018-02-22 11:29:43.546454 7fcd6cff4700  4 mgr[dashboard] Stopping server...
2018-02-22 11:29:43.550298 7fcd6cff4700  4 mgr[dashboard] Stopped server
2018-02-22 11:29:43.550327 7fcd6cff4700 20 mgr ~Gil Destroying new thread state 0x564791f54260
2018-02-22 11:29:43.550335 7fcd6cff4700 10 mgr shutdown module dashboard shutdown
2018-02-22 11:29:43.550338 7fcd6cff4700 10 mgr shutdown joining thread for module dashboard
2018-02-22 11:29:43.557250 7fcd5eea5700  4 mgr[dashboard] Engine done.
2018-02-22 11:29:43.557280 7fcd5eea5700 20 mgr ~Gil Destroying new thread state 0x564791f52c60
2018-02-22 11:29:43.557427 7fcd6cff4700 10 mgr shutdown joined thread for module dashboard
2018-02-22 11:29:43.557446 7fcd6cff4700 20 mgr Gil Switched to new thread state 0x564791f54260
2018-02-22 11:29:43.557449 7fcd6cff4700 20 mgr ~Gil Destroying new thread state 0x564791f54260
2018-02-22 11:29:43.557467 7fcd6cff4700 20 mgr Gil Switched to new thread state 0x564791f54260
2018-02-22 11:29:43.557469 7fcd6cff4700 20 mgr ~Gil Destroying new thread state 0x564791f54260
2018-02-22 11:29:43.557474 7fcd6cff4700 20 mgr Gil Switched to new thread state 0x564791f54260
2018-02-22 11:29:43.557475 7fcd6cff4700 20 mgr ~Gil Destroying new thread state 0x564791f54260
2018-02-22 11:29:43.557477 7fcd6cff4700 20 mgr Gil Switched to new thread state 0x564791f54260
2018-02-22 11:29:43.557480 7fcd6cff4700 20 mgr ~Gil Destroying new thread state 0x564791f54260
2018-02-22 11:29:43.557482 7fcd6cff4700 20 mgr Gil Switched to new thread state 0x564791f54260
2018-02-22 11:29:43.557484 7fcd6cff4700 20 mgr ~Gil Destroying new thread state 0x564791f54260
2018-02-22 11:30:49.327560 7fa18c91d680  0 set uid:gid to 64045:64045 (ceph:ceph)
2018-02-22 11:30:49.327574 7fa18c91d680  0 ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable), process (unknown), pid 1524
2018-02-22 11:30:49.330552 7fa18c91d680  0 pidfile_write: ignore empty --pid-file
2018-02-22 11:30:49.598648 7f6b62bf5680  0 set uid:gid to 64045:64045 (ceph:ceph)
2018-02-22 11:30:49.598661 7f6b62bf5680  0 ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable), process (unknown), pid 1689
2018-02-22 11:30:49.598809 7f6b62bf5680  0 pidfile_write: ignore empty --pid-file
2018-02-22 11:30:49.791143 7fa2e38cc680  0 set uid:gid to 64045:64045 (ceph:ceph)
2018-02-22 11:30:49.791157 7fa2e38cc680  0 ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable), process (unknown), pid 1728
2018-02-22 11:30:49.791340 7fa2e38cc680  0 pidfile_write: ignore empty --pid-file

Here you can see that the Mgr was shut down for the reboot and that, after the reboot, systemd tried to start it three times (PIDs 1524, 1689 and 1728).

systemd also shows that the mgr failed to start:

root@mon03:~# systemctl status ceph-mgr@mon03
   ceph-mgr@mon03.service - Ceph cluster manager daemon
   Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
   Active: inactive (dead) (Result: exit-code) since Thu 2018-02-22 11:30:50 CET; 2min 29s ago
  Process: 1728 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=255)
 Main PID: 1728 (code=exited, status=255)

Feb 22 11:30:49 mon03.ceph.dz57927.ams02.cldin.net systemd[1]: ceph-mgr@mon03.service: Unit entered failed state.
Feb 22 11:30:49 mon03.ceph.dz57927.ams02.cldin.net systemd[1]: ceph-mgr@mon03.service: Failed with result 'exit-code'.
Feb 22 11:30:50 mon03.ceph.dz57927.ams02.cldin.net systemd[1]: ceph-mgr@mon03.service: Service hold-off time over, scheduling restart.
Feb 22 11:30:50 mon03.ceph.dz57927.ams02.cldin.net systemd[1]: Stopped Ceph cluster manager daemon.
Feb 22 11:30:50 mon03.ceph.dz57927.ams02.cldin.net systemd[1]: ceph-mgr@mon03.service: Start request repeated too quickly.
Feb 22 11:30:50 mon03.ceph.dz57927.ams02.cldin.net systemd[1]: Failed to start Ceph cluster manager daemon.
root@mon03:~#
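The "Start request repeated too quickly" line comes from systemd's start rate limiting: once a unit has been started more than StartLimitBurst= times within StartLimitIntervalSec=, further starts are refused until the failed state is reset. The three starts in the log land within about half a second, which would fit a small burst combined with a very short restart delay. The Python sketch below is a simplified model of that loop (a trailing window, not systemd's exact accounting). The burst, interval, restart delay, start duration and the moment networking becomes usable are all assumed values, not read from the shipped ceph-mgr@.service unit.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RestartPolicy:
    restart_sec: float           # RestartSec=: pause between a failure and the next start
    start_limit_interval: float  # StartLimitIntervalSec=: window in which starts are counted
    start_limit_burst: int       # StartLimitBurst=: starts allowed within that window


def simulate(policy: RestartPolicy, network_up_at: float,
             start_duration: float = 0.15) -> Tuple[bool, List[float]]:
    """Model systemd restarting a daemon that fails until the network is up.

    Hypothetical model: a start attempted before `network_up_at` (seconds
    after boot) fails after `start_duration` seconds; a start at or after
    that moment succeeds. Returns (came_up, start_times).
    """
    starts: List[float] = []
    t = 0.0
    while True:
        # Simplified rate limiter: count the starts inside the trailing window.
        recent = [s for s in starts if t - s < policy.start_limit_interval]
        if len(recent) >= policy.start_limit_burst:
            return False, starts  # "Start request repeated too quickly."
        starts.append(t)
        if t >= network_up_at:
            return True, starts
        # Failed start: systemd waits RestartSec before the next attempt.
        t += start_duration + policy.restart_sec


# Assumed values for illustration only.
fast = RestartPolicy(restart_sec=0.1, start_limit_interval=1800, start_limit_burst=3)
slow = RestartPolicy(restart_sec=10, start_limit_interval=1800, start_limit_burst=3)
print(simulate(fast, network_up_at=5.0))  # gives up after three failed starts
print(simulate(slow, network_up_at=5.0))  # the second start already succeeds
```

Under these assumptions a 100 ms restart delay burns through all allowed starts before the (hypothetical) network is ready at t = 5 s, while a longer delay spends the same budget more slowly and gets past that point.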

A manual start doesn't work either:

root@mon03:~# systemctl start ceph-mgr@mon03
Job for ceph-mgr@mon03.service failed because the control process exited with error code. See "systemctl status ceph-mgr@mon03.service" and "journalctl -xe" for details.
root@mon03:~#

Only a reset-failed followed by a start brings it back:

root@mon03:~# systemctl reset-failed ceph-mgr@mon03
root@mon03:~# systemctl start ceph-mgr@mon03
root@mon03:~#
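To avoid a manual reset-failed after every reboot, one option is to space out systemd's automatic restarts with a drop-in override for the template unit, so the allowed restarts are less likely to be used up before the host is ready. The sketch below only writes the drop-in file. The 10-second RestartSec= value and the file name restart-delay.conf are assumptions for illustration, not necessarily what the upstream fix does.

```python
from pathlib import Path


def write_restart_dropin(unit_dir: Path,
                         unit: str = "ceph-mgr@.service",
                         restart_sec: int = 10) -> Path:
    """Write a systemd drop-in that delays automatic restarts of `unit`.

    A drop-in under <unit_dir>/ceph-mgr@.service.d/ applies to every
    instance of the template (ceph-mgr@mon01, ceph-mgr@mon03, ...).
    The file name and the RestartSec value are hypothetical choices.
    """
    dropin_dir = unit_dir / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / "restart-delay.conf"
    # Only the keys set here are overridden; the rest of the packaged
    # unit file stays as shipped.
    path.write_text(f"[Service]\nRestartSec={restart_sec}\n")
    return path
```

On a Monitor this would be called as root with `Path("/etc/systemd/system")`, followed by `systemctl daemon-reload` so systemd picks up the override.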

I've seen this happen a few times and suspect it is IPv6-related (this cluster runs IPv6).

The manager seems to start before networking is fully up, but I haven't found any real evidence for that yet.
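One plausible mechanism for an IPv6-only failure, not confirmed in this report, is that a freshly configured IPv6 address stays "tentative" while duplicate address detection (DAD) runs, and Linux refuses to bind to it with EADDRNOTAVAIL until DAD completes. The hypothetical Python sketch below checks whether the host can already bind to a given address and polls until it can, the kind of check one might run before starting the daemon. The function names and the example address 2001:db8::10 are illustrative only; none of this is part of Ceph.

```python
import errno
import socket
import time


def can_bind(addr: str) -> bool:
    """Return True if a local socket can bind to `addr` right now.

    EADDRNOTAVAIL means the address is not (yet) usable on this host,
    e.g. not configured, or an IPv6 address still in DAD "tentative" state.
    Any other error is re-raised.
    """
    family = socket.AF_INET6 if ":" in addr else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as s:
        try:
            s.bind((addr, 0))  # port 0: let the kernel pick any free port
        except OSError as e:
            if e.errno == errno.EADDRNOTAVAIL:
                return False
            raise
    return True


def wait_for_address(addr: str, timeout: float = 30.0,
                     interval: float = 0.5) -> bool:
    """Poll until `addr` is bindable or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if can_bind(addr):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_address("2001:db8::10", timeout=30)` would return True as soon as that (hypothetical) Monitor address becomes bindable, or False after 30 seconds.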


Related issues

Copied to mgr - Backport #23101: luminous: ceph-mgr fails to start after a system reboot on Ubuntu 16.04 Resolved

History

#2 Updated by Sage Weil 11 months ago

  • Status changed from New to Pending Backport
  • Backport set to luminous

#3 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23101: luminous: ceph-mgr fails to start after a system reboot on Ubuntu 16.04 added

#4 Updated by Nathan Cutler 10 months ago

  • Status changed from Pending Backport to Resolved
