Bug #47360

cephadm: osd unit.run creates /var/run/ceph/$FSID too late, so OSD may not start after reboot

Added by Tim Serong about 2 months ago. Updated about 1 month ago.

Status: Pending Backport
Priority: Normal
Assignee:
Category: cephadm (binary)
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

The OSD unit.run file currently has the following form:

/usr/bin/podman run [...] -v /var/run/ceph/$FSID:/var/run/ceph:z [...] ceph-volume lvm activate [...]
/usr/bin/install -d -m0770 -o 167 -g 167 /var/run/ceph/$FSID
# osd.$ID
/usr/bin/podman run [...] ceph-osd [...]

That is, it first invokes podman [...] ceph-volume lvm activate, then creates /var/run/ceph/$FSID, then starts the OSD container. The problem is that the podman [...] ceph-volume lvm activate call bind-mounts /var/run/ceph/$FSID into the container, so it fails if that directory doesn't exist yet. You'll see something like 'Error: error checking path "/var/run/ceph/8f5be3a6-f1bb-11ea-9130-525400a64977": stat /var/run/ceph/8f5be3a6-f1bb-11ea-9130-525400a64977: no such file or directory' in the journal, and your OSD won't start.
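The ordering problem described above can be simulated without podman or root. This is only a rough sketch, not the actual fix: fake_activate is a hypothetical stand-in for the podman [...] ceph-volume lvm activate call, RUN_DIR is a temporary directory standing in for /var/run/ceph/$FSID, and "example-fsid" is a made-up name. It shows the activate step failing when the directory is created afterwards, and succeeding once the install -d line runs first.

```shell
#!/bin/sh
# Sketch only: simulates the unit.run ordering problem in a temp directory.
# fake_activate is a hypothetical stand-in for
# "podman run [...] ceph-volume lvm activate [...]", which needs its
# bind-mount source directory to exist before it starts.
fake_activate() {
    if [ ! -d "$1" ]; then
        echo "Error: error checking path \"$1\": no such file or directory" >&2
        return 1
    fi
    echo "activated with $1"
}

BASE=$(mktemp -d)
# Assumption: stands in for /var/run/ceph/$FSID on a freshly booted node.
RUN_DIR="$BASE/ceph/example-fsid"

# Current order: activate runs before the directory has been created.
if fake_activate "$RUN_DIR" 2>/dev/null; then
    ORIGINAL_ORDER=ok
else
    ORIGINAL_ORDER=failed
fi

# Reordered: create the directory first, then activate.
# The real unit.run also passes -o 167 -g 167, which requires root,
# so those options are left out of this sketch.
install -d -m0770 "$RUN_DIR"
if fake_activate "$RUN_DIR" >/dev/null; then
    FIXED_ORDER=ok
else
    FIXED_ORDER=failed
fi

echo "original order: $ORIGINAL_ORDER"
echo "reordered: $FIXED_ORDER"
rm -rf "$BASE"
```

In the real unit.run, the equivalent change would be to move the /usr/bin/install -d line ahead of the first podman run line. Whether the fix under review does exactly that is not stated here.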

I assume most users have never experienced this, because every other ceph daemon's unit.run file also creates /var/run/ceph/$FSID. If any other ceph daemon (including the crash daemon, which usually runs on all nodes) starts first, the directory already exists, and the OSDs start up just fine. During upgrades, however, it's entirely possible to adopt a bunch of OSDs, not start any other services on that node yet (including the crash service), reboot, and then have all the OSDs on that node fail to start. Ouch.

History

#1 Updated by Tim Serong about 2 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 37046

If you've got a node running OSDs and no other services, this is trivially reproducible by first running ceph orch rm crash to get rid of the crash daemon, then rebooting the OSD node.

#2 Updated by Tim Serong about 1 month ago

  • Status changed from Fix Under Review to Pending Backport
