Bug #64424
Ceph orch unsuitable for stateless / RAM-booted hosts
Description
Hi, I'm extremely unhappy with how the new Ceph orchestrator handles node reboots, especially if those nodes are RAM-booted.
I have 78 hosts with 16-20 OSDs each. The hosts are all PXE-booted from a network-provisioned image, i.e. they are stateless and require redeployment of OSDs after each reboot. Ceph orch's reconciliation time after a reboot is around 25-30 minutes, which is absolutely unacceptable. It means I have to wait half an hour after rebooting a node before I can reboot the next. That turns a rolling reboot, which usually takes a few hours, into a 2-day ordeal.
The only way around this I found was manually deploying all OSDs with a custom script using cephadm deploy. However, I then have to edit the unit.meta file to sort the OSDs into the appropriate orch service and replicate all the filter logic from my service YAML.
One potential solution I see here is a local cephadm command that triggers an immediate reconciliation for the current host, redeploying all necessary services.
Since this is such a critical part of the whole system, I'm labelling this as a bug report rather than a feature request. Feel free to relabel.
Updated by Janek Bevendorff 3 months ago
Another issue here: If /var/lib/ceph is persistent, but /etc/ceph is not, then ceph orch will not redeploy anything. Instead, all OSDs will be listed as "stopped" in ceph orch ps hostname.
Updated by Janek Bevendorff 3 months ago
(Or /etc/systemd/system rather)
I have to manually start all of them with ceph orch daemon start osd.X. That will start the OSDs again, but it will not recreate the systemd service files in /etc/systemd/system/ceph-<FSID>.target.wants.