Bug #45973: Adopted MDS daemons are removed by the orchestrator because they're orphans - Orchestrator - Ceph

closed

Added by Tim Serong almost 4 years ago. Updated about 3 years ago.

Status:

Rejected

Priority:

Low

Severity:

3 - minor

Description

The docs say that when converting to cephadm, one needs to redeploy MDS daemons. However, it is possible to adopt them (cephadm adopt [...] --name mds.myhost seems to work just fine). The problem is that shortly after the MDS is adopted, the cephadm orchestrator decides that it is an orphan (there's no service spec) and removes the daemon.

If the correct procedure is always to redeploy, and never to adopt an MDS, then cephadm adopt should presumably be changed to refuse to adopt MDSes (the same is possibly true for RGW, but I haven't verified this).

If, on the other hand, it's permitted to adopt an MDS, then I guess a service spec needs to be created for it automatically?

What's the right thing to do here?

Related issues 1 (1 open — 0 closed)

Updated by Sebastian Wagner almost 4 years ago

Hm. Isn't this a big flaw in adopt, not just for MDS?

We might need to apply something like this before adopting any daemons

service_type: mds
service_id: XXX
unmanaged: true

And run something like

service_type: mds
service_id: XXX
unmanaged: false
placement: ...

after the adoption is done.
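
The two-step idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `render_mds_spec` is a hypothetical helper, not part of cephadm, and the `hosts` placement with `myhost` is an invented example value, since the placement is left open above. Whether the orchestrator can accept such specs at that point in an adoption is a separate question.

```python
from typing import List, Optional


def render_mds_spec(service_id: str, unmanaged: bool,
                    hosts: Optional[List[str]] = None) -> str:
    """Render a minimal MDS service spec as YAML text (hypothetical helper)."""
    lines = [
        "service_type: mds",
        f"service_id: {service_id}",
        f"unmanaged: {'true' if unmanaged else 'false'}",
    ]
    # Placement is only emitted when hosts are given; the host list here is
    # an assumption for illustration, not a recommendation.
    if hosts:
        lines.append("placement:")
        lines.append("  hosts:")
        lines.extend(f"  - {host}" for host in hosts)
    return "\n".join(lines) + "\n"


# Step 1: spec applied before adoption, so the orchestrator leaves the
# daemons of this service alone.
before_adoption = render_mds_spec("XXX", unmanaged=True)

# Step 2: spec applied after adoption, handing the service back to the
# orchestrator with an explicit (example) placement.
after_adoption = render_mds_spec("XXX", unmanaged=False, hosts=["myhost"])

print(before_adoption)
print(after_adoption)
```

Each rendered spec could be written to a file and passed to `ceph orch apply -i <file>`, assuming the orchestrator is already available at that stage.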

Updated by Tim Serong almost 4 years ago

Sebastian Wagner wrote:

> Hm. Isn't this a big flaw in adopt, not just for MDS?

Not in practice so far. The docs say to adopt MON, MGR and OSD, and to redeploy everything else. The cephadm orchestrator doesn't care if MON, MGR and OSD don't have service specs (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1889), so it doesn't remove them as orphans.

That said, the comment on https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1890 claims that MON and MGR specs should always exist, and the fact that they don't after an adopt may mean that this all only works by accident. In which case, yes, this probably needs some attention.
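
To make the behavior described above concrete, here is a simplified model of the orphan check. It is a sketch and not the actual code in module.py: the `Daemon` tuple, `SPEC_OPTIONAL_TYPES`, and `find_orphans` are hypothetical names, and the service name `mds.XXX` is an assumed example.

```python
from typing import List, NamedTuple, Set


class Daemon(NamedTuple):
    daemon_type: str   # e.g. "mds"
    service_name: str  # service the daemon belongs to, e.g. "mds.XXX"
    name: str          # daemon name, e.g. "mds.myhost"


# Assumption: these types are tolerated without a service spec,
# as described in the comment above.
SPEC_OPTIONAL_TYPES = frozenset({"mon", "mgr", "osd"})


def find_orphans(daemons: List[Daemon], spec_names: Set[str]) -> List[str]:
    """Return names of daemons that would be treated as orphans."""
    orphans = []
    for daemon in daemons:
        if daemon.service_name in spec_names:
            continue  # a spec covers this daemon's service
        if daemon.daemon_type in SPEC_OPTIONAL_TYPES:
            continue  # no spec, but this type is exempt
        orphans.append(daemon.name)
    return orphans


adopted = [
    Daemon("mon", "mon", "mon.myhost"),
    Daemon("mgr", "mgr", "mgr.myhost"),
    Daemon("osd", "osd", "osd.0"),
    Daemon("mds", "mds.XXX", "mds.myhost"),
]

# Right after adoption there are no specs at all: only the MDS is an orphan.
print(find_orphans(adopted, spec_names=set()))        # ['mds.myhost']

# With a spec for the MDS service, nothing would be removed.
print(find_orphans(adopted, spec_names={"mds.XXX"}))  # []
```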

> We might need to apply something like this before adopting any daemons
> [...]

How do we apply service specs before adoption? The orchestrator can't be enabled until after MONs and MGRs are adopted...

Updated by Sebastian Wagner almost 4 years ago

It's not an accident that this is working. OTOH, this behavior needs improvement. Let me think about the chicken-and-egg problem a bit.

Updated by Sebastian Wagner almost 4 years ago

Priority changed from Normal to High

Updated by Tim Serong almost 4 years ago

We have the same problem with adopted prometheus instances (I adopted one, it was working fine for a few minutes, then the orchestrator went and removed it).
