Bug #45973

Adopted MDS daemons are removed by the orchestrator because they're orphans

Added by Tim Serong 2 months ago. Updated about 2 months ago.

Status: Fix Under Review
Priority: High
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

The docs say that when converting to cephadm, one needs to redeploy MDS daemons. However, it is possible to adopt them (cephadm adopt [...] --name mds.myhost seems to work just fine). The problem is that shortly after being adopted, the cephadm orchestrator decides that the MDS is an orphan (there's no service spec), and goes and removes the daemon.

If the correct procedure is always to redeploy, and never to adopt an MDS, then cephadm adopt should presumably be changed to refuse to adopt MDSes (the same is possibly true for RGW, but I haven't verified this).

If, on the other hand, it's permitted to adopt an MDS, then I guess a service spec needs to be created for it automatically?

What's the right thing to do here?

History

#1 Updated by Sebastian Wagner 2 months ago

Hm. Isn't this a big flaw in adopt, not just for MDS?

We might need to apply something like this before adopting any daemons

service_type: mds
service_id: XXX
unmanaged: true

And run something like

service_type: mds
service_id: XXX
unmanaged: false
placement: ...

after the adoption is done.
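
A minimal sketch of the two specs described above, before and after adoption. It only renders the YAML text and does not talk to a cluster. The helper name render_mds_spec, the service id "myfs" and the host "myhost" are hypothetical, and the placement shown is just one possible shape since the original leaves it open. It also assumes the orchestrator is already able to accept specs (presumably via ceph orch apply -i) at the time the first one is applied.

```python
def render_mds_spec(service_id, unmanaged, placement_hosts=None):
    """Render an MDS service spec as YAML text (hypothetical helper)."""
    lines = [
        "service_type: mds",
        f"service_id: {service_id}",
        f"unmanaged: {'true' if unmanaged else 'false'}",
    ]
    # Placement is only added once the orchestrator should take over.
    if placement_hosts:
        lines.append("placement:")
        lines.append("  hosts:")
        lines.extend(f"    - {host}" for host in placement_hosts)
    return "\n".join(lines) + "\n"


# Step 1: register the service as unmanaged before adopting the daemon.
before = render_mds_spec("myfs", unmanaged=True)

# Step 2: hand the service over to the orchestrator after adoption.
# "myhost" is an illustrative placement, not taken from the report.
after = render_mds_spec("myfs", unmanaged=False, placement_hosts=["myhost"])

print(before)
print(after)
```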

#2 Updated by Tim Serong about 2 months ago

Sebastian Wagner wrote:

Hm. Isn't this a big flaw in adopt, not just for MDS?

Not in practice so far. The docs say to adopt MON, MGR and OSD, and to redeploy everything else. The cephadm orchestrator doesn't care if MON, MGR and OSD don't have service specs (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1889), so doesn't remove them as orphans.

That said, the comment on https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1890 claims that MON and MGR specs should always exist, and the fact that they don't after an adopt may mean that this all only works by accident. In which case, yes, this probably needs some attention.
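
To make the behavior concrete, here is a rough sketch of the orphan check as described in this thread. It is not the code from module.py. The Daemon class, find_orphans, and the service names are hypothetical, and the exempt set simply mirrors the statement above that MON, MGR and OSD are left alone without a spec.

```python
from dataclasses import dataclass

# Assumption from the discussion: these types are never treated as orphans,
# even when no service spec exists for them.
EXEMPT_TYPES = {"mon", "mgr", "osd"}


@dataclass(frozen=True)
class Daemon:
    daemon_type: str   # e.g. "mds"
    service_name: str  # e.g. "mds.myfs"
    hostname: str


def find_orphans(daemons, spec_names):
    """Return daemons that have no matching service spec and are not exempt."""
    return [
        d for d in daemons
        if d.daemon_type not in EXEMPT_TYPES
        and d.service_name not in spec_names
    ]


# Freshly adopted daemons, with no service specs created yet.
adopted = [
    Daemon("mon", "mon", "myhost"),
    Daemon("osd", "osd", "myhost"),
    Daemon("mds", "mds.myfs", "myhost"),
    Daemon("prometheus", "prometheus", "myhost"),
]

orphans = find_orphans(adopted, spec_names=set())
print([d.service_name for d in orphans])  # ['mds.myfs', 'prometheus']
```

Under these assumptions the adopted MDS and prometheus daemons are flagged for removal while MON and OSD survive, which matches what was observed. Passing a spec name for each adopted service makes the list empty.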

We might need to apply something like this before adopting any daemons
[...]

How do we apply service specs before adoption? The orchestrator can't be enabled until after MONs and MGRs are adopted...

#3 Updated by Sebastian Wagner about 2 months ago

It's not an accident that this is working. OTOH, this behavior needs improvement. Let me think about the chicken-and-egg problem a bit.

#4 Updated by Sebastian Wagner about 2 months ago

  • Priority changed from Normal to High

#5 Updated by Tim Serong about 2 months ago

We have the same problem with adopted prometheus instances (I adopted one, it was working fine for a few minutes, then the orchestrator went and removed it).

#6 Updated by Sebastian Wagner about 2 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Sebastian Wagner
  • Pull request ID set to 35669
