Bug #45973
closed
Adopted MDS daemons are removed by the orchestrator because they're orphans
Added by Tim Serong almost 4 years ago.
Updated about 3 years ago.
Description
The docs say that when converting to cephadm, one needs to redeploy MDS daemons. However, it is possible to adopt them (cephadm adopt [...] --name mds.myhost seems to work just fine). The problem is that shortly after the MDS is adopted, the cephadm orchestrator decides that it is an orphan (there's no service spec) and removes the daemon.
If the correct procedure is always to redeploy, and never to adopt an MDS, then cephadm adopt should presumably be changed to refuse to adopt MDSes (the same is possibly true for RGW, but I haven't verified this).
If, on the other hand, it's permitted to adopt an MDS, then I guess a service spec needs to be created for it automatically?
What's the right thing to do here?
Hm. Isn't this a big flaw in adopt, not just for MDS?
We might need to apply something like this before adopting any daemons
service_type: mds
service_id: XXX
unmanaged: true
And run something like
service_type: mds
service_id: XXX
unmanaged: false
placement: ...
after the adoption is done.
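The two-step idea above can be sketched as a small helper that produces the "before" and "after" specs. This is only an illustration: the field names come from the comment above, `XXX` is the same placeholder, and the `placement` value (`count: 1`) is a hypothetical stand-in for the elided placement. The functions `adoption_specs` and `to_yaml` are made up for this sketch and are not part of cephadm.

```python
def adoption_specs(service_type, service_id, placement):
    """Return (before, after) spec dicts for adopting a daemon.

    'before' marks the service unmanaged so the orchestrator leaves the
    adopted daemon alone; 'after' hands control back with a placement.
    """
    base = {"service_type": service_type, "service_id": service_id}
    before = dict(base, unmanaged=True)
    after = dict(base, unmanaged=False, placement=placement)
    return before, after


def to_yaml(spec, indent=0):
    """Render a small, possibly nested dict as plain YAML text."""
    pad = " " * indent
    lines = []
    for key, value in spec.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        elif isinstance(value, bool):
            # YAML booleans are lowercase, unlike Python's True/False.
            lines.append(f"{pad}{key}: {'true' if value else 'false'}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


# Hypothetical placement; the original comment leaves it elided.
before, after = adoption_specs("mds", "XXX", {"count": 1})
print(to_yaml(before))
print("---")
print(to_yaml(after))
```

Each rendered document could then be saved to a file and applied with the orchestrator once it is running, which is exactly the ordering question raised in the reply below.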
Sebastian Wagner wrote:
Hm. Isn't this a big flaw in adopt, not just for MDS?
Not in practice so far. The docs say to adopt MON, MGR and OSD, and to redeploy everything else. The cephadm orchestrator doesn't care if MON, MGR and OSD don't have service specs (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1889), so it doesn't remove them as orphans.
That said, the comment on https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1890 claims that MON and MGR specs should always exist, and the fact that they don't after an adopt may mean that this all only works by accident. In which case, yes, this probably needs some attention.
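To make the behavior described above concrete, here is a minimal sketch of that kind of orphan check. It is not the code in module.py: the function name `find_orphans`, the tuple layout, and the service names in the example are assumptions made for illustration. Only the rule itself (daemons without a spec are orphans, except MON, MGR and OSD) is taken from this thread.

```python
# Daemon types that are tolerated without a service spec, per the comment above.
SPECLESS_OK = {"mon", "mgr", "osd"}


def find_orphans(daemons, spec_names):
    """Return the daemons that would be treated as orphans.

    daemons:    iterable of (daemon_type, service_name) tuples
    spec_names: set of service names that have a service spec
    """
    orphans = []
    for daemon_type, service_name in daemons:
        if service_name in spec_names:
            continue  # covered by a spec, so not an orphan
        if daemon_type in SPECLESS_OK:
            continue  # no spec, but this type is left alone
        orphans.append((daemon_type, service_name))
    return orphans


# Hypothetical state right after adoption: no specs exist yet.
adopted = [
    ("mon", "mon"),
    ("mgr", "mgr"),
    ("osd", "osd"),
    ("mds", "mds.XXX"),
    ("prometheus", "prometheus"),
]
print(find_orphans(adopted, spec_names=set()))
```

With no specs present, only the MDS and prometheus entries are returned, which matches the removals reported in this ticket. Adding `"mds.XXX"` to `spec_names` would take the MDS off the list.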
We might need to apply something like this before adopting any daemons
[...]
How do we apply service specs before adoption? The orchestrator can't be enabled until after MONs and MGRs are adopted...
It's not an accident that this is working. OTOH, this behavior needs improvement. Let me think about the chicken-and-egg problem a bit.
- Priority changed from Normal to High
We have the same problem with adopted prometheus instances (I adopted one, it was working fine for a few minutes, then the orchestrator went and removed it).
- Status changed from New to Fix Under Review
- Assignee set to Sebastian Wagner
- Pull request ID set to 35669
- Status changed from Fix Under Review to New
- Assignee deleted (Sebastian Wagner)
- Pull request ID deleted (35669)
- Priority changed from High to Low
Prio=low. It's probably easier to simply redeploy MDS upstream and find a downstream-specific solution for the downstreams.
- Related to Bug #46561: cephadm: monitoring services adoption doesn't honor the container image added
- Status changed from New to Rejected
fixed by both downstreams