Bug #58229
open
cephadm: maintenance mode doesn't account for which host has a service running/errored
0%
Description
It looks like the host a service is actually running on is not correctly considered when ceph orch host maintenance enter A is requested. cephadm seems to consider only the count of running services and whether the host should have a service running, ignoring the fact that the service could be in "error" state on that host while actually running on another host.
- Deploy service X (say, MGR) on 2 hosts: A and B.
- Get X to error state on A - it could be running but cephadm just doesn't know that it is running, for example the cluster could be in state CEPHADM_FAILED_DAEMON (this should be a subject for a separate bug report, but ignore the specifics for now). What's important is that X is running on one host and we want to have the other host in maintenance mode. X is now running "1/2", with cephadm knowing it definitely is OK on B, for example:

    ceph orch ls|grep mgr
    NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
    mgr            1/2    36s ago    3M   count:2

    ceph orch ps|grep mgr
    mgr.A  A  *:9283  error         105s ago  3D  77.9M  -  17.2.3
    mgr.B  B  *:9283  running (3d)  105s ago  3D  89.9M  -  17.2.3

- Try to enter maintenance mode for node A.
- cephadm will complain that there would be zero MGR running after maintenance mode. But it is running on node B, it is not currently running on A! In this example both MGR and Grafana services were running on B and were in error state on A:

    ALERT: Cannot stop ['mgr.A.borked'] in Mgr service. Not enough remaining Mgr daemons. Please deploy at least 2 Mgr daemons before stopping ['mgr.A.borked'].
    WARNING: Stopping 1 out of 1 daemons in Grafana service. Service will not be operational with no daemons left. At least 1 daemon must be running to guarantee service.

It looks like there is some naive algorithm like "IF service should run on this host AND servicecount - 1 is not OK, THEN deny maintenance", while the correct way would be "IF service should run on this host AND really is currently running on this host AND servicecount - 1 is not OK, THEN deny maintenance".
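The two rules can be sketched in a few lines of Python. This is only an illustration of the behaviour guessed at above, not cephadm's actual code: Daemon, naive_denies, corrected_denies and min_running are hypothetical names, and "servicecount" is assumed to mean the number of daemons of the service that cephadm currently reports as running.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Daemon:
    """Hypothetical, simplified view of one row of `ceph orch ps`."""
    service: str  # e.g. "mgr"
    host: str     # e.g. "A"
    status: str   # "running" or "error"


def _running(daemons: List[Daemon], service: str) -> int:
    # Daemons of this service that are reported as running, on any host.
    return sum(1 for d in daemons
               if d.service == service and d.status == "running")


def naive_denies(daemons: List[Daemon], host: str, service: str,
                 min_running: int = 1) -> bool:
    """Suspected current rule: every daemon placed on the host is
    subtracted from the running count, whatever its own status is."""
    on_host = [d for d in daemons
               if d.service == service and d.host == host]
    remaining = _running(daemons, service) - len(on_host)
    return bool(on_host) and remaining < min_running


def corrected_denies(daemons: List[Daemon], host: str, service: str,
                     min_running: int = 1) -> bool:
    """Proposed rule: only daemons actually running on the host are
    subtracted, so an errored daemon cannot block maintenance."""
    running_on_host = [d for d in daemons
                       if d.service == service and d.host == host
                       and d.status == "running"]
    remaining = _running(daemons, service) - len(running_on_host)
    return bool(running_on_host) and remaining < min_running


if __name__ == "__main__":
    # The situation from this report: mgr errored on A, running on B.
    cluster = [Daemon("mgr", "A", "error"), Daemon("mgr", "B", "running")]
    print(naive_denies(cluster, "A", "mgr"))      # True: maintenance denied
    print(corrected_denies(cluster, "A", "mgr"))  # False: maintenance allowed
    print(corrected_denies(cluster, "B", "mgr"))  # True: B holds the only running mgr
```

With the corrected rule, host A can enter maintenance because stopping an already errored daemon does not reduce the number of running daemons, while host B is still protected.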
I understand there could be much more to the algorithm, but the end result is the same: it doesn't work right. cephadm doesn't correctly consider services in "error" state on a host that has maintenance mode requested, and it needs to take the "error" state into account.
Updated by Redouane Kachach Elhichou over 1 year ago
- Assignee set to Redouane Kachach Elhichou
Updated by Redouane Kachach Elhichou about 1 year ago
- Assignee deleted (Redouane Kachach Elhichou)