cephadm: upgrade stuck in repeating sleep when a host is offline
Even though the documentation clearly states that all hosts should be online before an upgrade is initiated, I wanted to see how cephadm reacts when a host is offline at that time (for example, a host might unexpectedly go offline during an upgrade due to a hardware failure).
So I started the upgrade in a cluster with two monitor hosts and one OSD host offline. Cephadm started as usual but then got stuck in a repeating sleep:
2021-02-12T05:52:07.943110+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 151 : cephadm [INF] Upgrade: Checking mgr daemons...
2021-02-12T05:52:07.943532+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 152 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.iz-ceph-v1-mon-02.foqmfa)
2021-02-12T06:02:18.676352+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 461 : cephadm [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.8 with id 5553b0cb212ca2aa220d33ba39d9c602c8412ce6c5febc57ef9cdc9c5844b185
2021-02-12T06:02:18.678718+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 462 : cephadm [INF] Upgrade: Checking mgr daemons...
2021-02-12T06:02:18.679090+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 463 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.iz-ceph-v1-mon-02.foqmfa)
2021-02-12T06:12:28.015778+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 769 : cephadm [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.8 with id 5553b0cb212ca2aa220d33ba39d9c602c8412ce6c5febc57ef9cdc9c5844b185
2021-02-12T06:12:28.018644+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 770 : cephadm [INF] Upgrade: Checking mgr daemons...
2021-02-12T06:12:28.018973+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 771 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.iz-ceph-v1-mon-02.foqmfa)
[...]
2021-02-12T09:30:07.704808+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 10423 : cephadm [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.8 with id 5553b0cb212ca2aa220d33ba39d9c602c8412ce6c5febc57ef9cdc9c5844b185
2021-02-12T09:30:07.707638+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 10425 : cephadm [INF] Upgrade: Checking mgr daemons...
2021-02-12T09:30:07.708589+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 10428 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.iz-ceph-v1-mon-02.foqmfa)
The debug log shows this:
2021-02-12T09:30:07.709513+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 10431 : cephadm [DBG] Opening connection to root@iz-ceph-v1-mon-04 with ssh options '-F /tmp/cephadm-conf-g22x2uth -i /tmp/cephadm-identity-ydwgvq24'
2021-02-12T09:30:10.790286+0000 mgr.iz-ceph-v1-mon-02.foqmfa (mgr.20214115) 10433 : cephadm [DBG] Sleeping for 600 seconds
A manual ssh connection attempt shows the following error:
ssh: connect to host iz-ceph-v1-mon-04 port 22: No route to host
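A pre-flight reachability check could surface this condition before the upgrade starts instead of leaving it to an SSH timeout mid-upgrade. A minimal sketch (a hypothetical helper, not part of cephadm; the host names are just the ones from this cluster) that probes TCP port 22 on each host:

```python
import socket

def port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", connection refused, DNS failure, timeout.
        return False

# Probe every cluster host before initiating the upgrade.
hosts = ["iz-ceph-v1-mon-02", "iz-ceph-v1-mon-04"]
unreachable = [h for h in hosts if not port_open(h)]
if unreachable:
    print("refusing to start upgrade, unreachable hosts:", unreachable)
```

In cephadm itself the host inventory and SSH settings would of course come from the orchestrator, not a hard-coded list; this only illustrates the kind of check that could run up front.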
#2 Updated by Gunther Heinrich 8 days ago
Sebastian Wagner wrote:
Did you verify that the upgrade continues, if the host is online again?
Yes, the upgrade continues when the host is back online
I'm a bit inclined to close this as works-as-intended. An idea would be to validate that all hosts are online when initiating the upgrade.
To be honest, the way the upgrade process currently behaves doesn't seem very satisfactory, since it doesn't even inform the admin that a host isn't reachable. It's basically stuck in an endless wait loop without recognizing and handling this (fundamental) failure. I also thought about this, and I'm leaning in the same direction as you:
- When initiating the upgrade, cephadm does a quick check of the cluster health (parsing ceph status might be enough?). If there is a warning or an error, the upgrade doesn't start at all.
- When the upgrade is already in progress and a host then encounters a problem (hardware failure, network failure, etc.), the process informs the admin and retries a fixed number of times so that the admin has a chance to solve the problem. If it still fails, the upgrade process pauses and informs the admin.
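The retry-then-pause behavior proposed above could look roughly like this sketch (the names `try_connect`, `warn`, and `pause_upgrade` are hypothetical stand-ins for cephadm's real SSH connect, health-warning, and upgrade-pause mechanisms, not actual internals):

```python
import time

def connect_with_retries(host, try_connect, warn, pause_upgrade,
                         max_retries=5, delay=60):
    """Try to reach `host`; warn the admin on each failure and pause the
    whole upgrade once the retry budget is exhausted."""
    for attempt in range(1, max_retries + 1):
        if try_connect(host):
            return True
        warn(f"host {host} unreachable (attempt {attempt}/{max_retries})")
        if attempt < max_retries:
            time.sleep(delay)
    pause_upgrade(f"pausing upgrade: {host} still unreachable "
                  f"after {max_retries} attempts")
    return False
```

The key difference from the current behavior is that every failed attempt produces a visible warning, and the loop terminates in a paused state instead of sleeping forever.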
There might be some cases where either the host is offline for a longer period of time or the upgrade is urgent and absolutely necessary (15.2.7 resolved a potential data loss bug), so the admin might want to upgrade the rest of the cluster nonetheless. For these cases, the upgrade process could offer the option of forcing it via "--ignore-warnings" and "--ignore-errors". Those flags could be set by the admin either when initiating or when resuming an upgrade. For minor upgrades (e.g. from 15.2.4 to a later 15.2.x release), the resulting version differences should pose no problem for the cluster.
#4 Updated by Sebastian Wagner 8 days ago
Gunther Heinrich wrote:
There might be some cases where either the host is offline for a longer period of time or the upgrade is urgent and absolutely necessary (15.2.7 resolved a potential data loss bug)
Sounds like a valid use case. I guess a workaround would be to remove the host prior to starting the upgrade, like:
ceph orch host rm my_offline_host
ceph orch upgrade ...
# make the host reachable again
ceph orch host add my_offline_host