Bug #47694
Downgrading via ceph orch upgrade start results in partial application and mixed state
Status: closed
Description
Following https://docs.ceph.com/en/latest/cephadm/upgrade/#using-customized-container-images I attempted to downgrade my cluster.
The process starts fine, but I end up in an inconsistent state: two mgr daemons downgraded, the upgrade apparently reported as successful, and the cluster in HEALTH_WARN.
Starting with a healthy cluster at version 15.2.5-220-gb758bfd693 (SUSE downstream container), I ran ceph orch upgrade start --image <custom registry url>/containers/ses/7/containers/ses/7/ceph/ceph:15.2.0.108. This starts the process fine, and I can watch the progress of the image pull in ceph -s.
After a while this finishes, leaving the cluster in the following state:
master:~ # ceph versions
{
    "mon": {
        "ceph version 15.2.5-220-gb758bfd693 (b758bfd69359a0ffa10bd5426d64e7636bb0a6c6) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.0-108-g8cf4f02b08 (8cf4f02b0814fc5dc803ae5923cb310bb08de967) octopus (stable)": 2,
        "ceph version 15.2.5-220-gb758bfd693 (b758bfd69359a0ffa10bd5426d64e7636bb0a6c6) octopus (stable)": 1
    },
    "osd": {
        "ceph version 15.2.5-220-gb758bfd693 (b758bfd69359a0ffa10bd5426d64e7636bb0a6c6) octopus (stable)": 20
    },
    "mds": {
        "ceph version 15.2.5-220-gb758bfd693 (b758bfd69359a0ffa10bd5426d64e7636bb0a6c6) octopus (stable)": 2
    },
    "overall": {
        "ceph version 15.2.0-108-g8cf4f02b08 (8cf4f02b0814fc5dc803ae5923cb310bb08de967) octopus (stable)": 2,
        "ceph version 15.2.5-220-gb758bfd693 (b758bfd69359a0ffa10bd5426d64e7636bb0a6c6) octopus (stable)": 26
    }
}

master:~ # ceph -s
  cluster:
    id:     2f578f24-02e5-11eb-92b7-52540064363c
    health: HEALTH_WARN
            4 hosts fail cephadm check
            failed to probe daemons or devices
            28 stray daemons(s) not managed by cephadm

  services:
    mon: 3 daemons, quorum master,node1,node2 (age 2h)
    mgr: node2.ibtqev(active, since 4m), standbys: master.wdjpkv, node1.cgixgj
    mds: sesdev_fs:1 {0=sesdev_fs.node3.jwnsyq=up:active} 1 up:standby
    osd: 20 osds: 20 up (since 2h), 20 in (since 2h)

  task status:
    scrub status:
        mds.sesdev_fs.node3.jwnsyq: idle

  data:
    pools:   3 pools, 65 pgs
    objects: 22 objects, 2.8 KiB
    usage:   20 GiB used, 140 GiB / 160 GiB avail
    pgs:     65 active+clean
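The mixed state is visible directly in the `ceph versions` JSON above: only the mgr map lists two versions. A minimal sketch of how one could spot that programmatically (hypothetical helper, not part of any Ceph tooling; version strings abbreviated from the output above):

```python
import json

def mixed_versions(ceph_versions_json):
    """Return daemon types running more than one Ceph version.

    Expects the JSON printed by `ceph versions`: a mapping of daemon
    type to {version string: daemon count}, plus an "overall" summary.
    """
    data = json.loads(ceph_versions_json)
    return {
        daemon: versions
        for daemon, versions in data.items()
        if daemon != "overall" and len(versions) > 1
    }

# Abbreviated version of the cluster state reported above:
report = json.dumps({
    "mon": {"15.2.5-220-gb758bfd693": 3},
    "mgr": {"15.2.0-108-g8cf4f02b08": 2, "15.2.5-220-gb758bfd693": 1},
    "osd": {"15.2.5-220-gb758bfd693": 20},
    "mds": {"15.2.5-220-gb758bfd693": 2},
    "overall": {"15.2.0-108-g8cf4f02b08": 2, "15.2.5-220-gb758bfd693": 26},
})
print(mixed_versions(report))  # only "mgr" is mixed
```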
The current active mgr could not be failed:
master:~ # ceph mgr fail ibtqev
Daemon not found 'ibtqev', already failed
I'm aware that the upgrade command probably should not be expected to handle a downgrade. Still, I think some validation should be done to refuse this situation up front, if only to protect users who run into it by mistyping an image name.
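The validation suggested here could be as simple as comparing the target image's version against the running one before starting. A sketch under stated assumptions (hypothetical `is_downgrade` helper, not part of cephadm; it only handles the numeric parts of version strings like those above and ignores git-hash suffixes):

```python
def is_downgrade(current, target):
    """Return True if `target` is an older version than `current`.

    Minimal sketch: splits on '.' and '-', keeps the numeric fields,
    and compares them lexicographically. Real Ceph release strings
    would need fuller parsing rules than this.
    """
    def parts(version):
        return [int(p) for p in version.replace("-", ".").split(".") if p.isdigit()]
    return parts(target) < parts(current)

# A target of 15.2.0.108 on a 15.2.5-220 cluster would be refused:
if is_downgrade("15.2.5-220", "15.2.0.108"):
    print("refusing to downgrade; an explicit override flag could allow it")
```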