Bug #54419
`ceph orch upgrade start` seems to never reach completion
Description
Pretty much consistently reproducible here - http://pulpito.front.sepia.ceph.com/yuriw-2022-02-25_15:53:18-fs-wip-yuri11-testing-2022-02-21-0831-quincy-distro-default-smithi/6705843/
YAML matrix:
fs/upgrade/mds_upgrade_sequence/{bluestore-bitmap centos_8.stream_container_tools conf/{client mds mon osd} overrides/{pg-warn syntax whitelist_health whitelist_wrongly_marked_down} roles tasks/{0-from/v16.2.4 1-volume/{0-create 1-ranks/2 2-allow_standby_replay/yes 3-inline/yes 4-verify} 2-client 3-upgrade-with-workload 4-verify}}
Upgrade starts:
2022-02-25T16:20:16.424 DEBUG:teuthology.orchestra.run.smithi133:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v16.2.4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 08be78d6-9656-11ec-8c35-001a4aab830c -e sha1=4fba29ce98c0f535f72d6211e12a92b0f5cc66df -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'
This check never seems to reach completion:
- cephadm.shell:
    env:
      - sha1
    host.a:
      - while ceph orch upgrade status | jq '.in_progress' | grep true ; do ceph orch ps ; ceph versions ; ceph fs dump; sleep 30 ; done
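(For reference, the loop above keys off the JSON emitted by `ceph orch upgrade status`. A minimal Python equivalent of the polling check, just to make the mechanics explicit; nothing beyond the `in_progress` field is relied on:)

    import json
    import subprocess
    import time

    def upgrade_in_progress() -> bool:
        # "ceph orch upgrade status" prints a JSON object; "in_progress"
        # is the flag the teuthology loop above greps for.
        status = json.loads(subprocess.check_output(["ceph", "orch", "upgrade", "status"]))
        return bool(status.get("in_progress"))

    while upgrade_in_progress():
        for cmd in (["ceph", "orch", "ps"], ["ceph", "versions"], ["ceph", "fs", "dump"]):
            subprocess.run(cmd)
        time.sleep(30)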
Last check info (`ceph orch ps`):
2022-02-25T22:34:15.621 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.620+0000 7fec97fff700 1 -- 172.21.15.133:0/2733944680 --> [v2:172.21.15.133:6800/3763011160,v1:172.21.15.133:6801/3763011160] -- mgr_command(tid 0: {"prefix": "orch ps", "target": ["mon-mgr", ""]}) v1 -- 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.628+0000 7fec7f7fe700 1 -- 172.21.15.133:0/2733944680 <== mgr.14162 v2:172.21.15.133:6800/3763011160 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+2992 (secure 0 0 0) 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stdout:NAME HOST PORTS STATUS REFRESHED AGE VERSION IMAGE ID CONTAINER ID
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:alertmanager.smithi133 smithi133 *:9093,9094 running (6h) 5m ago 6h 0.20.0 0881eb8f169f 6e5319c197ce
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi133 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 bcb7d2ac9bc5
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi140 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 ff644256fecb
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:grafana.smithi133 smithi133 *:3000 running (6h) 5m ago 6h 6.7.4 557c83e11646 a3ea39cc9870
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.heswfq smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 4872e1b9c65b
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.znzevk smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 c7321edf1b47
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.hsukve smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 a9aca818bda0
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.kdgefj smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 51be41e99316
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi133.myobmx smithi133 *:9283 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 2c4687932e0d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi140.bjvbbe smithi140 *:8443,9283 running (6h) 3m ago 6h 17.0.0-10430-g4fba29ce 049fbe5af4ba e53ceb73c69d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi133 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 119b013df37b
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi140 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 2b43fb2a6c28
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi133 smithi133 *:9100 running (6h) 5m ago 6h 0.18.1 e5a616e4b9cf 8c3a40d0e2e7
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi140 smithi140 *:9100 running (6h) 3m ago 6h 0.18.1 e5a616e4b9cf ec3bf7d18486
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.0 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 1fc8dffde333
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.1 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 943fe5d8ce93
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.2 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 700ff7f81ead
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.3 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 ed20ffd50d9b
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.4 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 fb188f04ee5f
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.5 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 ba02f87240e8
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:prometheus.smithi133 smithi133 *:9095 running (6h) 5m ago 6h 2.18.1 de242295e225 b0a184237a7a
Only one ceph-mgr was upgraded to 17.*; the rest of the ceph daemons are still running 16.2.4 - not sure why.
History
#1 Updated by Venky Shankar 11 months ago
Adam,
I did a cursory check for similar issues, but couldn't find any. There is tracker #54411, but that one has MDSs crashing.
MDSs and other daemons are still on 16.2.4 - what could cause this?
Cheers,
Venky
#2 Updated by Venky Shankar 11 months ago
Adam,
I spent some time looking into this:
The upgrade starts fine, with cephadm trying to update the standby ceph-mgr:
2022-03-09T14:26:46.050+0000 7fcf96cf6700 4 mgr get_store get_store key: mgr/cephadm/extra_ceph_conf
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG root] Have connection to smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG root] mgr.smithi174.vklqpz container image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85
2022-03-09T14:26:46.051+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] args: --image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85 deploy --fsid ceaf2912-9fb3-11ec-8c35-001a4aab830c --name mgr.smithi174.vklqpz --meta-json {"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]} --config-json - --tcp-ports 8443 9283 --allow-ptrace
Here, it probably tries to deploy (and redeploy?) ceph-mgr:
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG root] Have connection to smithi174
.....
2022-03-09T14:27:16.687+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:17.392+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] code: 0
2022-03-09T14:27:17.392+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] err: Redeploy daemon mgr.smithi174.vklqpz ...
2022-03-09T14:27:17.393+0000 7fcf96cf6700 1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/host.smithi174}] v 0) v1 -- 0x55ff583a4000 con 0x55ff56bb8400
Then, when it comes to upgrading itself, there is no standby ceph-mgr available:
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.upgrade] Upgrade: Checking mgr daemons
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 [cephadm INFO cephadm.upgrade] Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 log_channel(cephadm) log [INF] : Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz container digest correct
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz not deployed by correct version
2022-03-09T14:27:28.828+0000 7fcf96cf6700 0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700 -1 log_channel(cephadm) log [ERR] : Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700 1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/upgrade_state}] v 0) v1 -- 0x55ff583a4600 con 0x55ff56bb8400
2022-03-09T14:27:28.838+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:28.838+0000 7fcfc3e79700 20 mgr update_kv_data set mgr/cephadm/upgrade_state = {"target_name": "quay.ceph.io/ceph-ci/ceph:e98697fdcb3b7b8eab3fc453719d4e18f0d62be4", "progress_id": "066fd2ec-6d47-45c0-ad4c-7c87aec0d07f", "target_id": "a26d38fa99d22957938f77f7d65fb1b93b80f520b00ecb8334618c543bd3d3a9", "target_digests": ["quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85"], "target_version": "17.0.0-11006-ge98697fd", "fs_original_max_mds": null, "error": "UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon", "paused": true}
2022-03-09T14:27:28.838+0000 7fcfc3e79700 1 -- 172.21.15.119:0/3384159902 <== mon.0 v2:172.21.15.119:3300/0 1753 ==== mon_command_ack([{prefix=config-key set, key=mgr/cephadm/upgrade_state}]=0 set mgr/cephadm/upgrade_state v134) v1 ==== 661+0+0 (secure 0 0 0) 0x55ff56c8f1e0 con 0x55ff56bb8400
... and the upgrade is "paused".
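If I'm reading the log right, the gate before upgrading the active mgr amounts to something like the sketch below. This is illustrative only, not the actual cephadm implementation; `standby_is_on_target` is a hypothetical stand-in for cephadm's real digest + `deployed_by` metadata check:

    import json
    import subprocess

    def standby_is_on_target(daemon_name: str, target_digests: set) -> bool:
        # Hypothetical stand-in for cephadm's real check, which compares the
        # standby's container image digest and its "deployed_by" metadata
        # against the upgrade target. In the run above, mgr.smithi174.vklqpz
        # passed the digest check ("container digest correct") but failed the
        # metadata check ("not deployed by correct version").
        return False

    def must_pause_for_standby(target_digests: set) -> bool:
        # "ceph mgr dump" lists the active mgr and any registered standbys.
        mgr_map = json.loads(subprocess.check_output(["ceph", "mgr", "dump"]))
        standbys = mgr_map.get("standbys", [])
        return not any(standby_is_on_target(s["name"], target_digests) for s in standbys)

    targets = {"quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85"}
    if must_pause_for_standby(targets):
        print("UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon")

So even though a standby exists, it never qualifies as upgrade-ready, and the upgrade stays paused.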
The standby mgr seems to be up, however:
2022-03-09T14:27:17.003+0000 7fb0753eb000 0 ceph version 17.0.0-11006-ge98697fd (e98697fdcb3b7b8eab3fc453719d4e18f0d62be4) quincy (dev), process ceph-mgr, pid 7
2022-03-09T14:27:17.004+0000 7fb0753eb000 0 pidfile_write: ignore empty --pid-file
2022-03-09T14:27:17.006+0000 7fb0753eb000 1 Processor -- start
2022-03-09T14:27:17.006+0000 7fb0753eb000 1 -- start start
.....
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr tick tick
2022-03-09T14:27:36.461+0000 7fb06576b700 20 mgr send_beacon standby
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr send_beacon sending beacon as gid 24457
2022-03-09T14:27:36.462+0000 7fb06576b700 1 -- 172.21.15.174:0/2967250110 --> [v2:172.21.15.174:3300/0,v1:172.21.15.174:6789/0] -- mgrbeacon mgr.smithi174.vklqpz(ceaf2912-9fb3-11ec-8c35-001a4aab830c,24457, , 0) v10 -- 0x55d6ef1c2c80 con 0x55d6e6c5a800
... and it continues to send beacons (as standby) until the test times out and the daemons are terminated.
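For anyone re-running this, the mon's view can be cross-checked from outside the daemon log (a diagnostic sketch; `ceph mgr fail` is the usual manual nudge when an upgrade is stuck waiting on a mgr failover):

    import json
    import subprocess

    # If the new-version daemon shows up under "standbys" while the active mgr
    # is still the old version, the mon agrees with the daemon log above: the
    # standby registered and is beaconing, but cephadm refused to count it as
    # upgrade-ready.
    mgr_map = json.loads(subprocess.check_output(["ceph", "mgr", "dump"]))
    print("active  :", mgr_map.get("active_name"))
    print("standbys:", [s["name"] for s in mgr_map.get("standbys", [])])

    # Manually failing the active mgr forces the standby to take over:
    # subprocess.run(["ceph", "mgr", "fail", mgr_map["active_name"]])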
I'm not sure what's going on.
#3 Updated by Venky Shankar 11 months ago
- Pull request ID set to 45361
#4 Updated by Laura Flores 5 months ago
@Venky @Adam is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?
#5 Updated by Adam King 5 months ago
Laura Flores wrote:
@Venky @Adam is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?
Most likely, yes. I think this tracker and https://tracker.ceph.com/issues/57255 are just how the problem expresses itself before and after https://github.com/ceph/ceph/pull/45361, respectively.