Bug #54419
`ceph orch upgrade start` seems to never reach completion
Status: Closed
% Done: 0%
Description
Pretty much consistently reproducible here - http://pulpito.front.sepia.ceph.com/yuriw-2022-02-25_15:53:18-fs-wip-yuri11-testing-2022-02-21-0831-quincy-distro-default-smithi/6705843/
Yaml matrix
fs/upgrade/mds_upgrade_sequence/{bluestore-bitmap centos_8.stream_container_tools conf/{client mds mon osd} overrides/{pg-warn syntax whitelist_health whitelist_wrongly_marked_down} roles tasks/{0-from/v16.2.4 1-volume/{0-create 1-ranks/2 2-allow_standby_replay/yes 3-inline/yes 4-verify} 2-client 3-upgrade-with-workload 4-verify}}
Upgrade starts:
2022-02-25T16:20:16.424 DEBUG:teuthology.orchestra.run.smithi133:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v16.2.4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 08be78d6-9656-11ec-8c35-001a4aab830c -e sha1=4fba29ce98c0f535f72d6211e12a92b0f5cc66df -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'
This check never seems to reach completion:
- cephadm.shell:
    env:
      - sha1
    host.a:
      - while ceph orch upgrade status | jq '.in_progress' | grep true ; do ceph orch ps ; ceph versions ; ceph fs dump ; sleep 30 ; done
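The loop's exit condition can be sketched in Python. This is a hedged illustration (the status payloads below are hypothetical, not captured from this run): the loop only looks at `in_progress`, and an upgrade that is stalled or paused still reports `in_progress: true`, so the check spins until the test times out.

```python
import json

def upgrade_still_running(status_json: str) -> bool:
    """Mirrors `ceph orch upgrade status | jq '.in_progress' | grep true`."""
    return bool(json.loads(status_json).get("in_progress", False))

# Hypothetical status payloads for illustration only:
active = json.dumps({"in_progress": True})                   # upgrade proceeding
paused = json.dumps({"in_progress": True, "paused": True})   # stalled, but still "in progress"
done   = json.dumps({"in_progress": False})                  # upgrade finished

assert upgrade_still_running(active)
assert upgrade_still_running(paused)   # why the loop above never exits
assert not upgrade_still_running(done)
```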
Last check info (`ceph orch ps`):
2022-02-25T22:34:15.621 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.620+0000 7fec97fff700  1 -- 172.21.15.133:0/2733944680 --> [v2:172.21.15.133:6800/3763011160,v1:172.21.15.133:6801/3763011160] -- mgr_command(tid 0: {"prefix": "orch ps", "target": ["mon-mgr", ""]}) v1 -- 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.628+0000 7fec7f7fe700  1 -- 172.21.15.133:0/2733944680 <== mgr.14162 v2:172.21.15.133:6800/3763011160 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+2992 (secure 0 0 0) 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stdout:NAME                         HOST       PORTS        STATUS        REFRESHED  AGE  VERSION                 IMAGE ID      CONTAINER ID
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:alertmanager.smithi133       smithi133  *:9093,9094  running (6h)  5m ago     6h   0.20.0                  0881eb8f169f  6e5319c197ce
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi133              smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  bcb7d2ac9bc5
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi140              smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ff644256fecb
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:grafana.smithi133            smithi133  *:3000       running (6h)  5m ago     6h   6.7.4                   557c83e11646  a3ea39cc9870
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.heswfq  smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  4872e1b9c65b
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.znzevk  smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  c7321edf1b47
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.hsukve  smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  a9aca818bda0
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.kdgefj  smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  51be41e99316
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi133.myobmx         smithi133  *:9283       running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  2c4687932e0d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi140.bjvbbe         smithi140  *:8443,9283  running (6h)  3m ago     6h   17.0.0-10430-g4fba29ce  049fbe5af4ba  e53ceb73c69d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi133                smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  119b013df37b
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi140                smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  2b43fb2a6c28
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi133      smithi133  *:9100       running (6h)  5m ago     6h   0.18.1                  e5a616e4b9cf  8c3a40d0e2e7
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi140      smithi140  *:9100       running (6h)  3m ago     6h   0.18.1                  e5a616e4b9cf  ec3bf7d18486
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.0                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  1fc8dffde333
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.1                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  943fe5d8ce93
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.2                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  700ff7f81ead
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.3                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ed20ffd50d9b
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.4                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  fb188f04ee5f
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.5                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ba02f87240e8
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:prometheus.smithi133         smithi133  *:9095       running (6h)  5m ago     6h   2.18.1                  de242295e225  b0a184237a7a
Only one ceph-mgr was upgraded to 17.*; the rest of the ceph daemons are still running 16.2.4 - not sure why.
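The skew is easy to tally from `ceph orch ps` output like the above. A small Python sketch (the daemon list is abridged from the log for illustration):

```python
from collections import Counter

# (daemon, version) pairs condensed from the `ceph orch ps` output above
daemons = [
    ("mgr.smithi140.bjvbbe", "17.0.0-10430-g4fba29ce"),  # the one upgraded daemon
    ("mgr.smithi133.myobmx", "16.2.4"),
    ("mon.smithi133", "16.2.4"),
    ("mon.smithi140", "16.2.4"),
    ("osd.0", "16.2.4"),
]

by_version = Counter(version for _, version in daemons)

# Only the standby mgr made it onto the new version; everything else is stuck.
assert by_version["17.0.0-10430-g4fba29ce"] == 1
assert by_version["16.2.4"] == len(daemons) - 1
```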
Updated by Venky Shankar about 2 years ago
Adam,
I did a cursory check for similar issues, but couldn't find any. There is tracker #54411, but that one has MDSs crashing.
MDSs and other daemons are still on 16.2.4 - what could cause this?
Cheers,
Venky
Updated by Venky Shankar about 2 years ago
Adam,
I spent some time looking into this:
The upgrade starts fine, with cephadm trying to update the standby ceph-mgr:
2022-03-09T14:26:46.050+0000 7fcf96cf6700  4 mgr get_store get_store key: mgr/cephadm/extra_ceph_conf
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] Have connection to smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] mgr.smithi174.vklqpz container image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85
2022-03-09T14:26:46.051+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] args: --image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85 deploy --fsid ceaf2912-9fb3-11ec-8c35-001a4aab830c --name mgr.smithi174.vklqpz --meta-json {"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]} --config-json - --tcp-ports 8443 9283 --allow-ptrace
Here, it probably tries to deploy (and redeploy?) ceph-mgr:
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] Have connection to smithi174
.....
2022-03-09T14:27:16.687+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:17.392+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] code: 0
2022-03-09T14:27:17.392+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] err: Redeploy daemon mgr.smithi174.vklqpz ...
2022-03-09T14:27:17.393+0000 7fcf96cf6700  1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/host.smithi174}] v 0) v1 -- 0x55ff583a4000 con 0x55ff56bb8400
Then, when it comes to upgrading itself, there is no standby ceph-mgr available:
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] Upgrade: Checking mgr daemons
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm INFO cephadm.upgrade] Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz container digest correct
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz not deployed by correct version
2022-03-09T14:27:28.828+0000 7fcf96cf6700  0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700 -1 log_channel(cephadm) log [ERR] : Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700  1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/upgrade_state}] v 0) v1 -- 0x55ff583a4600 con 0x55ff56bb8400
2022-03-09T14:27:28.838+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:28.838+0000 7fcfc3e79700 20 mgr update_kv_data set mgr/cephadm/upgrade_state = {"target_name": "quay.ceph.io/ceph-ci/ceph:e98697fdcb3b7b8eab3fc453719d4e18f0d62be4", "progress_id": "066fd2ec-6d47-45c0-ad4c-7c87aec0d07f", "target_id": "a26d38fa99d22957938f77f7d65fb1b93b80f520b00ecb8334618c543bd3d3a9", "target_digests": ["quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85"], "target_version": "17.0.0-11006-ge98697fd", "fs_original_max_mds": null, "error": "UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon", "paused": true}
2022-03-09T14:27:28.838+0000 7fcfc3e79700  1 -- 172.21.15.119:0/3384159902 <== mon.0 v2:172.21.15.119:3300/0 1753 ==== mon_command_ack([{prefix=config-key set, key=mgr/cephadm/upgrade_state}]=0 set mgr/cephadm/upgrade_state v134) v1 ==== 661+0+0 (secure 0 0 0) 0x55ff56c8f1e0 con 0x55ff56bb8400
... and the upgrade is "paused".
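Based on the "container digest correct" / "not deployed by correct version" lines above, the gating logic can be approximated as follows. This is a minimal Python sketch, not cephadm's actual code, and the digest values are truncated placeholders: before upgrading itself, the active mgr wants a standby whose container digest matches the target and whose `deployed_by` digests show it was deployed by an already-upgraded mgr. The freshly redeployed standby has the right image digest but was deployed by the old 16.2.4 mgr, so it fails the second check and the upgrade pauses with UPGRADE_NO_STANDBY_MGR.

```python
def standby_ok(standby: dict, target_digest: str) -> bool:
    """Approximation of the 'digest correct' and 'deployed by correct
    version' checks visible in the cephadm.upgrade log above."""
    digest_correct = standby["container_digest"] == target_digest
    deployed_by_correct = target_digest in standby["deployed_by"]
    return digest_correct and deployed_by_correct

TARGET = "sha256:0dacea6c..."  # target digest, truncated placeholder

# The redeployed standby: right image, but deployed by the old 16.2.4 mgr.
standby = {
    "container_digest": TARGET,
    "deployed_by": ["sha256:70536e31...", "sha256:54e95ae1..."],  # old digests
}

# Fails the deployed_by check -> UPGRADE_NO_STANDBY_MGR, upgrade paused.
assert standby_ok(standby, TARGET) is False
```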
However, the standby mgr seems to be up:
2022-03-09T14:27:17.003+0000 7fb0753eb000  0 ceph version 17.0.0-11006-ge98697fd (e98697fdcb3b7b8eab3fc453719d4e18f0d62be4) quincy (dev), process ceph-mgr, pid 7
2022-03-09T14:27:17.004+0000 7fb0753eb000  0 pidfile_write: ignore empty --pid-file
2022-03-09T14:27:17.006+0000 7fb0753eb000  1 Processor -- start
2022-03-09T14:27:17.006+0000 7fb0753eb000  1 -- start start
.....
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr tick tick
2022-03-09T14:27:36.461+0000 7fb06576b700 20 mgr send_beacon standby
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr send_beacon sending beacon as gid 24457
2022-03-09T14:27:36.462+0000 7fb06576b700  1 -- 172.21.15.174:0/2967250110 --> [v2:172.21.15.174:3300/0,v1:172.21.15.174:6789/0] -- mgrbeacon mgr.smithi174.vklqpz(ceaf2912-9fb3-11ec-8c35-001a4aab830c,24457, , 0) v10 -- 0x55d6ef1c2c80 con 0x55d6e6c5a800
... and it continues to send beacons (as standby) until the test times out and the daemons are terminated.
I'm not sure what's going on.
Updated by Laura Flores over 1 year ago
@Venky @Adam DC949 is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?
Updated by Adam King over 1 year ago
Laura Flores wrote:
@Venky @Adam DC949 is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?
Most likely, yes. I think this tracker and https://tracker.ceph.com/issues/57255 are just how the problem expresses itself before and after https://github.com/ceph/ceph/pull/45361, respectively.
Updated by Venky Shankar 6 months ago
- Related to Bug #57255: rados/cephadm/mds_upgrade_sequence, pacific : cephadm [ERR] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon added