Bug #54419

`ceph orch upgrade start` seems to never reach completion

Added by Venky Shankar 11 months ago. Updated 5 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
cephadm
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Pretty much consistently reproducible here - http://pulpito.front.sepia.ceph.com/yuriw-2022-02-25_15:53:18-fs-wip-yuri11-testing-2022-02-21-0831-quincy-distro-default-smithi/6705843/

YAML matrix:

fs/upgrade/mds_upgrade_sequence/{bluestore-bitmap centos_8.stream_container_tools conf/{client mds mon osd} overrides/{pg-warn syntax whitelist_health whitelist_wrongly_marked_down} roles tasks/{0-from/v16.2.4 1-volume/{0-create 1-ranks/2 2-allow_standby_replay/yes 3-inline/yes 4-verify} 2-client 3-upgrade-with-workload 4-verify}}

Upgrade starts:

2022-02-25T16:20:16.424 DEBUG:teuthology.orchestra.run.smithi133:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v16.2.4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 08be78d6-9656-11ec-8c35-001a4aab830c -e sha1=4fba29ce98c0f535f72d6211e12a92b0f5cc66df -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'

This check never seems to reach completion:

    - cephadm.shell:
        env:
        - sha1
        host.a:
        - while ceph orch upgrade status | jq '.in_progress' | grep true ; do ceph orch ps ; ceph versions ; ceph fs dump; sleep 30 ; done
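
The loop above polls indefinitely, so a stalled upgrade only surfaces when the whole teuthology job times out. As a rough sketch, an equivalent poller with an explicit iteration cap would fail fast instead; this is a hypothetical illustration, not teuthology code, and the cap and interval values are illustrative assumptions:

```python
# Hypothetical poller equivalent to the yaml task's while-loop, with an
# iteration cap so a stalled upgrade is reported as a failure rather than
# hanging until the suite-level timeout. Not teuthology code.
import json
import subprocess
import time

def wait_for_upgrade(max_polls=120, interval=30, run=subprocess.check_output):
    """Poll `ceph orch upgrade status`; return True once it completes."""
    for _ in range(max_polls):
        status = json.loads(run(["ceph", "orch", "upgrade", "status"]))
        if not status.get("in_progress"):
            return True
        # Same diagnostics the yaml task dumps between polls:
        for cmd in (["ceph", "orch", "ps"], ["ceph", "versions"],
                    ["ceph", "fs", "dump"]):
            run(cmd)
        time.sleep(interval)
    return False  # still in progress after max_polls: treat as a failure
```

The `run` parameter only exists to make the sketch self-contained and testable; in practice `subprocess.check_output` shells out to the real `ceph` CLI.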

Last check info (`ceph orch ps`):

2022-02-25T22:34:15.621 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.620+0000 7fec97fff700  1 -- 172.21.15.133:0/2733944680 --> [v2:172.21.15.133:6800/3763011160,v1:172.21.15.133:6801/3763011160] -- mgr_command(tid 0: {"prefix": "orch ps", "target": ["mon-mgr", ""]}) v1 -- 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.628+0000 7fec7f7fe700  1 -- 172.21.15.133:0/2733944680 <== mgr.14162 v2:172.21.15.133:6800/3763011160 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+2992 (secure 0 0 0) 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stdout:NAME                         HOST       PORTS        STATUS        REFRESHED  AGE  VERSION                 IMAGE ID      CONTAINER ID
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:alertmanager.smithi133       smithi133  *:9093,9094  running (6h)  5m ago     6h   0.20.0                  0881eb8f169f  6e5319c197ce
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi133              smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  bcb7d2ac9bc5
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi140              smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ff644256fecb
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:grafana.smithi133            smithi133  *:3000       running (6h)  5m ago     6h   6.7.4                   557c83e11646  a3ea39cc9870
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.heswfq  smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  4872e1b9c65b
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.znzevk  smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  c7321edf1b47
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.hsukve  smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  a9aca818bda0
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.kdgefj  smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  51be41e99316
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi133.myobmx         smithi133  *:9283       running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  2c4687932e0d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi140.bjvbbe         smithi140  *:8443,9283  running (6h)  3m ago     6h   17.0.0-10430-g4fba29ce  049fbe5af4ba  e53ceb73c69d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi133                smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  119b013df37b
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi140                smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  2b43fb2a6c28
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi133      smithi133  *:9100       running (6h)  5m ago     6h   0.18.1                  e5a616e4b9cf  8c3a40d0e2e7
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi140      smithi140  *:9100       running (6h)  3m ago     6h   0.18.1                  e5a616e4b9cf  ec3bf7d18486
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.0                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  1fc8dffde333
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.1                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  943fe5d8ce93
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.2                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  700ff7f81ead
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.3                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ed20ffd50d9b
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.4                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  fb188f04ee5f
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.5                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ba02f87240e8
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:prometheus.smithi133         smithi133  *:9095       running (6h)  5m ago     6h   2.18.1                  de242295e225  b0a184237a7a

Only one ceph-mgr was upgraded to 17.*; the rest of the Ceph daemons are still running 16.2.4 - not sure why.

History

#1 Updated by Venky Shankar 11 months ago

Adam,

I did a cursory check for similar issues, but couldn't find any. There is tracker #54411, but that one has MDSs crashing.

MDSs and other daemons are still on 16.2.4 - what could cause this?

Cheers,
Venky

#2 Updated by Venky Shankar 11 months ago

Adam,

I spent some time looking into this:

The upgrade starts fine, with cephadm trying to upgrade the standby ceph-mgr:

2022-03-09T14:26:46.050+0000 7fcf96cf6700  4 mgr get_store get_store key: mgr/cephadm/extra_ceph_conf
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] Have connection to smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] mgr.smithi174.vklqpz container image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85
2022-03-09T14:26:46.051+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] args: --image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85 deploy --fsid ceaf2912-9fb3-11ec-8c35-001a4aab830c --name mgr.smithi174.vklqpz --meta-json {"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]} --config-json - --tcp-ports 8443 9283 --allow-ptrace

Here, it probably tries to deploy (and redeploy?) ceph-mgr:

2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] Have connection to smithi174
.....
.....
.....
.....
.....
2022-03-09T14:27:16.687+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:17.392+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] code: 0
2022-03-09T14:27:17.392+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] err: Redeploy daemon mgr.smithi174.vklqpz ...
2022-03-09T14:27:17.393+0000 7fcf96cf6700  1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/host.smithi174}] v 0) v1 -- 0x55ff583a4000 con 0x55ff56bb8400

Then, when it comes to upgrading itself, there is no standby ceph-mgr available:

2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] Upgrade: Checking mgr daemons
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm INFO cephadm.upgrade] Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz container digest correct
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz not deployed by correct version
2022-03-09T14:27:28.828+0000 7fcf96cf6700  0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700 -1 log_channel(cephadm) log [ERR] : Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700  1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/upgrade_state}] v 0) v1 -- 0x55ff583a4600 con 0x55ff56bb8400
2022-03-09T14:27:28.838+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:28.838+0000 7fcfc3e79700 20 mgr update_kv_data  set mgr/cephadm/upgrade_state = {"target_name": "quay.ceph.io/ceph-ci/ceph:e98697fdcb3b7b8eab3fc453719d4e18f0d62be4", "progress_id": "066fd2ec-6d47-45c0-ad4c-7c87aec0d07f", "target_id": "a26d38fa99d22957938f77f7d65fb1b93b80f520b00ecb8334618c543bd3d3a9", "target_digests": ["quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85"], "target_version": "17.0.0-11006-ge98697fd", "fs_original_max_mds": null, "error": "UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon", "paused": true}
2022-03-09T14:27:28.838+0000 7fcfc3e79700  1 -- 172.21.15.119:0/3384159902 <== mon.0 v2:172.21.15.119:3300/0 1753 ==== mon_command_ack([{prefix=config-key set, key=mgr/cephadm/upgrade_state}]=0 set mgr/cephadm/upgrade_state v134)=0 set mgr/cephadm/upgrade_state v134) v1 ==== 661+0+0 (secure 0 0 0) 0x55ff56c8f1e0 con 0x55ff56bb8400

... and the upgrade is "paused".
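
The log sequence above implies a gate: before the active mgr restarts itself, it requires a standby that both runs the target container digest and was "deployed by" the target version. Below is a simplified, hypothetical model of that check (not cephadm's actual code; the class fields and function names are assumptions) showing how stale `deployed_by` metadata alone could trigger the pause:

```python
# Hypothetical, simplified model of the mgr-upgrade gate seen in the log
# above. This is NOT cephadm's real code; the dataclass fields and the
# function are assumptions made to illustrate the failure mode.
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class MgrDaemon:
    name: str
    container_digest: str    # digest the daemon is actually running
    deployed_by: List[str]   # cached digests of the mgr that deployed it

def can_upgrade_active_mgr(standbys: List[MgrDaemon],
                           target_digests: Set[str]) -> Tuple[bool, str]:
    """Decide whether the active mgr may hand off and upgrade itself."""
    for d in standbys:
        if d.container_digest not in target_digests:
            continue  # the "container digest correct" check would fail here
        if not any(dep in target_digests for dep in d.deployed_by):
            # corresponds to "daemon ... not deployed by correct version"
            continue
        return True, f"standby {d.name} is on the target version"
    return False, "UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon"
```

Under this model, a standby that is genuinely running the target image but whose cached `deployed_by` list still names the old digests is skipped, and the upgrade pauses even though the standby keeps sending beacons.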

The standby mgr seems to be up, however:

2022-03-09T14:27:17.003+0000 7fb0753eb000  0 ceph version 17.0.0-11006-ge98697fd (e98697fdcb3b7b8eab3fc453719d4e18f0d62be4) quincy (dev), process ceph-mgr, pid 7
2022-03-09T14:27:17.004+0000 7fb0753eb000  0 pidfile_write: ignore empty --pid-file
2022-03-09T14:27:17.006+0000 7fb0753eb000  1  Processor -- start
2022-03-09T14:27:17.006+0000 7fb0753eb000  1 --  start start
.....
.....
.....
.....
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr tick tick
2022-03-09T14:27:36.461+0000 7fb06576b700 20 mgr send_beacon standby
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr send_beacon sending beacon as gid 24457
2022-03-09T14:27:36.462+0000 7fb06576b700  1 -- 172.21.15.174:0/2967250110 --> [v2:172.21.15.174:3300/0,v1:172.21.15.174:6789/0] -- mgrbeacon mgr.smithi174.vklqpz(ceaf2912-9fb3-11ec-8c35-001a4aab830c,24457, , 0) v10 -- 0x55d6ef1c2c80 con 0x55d6e6c5a800

... and continues to send beacons (as standby) until the test times out and the daemons are terminated.

I'm not sure what's going on.

#3 Updated by Venky Shankar 11 months ago

  • Pull request ID set to 45361

#4 Updated by Laura Flores 5 months ago

@Venky @Adam is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?

#5 Updated by Adam King 5 months ago

Laura Flores wrote:

@Venky @Adam is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?

Most likely, yes. I think this tracker and https://tracker.ceph.com/issues/57255 are just how the problem expresses itself before and after https://github.com/ceph/ceph/pull/45361.
