Bug #54419

`ceph orch upgrade start` seems to never reach completion

Added by Venky Shankar 11 months ago. Updated 5 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
cephadm
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Pretty much consistently reproducible here - http://pulpito.front.sepia.ceph.com/yuriw-2022-02-25_15:53:18-fs-wip-yuri11-testing-2022-02-21-0831-quincy-distro-default-smithi/6705843/

YAML matrix:

fs/upgrade/mds_upgrade_sequence/{bluestore-bitmap centos_8.stream_container_tools conf/{client mds mon osd} overrides/{pg-warn syntax whitelist_health whitelist_wrongly_marked_down} roles tasks/{0-from/v16.2.4 1-volume/{0-create 1-ranks/2 2-allow_standby_replay/yes 3-inline/yes 4-verify} 2-client 3-upgrade-with-workload 4-verify}}

Upgrade starts:

2022-02-25T16:20:16.424 DEBUG:teuthology.orchestra.run.smithi133:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v16.2.4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 08be78d6-9656-11ec-8c35-001a4aab830c -e sha1=4fba29ce98c0f535f72d6211e12a92b0f5cc66df -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'

This check never seems to reach completion:

    - cephadm.shell:
        env:
        - sha1
        host.a:
        - while ceph orch upgrade status | jq '.in_progress' | grep true ; do ceph orch ps ; ceph versions ; ceph fs dump; sleep 30 ; done
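
The loop above polls indefinitely, so a stalled upgrade only surfaces when the whole teuthology job times out. As a rough sketch, an equivalent poller with an explicit iteration cap would fail fast instead; this is a hypothetical illustration, not teuthology code, and the cap and interval values are illustrative assumptions:

```python
# Hypothetical poller equivalent to the yaml task's while-loop, with an
# iteration cap so a stalled upgrade is reported as a failure rather than
# hanging until the suite-level timeout. Not teuthology code.
import json
import subprocess
import time

def wait_for_upgrade(max_polls=120, interval=30, run=subprocess.check_output):
    """Poll `ceph orch upgrade status`; return True once it completes."""
    for _ in range(max_polls):
        status = json.loads(run(["ceph", "orch", "upgrade", "status"]))
        if not status.get("in_progress"):
            return True
        # Same diagnostics the yaml task dumps between polls:
        for cmd in (["ceph", "orch", "ps"], ["ceph", "versions"],
                    ["ceph", "fs", "dump"]):
            run(cmd)
        time.sleep(interval)
    return False  # still in progress after max_polls: treat as a failure
```

The `run` parameter only exists to make the sketch self-contained and testable; in practice `subprocess.check_output` shells out to the real `ceph` CLI.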

Last check info (`ceph orch ps`):

2022-02-25T22:34:15.621 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.620+0000 7fec97fff700  1 -- 172.21.15.133:0/2733944680 --> [v2:172.21.15.133:6800/3763011160,v1:172.21.15.133:6801/3763011160] -- mgr_command(tid 0: {"prefix": "orch ps", "target": ["mon-mgr", ""]}) v1 -- 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.628+0000 7fec7f7fe700  1 -- 172.21.15.133:0/2733944680 <== mgr.14162 v2:172.21.15.133:6800/3763011160 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+2992 (secure 0 0 0) 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stdout:NAME                         HOST       PORTS        STATUS        REFRESHED  AGE  VERSION                 IMAGE ID      CONTAINER ID
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:alertmanager.smithi133       smithi133  *:9093,9094  running (6h)  5m ago     6h   0.20.0                  0881eb8f169f  6e5319c197ce
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi133              smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  bcb7d2ac9bc5
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi140              smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ff644256fecb
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:grafana.smithi133            smithi133  *:3000       running (6h)  5m ago     6h   6.7.4                   557c83e11646  a3ea39cc9870
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.heswfq  smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  4872e1b9c65b
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.znzevk  smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  c7321edf1b47
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.hsukve  smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  a9aca818bda0
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.kdgefj  smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  51be41e99316
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi133.myobmx         smithi133  *:9283       running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  2c4687932e0d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi140.bjvbbe         smithi140  *:8443,9283  running (6h)  3m ago     6h   17.0.0-10430-g4fba29ce  049fbe5af4ba  e53ceb73c69d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi133                smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  119b013df37b
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi140                smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  2b43fb2a6c28
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi133      smithi133  *:9100       running (6h)  5m ago     6h   0.18.1                  e5a616e4b9cf  8c3a40d0e2e7
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi140      smithi140  *:9100       running (6h)  3m ago     6h   0.18.1                  e5a616e4b9cf  ec3bf7d18486
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.0                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  1fc8dffde333
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.1                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  943fe5d8ce93
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.2                        smithi133               running (6h)  5m ago     6h   16.2.4                  8d91d370c2b8  700ff7f81ead
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.3                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ed20ffd50d9b
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.4                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  fb188f04ee5f
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.5                        smithi140               running (6h)  3m ago     6h   16.2.4                  8d91d370c2b8  ba02f87240e8
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:prometheus.smithi133         smithi133  *:9095       running (6h)  5m ago     6h   2.18.1                  de242295e225  b0a184237a7a

Only one ceph-mgr was upgraded to 17.*; the rest of the Ceph daemons are still running 16.2.4 - not sure why.

History

#1 Updated by Venky Shankar 11 months ago

Adam,

I did a cursory check for similar issues, but couldn't find any. There is tracker #54411, but that one has MDSs crashing.

MDSs and other daemons are still on 16.2.4 - what could cause this?

Cheers,
Venky

#2 Updated by Venky Shankar 11 months ago

Adam,

I spent some time looking into this:

The upgrade starts fine, with cephadm trying to upgrade the standby ceph-mgr:

2022-03-09T14:26:46.050+0000 7fcf96cf6700  4 mgr get_store get_store key: mgr/cephadm/extra_ceph_conf
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] Have connection to smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] mgr.smithi174.vklqpz container image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85
2022-03-09T14:26:46.051+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] args: --image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85 deploy --fsid ceaf2912-9fb3-11ec-8c35-001a4aab830c --name mgr.smithi174.vklqpz --meta-json {"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]} --config-json - --tcp-ports 8443 9283 --allow-ptrace

Here, it probably tries to deploy (and redeploy?) ceph-mgr:

2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700  0 [cephadm DEBUG root] Have connection to smithi174
.....
.....
.....
.....
.....
2022-03-09T14:27:16.687+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:17.392+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] code: 0
2022-03-09T14:27:17.392+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.serve] err: Redeploy daemon mgr.smithi174.vklqpz ...
2022-03-09T14:27:17.393+0000 7fcf96cf6700  1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/host.smithi174}] v 0) v1 -- 0x55ff583a4000 con 0x55ff56bb8400

Then, when it comes to upgrading itself, there is no standby ceph-mgr available:

2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] Upgrade: Checking mgr daemons
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm INFO cephadm.upgrade] Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 log_channel(cephadm) log [INF] : Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz container digest correct
2022-03-09T14:27:28.827+0000 7fcf96cf6700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz not deployed by correct version
2022-03-09T14:27:28.828+0000 7fcf96cf6700  0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700 -1 log_channel(cephadm) log [ERR] : Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700  1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/upgrade_state}] v 0) v1 -- 0x55ff583a4600 con 0x55ff56bb8400
2022-03-09T14:27:28.838+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:28.838+0000 7fcfc3e79700 20 mgr update_kv_data  set mgr/cephadm/upgrade_state = {"target_name": "quay.ceph.io/ceph-ci/ceph:e98697fdcb3b7b8eab3fc453719d4e18f0d62be4", "progress_id": "066fd2ec-6d47-45c0-ad4c-7c87aec0d07f", "target_id": "a26d38fa99d22957938f77f7d65fb1b93b80f520b00ecb8334618c543bd3d3a9", "target_digests": ["quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85"], "target_version": "17.0.0-11006-ge98697fd", "fs_original_max_mds": null, "error": "UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon", "paused": true}
2022-03-09T14:27:28.838+0000 7fcfc3e79700  1 -- 172.21.15.119:0/3384159902 <== mon.0 v2:172.21.15.119:3300/0 1753 ==== mon_command_ack([{prefix=config-key set, key=mgr/cephadm/upgrade_state}]=0 set mgr/cephadm/upgrade_state v134)=0 set mgr/cephadm/upgrade_state v134) v1 ==== 661+0+0 (secure 0 0 0) 0x55ff56c8f1e0 con 0x55ff56bb8400

... and the upgrade is "paused".
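
The log sequence above implies a gate: before the active mgr restarts itself, it requires a standby that both runs the target container digest and was "deployed by" the target version. Below is a simplified, hypothetical model of that check (not cephadm's actual code; the class fields and function names are assumptions) showing how stale `deployed_by` metadata alone could trigger the pause:

```python
# Hypothetical, simplified model of the mgr-upgrade gate seen in the log
# above. This is NOT cephadm's real code; the dataclass fields and the
# function are assumptions made to illustrate the failure mode.
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class MgrDaemon:
    name: str
    container_digest: str    # digest the daemon is actually running
    deployed_by: List[str]   # cached digests of the mgr that deployed it

def can_upgrade_active_mgr(standbys: List[MgrDaemon],
                           target_digests: Set[str]) -> Tuple[bool, str]:
    """Decide whether the active mgr may hand off and upgrade itself."""
    for d in standbys:
        if d.container_digest not in target_digests:
            continue  # the "container digest correct" check would fail here
        if not any(dep in target_digests for dep in d.deployed_by):
            # corresponds to "daemon ... not deployed by correct version"
            continue
        return True, f"standby {d.name} is on the target version"
    return False, "UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon"
```

Under this model, a standby that is genuinely running the target image but whose cached `deployed_by` list still names the old digests is skipped, and the upgrade pauses even though the standby keeps sending beacons.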

The standby mgr seems to be up, however:

2022-03-09T14:27:17.003+0000 7fb0753eb000  0 ceph version 17.0.0-11006-ge98697fd (e98697fdcb3b7b8eab3fc453719d4e18f0d62be4) quincy (dev), process ceph-mgr, pid 7
2022-03-09T14:27:17.004+0000 7fb0753eb000  0 pidfile_write: ignore empty --pid-file
2022-03-09T14:27:17.006+0000 7fb0753eb000  1  Processor -- start
2022-03-09T14:27:17.006+0000 7fb0753eb000  1 --  start start
.....
.....
.....
.....
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr tick tick
2022-03-09T14:27:36.461+0000 7fb06576b700 20 mgr send_beacon standby
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr send_beacon sending beacon as gid 24457
2022-03-09T14:27:36.462+0000 7fb06576b700  1 -- 172.21.15.174:0/2967250110 --> [v2:172.21.15.174:3300/0,v1:172.21.15.174:6789/0] -- mgrbeacon mgr.smithi174.vklqpz(ceaf2912-9fb3-11ec-8c35-001a4aab830c,24457, , 0) v10 -- 0x55d6ef1c2c80 con 0x55d6e6c5a800

... and continues to send beacons (as standby) until the test times out and the daemons are terminated.

I'm not sure what's going on.

#3 Updated by Venky Shankar 11 months ago

  • Pull request ID set to 45361

#4 Updated by Laura Flores 5 months ago

@Venky @Adam is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?

#5 Updated by Adam King 5 months ago

Laura Flores wrote:

@Venky @Adam is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?

Most likely, yes. I think this tracker and https://tracker.ceph.com/issues/57255 are just how the problem expresses itself before and after https://github.com/ceph/ceph/pull/45361.
