Laura Flores wrote:
Based on the error message in the mgr log, it looks like there is a problem with the mgr version:
/a/yuriw-2022-08-19_20:57:42-rados-wip-yuri6-testing-2022-08-19-0940-pacific-distro-default-smithi/6981184/remote/ubuntu@smithi124.front.sepia.ceph.com/log/2641a2e4-2007-11ed-8431-001a4aab830c/ceph-mgr.smithi124.pdzpkv.log.gz
[...]
There is no issue with mgr version, the problem is that there was no standby mgr in mgrmap when active mgr was getting upgraded.
mgrmap during job run :
$ zcat ./remote/ubuntu@smithi124.front.sepia.ceph.com/log/2641a2e4-2007-11ed-8431-001a4aab830c/ceph-mgr.smithi124.pdzpkv.log.gz|grep got_mgr_map
2022-08-19T21:38:11.413+0000 7f2849fc7700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.00365193s)
2022-08-19T21:38:12.413+0000 7f2849fc7700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.00326s)
2022-08-19T21:38:19.421+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0544349s)
2022-08-19T21:38:20.428+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.06162s)
2022-08-19T21:38:21.431+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s)
2022-08-19T21:38:43.105+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0547095s)
2022-08-19T21:38:44.104+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.05378s)
2022-08-19T21:38:45.975+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s)
2022-08-19T21:38:55.946+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0546631s)
2022-08-19T21:38:56.950+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.05845s)
2022-08-19T21:38:58.851+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s)
2022-08-19T21:39:02.757+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6s)
2022-08-19T21:40:00.667+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 64s), standbys: smithi138.iazeci
2022-08-19T21:45:14.974+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6m)
2022-08-19T21:45:23.100+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6m), standbys: smithi138.iazeci
2022-08-19T21:47:21.093+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 8m), standbys: smithi138.iazeci
Let's track down why there was no standby around 2022-08-19T21:45:14 and till 2022-08-19T21:45:23
Upgrade started at 2022-08-19T21:43:31
2022-08-19T21:43:31.906 DEBUG:teuthology.orchestra.run.smithi124:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v16.2.4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 2641a2e4-2007-11ed-8431-001a4aab830c -e sha1=67571cd8e74ee004835e626e065891d2f91abffa -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'
Upgrade yet to be initiated for mgr. Both mgr were running at 2022-08-19T21:43:36 :
2022-08-19T21:43:36.763 INFO:teuthology.orchestra.run.smithi124.stderr:2022-08-19T21:43:36.760+0000 7f3443fff700 1 -- 172.21.15.124:0/343422277 <== mgr.14162 v2:172.21.15.124:6800/2654071450 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+2728 (secure 0 0 0) 0x7f345c057f10 con 0x7f344405c300
2022-08-19T21:43:36.764 INFO:teuthology.orchestra.run.smithi124.stdout:NAME HOST PORTS STATUS REFRESHED AGE VERSION IMAGE ID CONTAINER ID
2022-08-19T21:43:36.764 INFO:teuthology.orchestra.run.smithi124.stdout:alertmanager.smithi124 smithi124 *:9093,9094 running (3m) 16s ago 4m 0.20.0 0881eb8f169f 11f1d2920b71
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:crash.smithi124 smithi124 running (4m) 16s ago 4m 16.2.4 8d91d370c2b8 717b26481548
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:crash.smithi138 smithi138 running (3m) 19s ago 3m 16.2.4 8d91d370c2b8 979a5d1dac14
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:grafana.smithi124 smithi124 *:3000 running (3m) 16s ago 4m 6.7.4 557c83e11646 15bd9e7ccae6
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi124.dvbrra smithi124 running (28s) 16s ago 28s 16.2.4 8d91d370c2b8 3b8d2b3a0d7c
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi124.vduioy smithi124 running (33s) 16s ago 32s 16.2.4 8d91d370c2b8 f667e53107a1
2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi138.ekgwds smithi138 running (31s) 19s ago 30s 16.2.4 8d91d370c2b8 b601f18ef5b9
2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi138.kxgbaf smithi138 running (35s) 19s ago 35s 16.2.4 8d91d370c2b8 272a9b77d8f9
2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mgr.smithi124.pdzpkv smithi124 *:9283 running (5m) 16s ago 5m 16.2.4 8d91d370c2b8 5e030cba4b77
2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mgr.smithi138.iazeci smithi138 *:8443,9283 running (3m) 19s ago 3m 16.2.4 8d91d370c2b8 b53c1efe18c1
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:mon.smithi124 smithi124 running (5m) 16s ago 5m 16.2.4 8d91d370c2b8 b6fde5c9302e
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:mon.smithi138 smithi138 running (3m) 19s ago 3m 16.2.4 8d91d370c2b8 820d1ed4e18a
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:node-exporter.smithi124 smithi124 *:9100 running (4m) 16s ago 4m 0.18.1 e5a616e4b9cf a45067e8714e
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:node-exporter.smithi138 smithi138 *:9100 running (3m) 19s ago 3m 0.18.1 e5a616e4b9cf 21ba865f8a0f
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:osd.0 smithi124 running (3m) 16s ago 3m 16.2.4 8d91d370c2b8 c10bc4caeb1b
2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.1 smithi124 running (2m) 16s ago 2m 16.2.4 8d91d370c2b8 b75a42c340ec
2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.2 smithi124 running (2m) 16s ago 2m 16.2.4 8d91d370c2b8 24f778d4d61b
2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.3 smithi138 running (2m) 19s ago 2m 16.2.4 8d91d370c2b8 749e01c0fd84
2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.4 smithi138 running (2m) 19s ago 2m 16.2.4 8d91d370c2b8 0676f04ba553
2022-08-19T21:43:36.769 INFO:teuthology.orchestra.run.smithi124.stdout:osd.5 smithi138 running (109s) 19s ago 108s 16.2.4 8d91d370c2b8 06ce538c727f
2022-08-19T21:43:36.769 INFO:teuthology.orchestra.run.smithi124.stdout:prometheus.smithi124 smithi124 *:9095 running (3m) 16s ago 4m 2.18.1 de242295e225 02898ed8a2a5
...
...
2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout:{
2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout: "mon": {
2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 2
2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout: },
2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout: "mgr": {
2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 2
2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout: },
2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout: "osd": {
2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 6
2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout: },
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: "mds": {
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 4
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: },
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: "overall": {
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 14
2022-08-19T21:43:37.198 INFO:teuthology.orchestra.run.smithi124.stdout: }
2022-08-19T21:43:37.198 INFO:teuthology.orchestra.run.smithi124.stdout:}
Image is getting pulled from quay :
2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout:{
2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout: "target_image": "quay.ceph.io/ceph-ci/ceph:67571cd8e74ee004835e626e065891d2f91abffa",
2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout: "in_progress": true,
2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout: "services_complete": [],
2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout: "progress": "",
2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout: "message": "Doing first pull of quay.ceph.io/ceph-ci/ceph:67571cd8e74ee004835e626e065891d2f91abffa image"
2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout:}
Upgrade status at 2022-08-19T21:44:41 :
2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout:{
2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout: "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:c3abbfab17590664cf1e876d5d9df481f898c57e041e218d4fc5ed50ef246bf2",
2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout: "in_progress": true,
2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout: "services_complete": [],
2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout: "progress": "0/21 ceph daemons upgraded",
2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout: "message": "Currently upgrading mgr daemons"
2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout:}
Mon detected "standby mgr started" at 2022-08-19T21:45:23, Why ?
2022-08-19T21:45:23.494 INFO:journalctl@ceph.mon.smithi124.smithi124.stdout:Aug 19 21:45:23 smithi124 ceph-2641a2e4-2007-11ed-8431-001a4aab830c-mon.smithi124[39035]: cluster 2022-08-19T21:45:22.453271+0000 mon.smithi124 (mon.0) 580 : cluster [DBG] Standby manager daemon smithi138.iazeci started
There was no standby mgr at 2022-08-19T21:45:22 which caused upgrade pause :
2022-08-19T21:45:22.591+0000 7f0703abe700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci container digest correct
2022-08-19T21:45:22.591+0000 7f0703abe700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci not deployed by correct version
2022-08-19T21:45:22.592+0000 7f0703abe700 0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
There was no beacon sent by standby mgr for more than 33 seconds
2022-08-19T21:44:41.334+0000 7fcd105d5700 10 mgr operator() config_callback: osd_pool_default_pg_autoscale_mode :
2022-08-19T21:44:42.369+0000 7fcd0d5cf700 10 mgr tick tick
2022-08-19T21:44:42.369+0000 7fcd0d5cf700 20 mgr send_beacon standby
2022-08-19T21:44:42.369+0000 7fcd0d5cf700 10 mgr send_beacon sending beacon as gid 14198
2022-08-19T21:44:42.369+0000 7fcd0d5cf700 1 -- 172.21.15.138:0/1169015240 --> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] -- mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10 -- 0x564f5b7c5680 con 0x564f53f50c00
...
<no logs till 2022-08-19T21:45:15>
standby mgr closed mon socket at 2022-08-19T21:45:15 :
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 msgr2=0x559e57769e00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 0x559e57769e00 secure :-1 s=READY pgs=42 cs=0 l=1 rev1=1 rx=0x559e5771c180 tx=0x559e577d07e0).stop
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 shutdown_connections
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] conn(0x559e576dd000 0x559e5776a300 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rev1=1 rx=0 tx=0).stop
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 0x559e57769e00 unknown :-1 s=CLOSED pgs=42 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 >> 172.21.15.138:0/2446933073 conn(0x559e576dc800 msgr2=0x559e583ee000 unknown :-1 s=STATE_NONE l=0).mark_down
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 shutdown_connections
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 wait complete
and restarted at 2022-08-19T21:45:15 :
2022-08-19T21:45:15.317+0000 7fd92d4ff680 0 set uid:gid to 167:167 (ceph:ceph)
2022-08-19T21:45:15.317+0000 7fd92d4ff680 0 ceph version 16.2.10-668-g67571cd8 (67571cd8e74ee004835e626e065891d2f91abffa) pacific (stable), process ceph-mgr, pid 7
2022-08-19T21:45:15.317+0000 7fd92d4ff680 0 pidfile_write: ignore empty --pid-file
2022-08-19T21:45:15.318+0000 7fd92d4ff680 1 Processor -- start
2022-08-19T21:45:15.318+0000 7fd92d4ff680 1 -- start start
2022-08-19T21:45:15.318+0000 7fd92d4ff680 4 mgr init Registered monc callback
From mon logs, it looks like standby mgr closed the socket connection 2022-08-19T21:44:42 :
2022-08-19T21:44:42.371+0000 7f5bc2229700 10 mon.smithi124@0(leader).paxosservice(mgr 1..17) dispatch 0x55bf3fba2280 mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10 from mgr.? 172.21.15.138:0/1169015240 con 0x55bf3e73bc00
2022-08-19T21:44:42.371+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 no_reply to mgr.? 172.21.15.138:0/1169015240 mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10
2022-08-19T21:44:42.371+0000 7f5bc2229700 20 is_capable service=mgr command= exec addr 172.21.15.138:0/1169015240 on cap allow profile mgr
2022-08-19T21:44:42.900+0000 7f5bc6a32700 1 -- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 msgr2=0x55bf3d8eb700 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 36
2022-08-19T21:44:42.900+0000 7f5bc6a32700 1 -- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 msgr2=0x55bf3d8eb700 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2022-08-19T21:44:42.900+0000 7f5bc6a32700 1 --2- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 0x55bf3d8eb700 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 rx=0x55bf3e8dd680 tx=0x55bf3e9c0900).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2022-08-19T21:44:42.900+0000 7f5bc6a32700 1 --2- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 0x55bf3d8eb700 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 rx=0x55bf3e8dd680 tx=0x55bf3e9c0900).stop
2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 ms_handle_reset 0x55bf3e73bc00 172.21.15.138:0/1169015240
2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 reset/close on session mgr.? 172.21.15.138:0/1169015240
2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 remove_session 0x55bf3efe0000 mgr.? 172.21.15.138:0/1169015240 features 0x3f01cfb9fffdffff
Not sure what could be the reason for standby mgr to close the socket connection with mon here as there are no debug_ms logs for standby mgr.
This is the function (in Pacific) that's checking for an upgrade.
src/pybind/mgr/cephadm/upgrade.py
[...]
We are likely hitting old tracker https://tracker.ceph.com/issues/54419 here for pacific release though the changes related to PR#45361 which fixes tracker#54419 has been backported to pacific in backport PR#45716 (https://github.com/ceph/ceph/pull/45716).
I tried to reproduce the issue 5 times with the same testcase running in --interactive-on-error mode but issue was not reproducible. We will need to revisit this bug again if reproducible on pacific.