Bug #57255
openrados/cephadm/mds_upgrade_sequence, pacific : cephadm [ERR] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
0%
Description
mgrmap from mon log:
2022-08-19T21:38:01.213+0000 7f749d751700 0 log_channel(cluster) log [DBG] : mgrmap e1: no daemons active 2022-08-19T21:38:04.885+0000 7f5bd8c0c700 0 log_channel(cluster) log [DBG] : mgrmap e1: no daemons active 2022-08-19T21:38:11.412+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e2: smithi124.pdzpkv(active, starting, since 0.00261735s) 2022-08-19T21:38:12.415+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e3: smithi124.pdzpkv(active, since 1.00527s) 2022-08-19T21:38:14.417+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e4: smithi124.pdzpkv(active, since 3s) 2022-08-19T21:38:19.420+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e5: smithi124.pdzpkv(active, starting, since 0.0531152s) 2022-08-19T21:38:20.430+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e6: smithi124.pdzpkv(active, since 1.0637s) 2022-08-19T21:38:21.433+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e7: smithi124.pdzpkv(active, since 2s) 2022-08-19T21:38:38.154+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e8: smithi124.pdzpkv(active, since 18s) 2022-08-19T21:38:43.104+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e9: smithi124.pdzpkv(active, starting, since 0.0530632s) 2022-08-19T21:38:44.106+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e10: smithi124.pdzpkv(active, since 1.05519s) 2022-08-19T21:38:45.976+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e11: smithi124.pdzpkv(active, since 2s) 2022-08-19T21:38:50.884+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e12: smithi124.pdzpkv(active, since 7s) 2022-08-19T21:38:55.945+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e13: smithi124.pdzpkv(active, starting, since 0.053293s) 2022-08-19T21:38:56.952+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e14: smithi124.pdzpkv(active, since 1.06024s) 2022-08-19T21:38:58.852+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e15: smithi124.pdzpkv(active, since 2s) 2022-08-19T21:39:02.758+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e16: smithi124.pdzpkv(active, since 6s) 2022-08-19T21:40:00.612+0000 7f5bc2229700 0 log_channel(cluster) log [DBG] : mgrmap e16: smithi124.pdzpkv(active, since 64s) 2022-08-19T21:40:00.669+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e17: smithi124.pdzpkv(active, since 64s), standbys: smithi138.iazeci 2022-08-19T21:45:14.976+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e18: smithi124.pdzpkv(active, since 6m) 2022-08-19T21:45:23.103+0000 7f5bc0a26700 0 log_channel(cluster) log [DBG] : mgrmap e19: smithi124.pdzpkv(active, since 6m), standbys: smithi138.iazeci
mon log :
2022-08-19T21:45:22.600+0000 7f5bc2229700 20 mon.smithi124@0(leader).mgrstat health checks: { "UPGRADE_NO_STANDBY_MGR": { "severity": "HEALTH_WARN", "summary": { "message": "Upgrade: Need standby mgr daemon", "count": 1 }, "detail": [ { "message": "The upgrade process needs to upgrade the mgr, but it needs at least one standby to proceed." } ] } } 2022-08-19T21:45:22.600+0000 7f5bc2229700 20 mon.smithi124@0(leader).mgrstat progress events: { "dc77e5f1-4e84-448e-9ca0-a8322772fa2b": { "message": "Upgrade to 16.2.10-668-g67571cd8 (33s)\n [............................] ", "progress": 0, "add_to_ceph_s": true } } ... 2022-08-19T21:45:22.987+0000 7f5bc2229700 10 mon.smithi124@0(leader).log v297 logging 2022-08-19T21:45:22.592032+0000 mgr.smithi124.pdzpkv (mgr.14162) 288 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.smithi124.pdzpkv) 2022-08-19T21:45:22.987+0000 7f5bc2229700 10 mon.smithi124@0(leader).log v297 logging 2022-08-19T21:45:22.593440+0000 mgr.smithi124.pdzpkv (mgr.14162) 289 : cephadm [ERR] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
Updated by Prashant D over 1 year ago
Updated by Laura Flores over 1 year ago
Based on the error message in the mgr log, it looks like there is a problem with the mgr version:
/a/yuriw-2022-08-19_20:57:42-rados-wip-yuri6-testing-2022-08-19-0940-pacific-distro-default-smithi/6981184/remote/ubuntu@smithi124.front.sepia.ceph.com/log/2641a2e4-2007-11ed-8431-001a4aab830c/ceph-mgr.smithi124.pdzpkv.log.gz
2022-08-19T21:45:22.590+0000 7f070d053700 20 mgr ~Gil Destroying new thread state 0x55dac52f8000
2022-08-19T21:45:22.591+0000 7f06e227b700 0 [dashboard DEBUG notification_queue] processing queue: 1
2022-08-19T21:45:22.591+0000 7f0703abe700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci container digest correct
2022-08-19T21:45:22.591+0000 7f0703abe700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci not deployed by correct version
2022-08-19T21:45:22.592+0000 7f0703abe700 0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-08-19T21:45:22.592+0000 7f0703abe700 -1 log_channel(cephadm) log [ERR] : Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
This is the function (in Pacific) that's checking for an upgrade.
src/pybind/mgr/cephadm/upgrade.py
def _detect_need_upgrade(self, daemons: List[DaemonDescription], target_digests: Optional[List[str]] = None) -> Tuple[bool, List[Tuple[DaemonDescription, bool]], List[Tuple[DaemonDescription, bool]], int]:
# this function takes a list of daemons and container digests. The purpose
# is to go through each daemon and check if the current container digests
# for that daemon match the target digests. The purpose being that we determine
# if a daemon is upgraded to a certain container image or not based on what
# container digests it has. By checking the current digests against the
# targets we can determine which daemons still need to be upgraded
need_upgrade_self = False
need_upgrade: List[Tuple[DaemonDescription, bool]] = []
need_upgrade_deployer: List[Tuple[DaemonDescription, bool]] = []
done = 0
if target_digests is None:
target_digests = []
for d in daemons:
assert d.daemon_type is not None
assert d.daemon_id is not None
assert d.hostname is not None
correct_digest = False
if (any(d in target_digests for d in (d.container_image_digests or []))
or d.daemon_type in MONITORING_STACK_TYPES):
logger.debug('daemon %s.%s container digest correct' % (
d.daemon_type, d.daemon_id))
correct_digest = True
if any(d in target_digests for d in (d.deployed_by or [])):
logger.debug('daemon %s.%s deployed by correct version' % (
d.daemon_type, d.daemon_id))
done += 1
continue
if self.mgr.daemon_is_self(d.daemon_type, d.daemon_id):
logger.info('Upgrade: Need to upgrade myself (mgr.%s)' %
self.mgr.get_mgr_id())
need_upgrade_self = True
continue
if correct_digest:
logger.debug('daemon %s.%s not deployed by correct version' % (
d.daemon_type, d.daemon_id))
need_upgrade_deployer.append((d, True))
else:
logger.debug('daemon %s.%s not correct (%s, %s, %s)' % (
d.daemon_type, d.daemon_id,
d.container_image_name, d.container_image_digests, d.version))
need_upgrade.append((d, False))
return (need_upgrade_self, need_upgrade, need_upgrade_deployer, done)
Updated by Prashant D over 1 year ago
Laura Flores wrote:
Based on the error message in the mgr log, it looks like there is a problem with the mgr version:
/a/yuriw-2022-08-19_20:57:42-rados-wip-yuri6-testing-2022-08-19-0940-pacific-distro-default-smithi/6981184/remote/ubuntu@smithi124.front.sepia.ceph.com/log/2641a2e4-2007-11ed-8431-001a4aab830c/ceph-mgr.smithi124.pdzpkv.log.gz
[...]
There is no issue with mgr version, the problem is that there was no standby mgr in mgrmap when active mgr was getting upgraded.
mgrmap during job run :
$ zcat ./remote/ubuntu@smithi124.front.sepia.ceph.com/log/2641a2e4-2007-11ed-8431-001a4aab830c/ceph-mgr.smithi124.pdzpkv.log.gz|grep got_mgr_map 2022-08-19T21:38:11.413+0000 7f2849fc7700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.00365193s) 2022-08-19T21:38:12.413+0000 7f2849fc7700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.00326s) 2022-08-19T21:38:19.421+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0544349s) 2022-08-19T21:38:20.428+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.06162s) 2022-08-19T21:38:21.431+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s) 2022-08-19T21:38:43.105+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0547095s) 2022-08-19T21:38:44.104+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.05378s) 2022-08-19T21:38:45.975+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s) 2022-08-19T21:38:55.946+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0546631s) 2022-08-19T21:38:56.950+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.05845s) 2022-08-19T21:38:58.851+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s) 2022-08-19T21:39:02.757+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6s) 2022-08-19T21:40:00.667+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 64s), standbys: smithi138.iazeci 2022-08-19T21:45:14.974+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6m) 2022-08-19T21:45:23.100+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6m), standbys: smithi138.iazeci 2022-08-19T21:47:21.093+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 8m), standbys: smithi138.iazeci
Let's track down why there was no standby around 2022-08-19T21:45:14 and till 2022-08-19T21:45:23
Upgrade started at 2022-08-19T21:43:31
2022-08-19T21:43:31.906 DEBUG:teuthology.orchestra.run.smithi124:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v16.2.4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 2641a2e4-2007-11ed-8431-001a4aab830c -e sha1=67571cd8e74ee004835e626e065891d2f91abffa -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'
Upgrade yet to be initiated for mgr. Both mgr were running at 2022-08-19T21:43:36 :
2022-08-19T21:43:36.763 INFO:teuthology.orchestra.run.smithi124.stderr:2022-08-19T21:43:36.760+0000 7f3443fff700 1 -- 172.21.15.124:0/343422277 <== mgr.14162 v2:172.21.15.124:6800/2654071450 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+2728 (secure 0 0 0) 0x7f345c057f10 con 0x7f344405c300 2022-08-19T21:43:36.764 INFO:teuthology.orchestra.run.smithi124.stdout:NAME HOST PORTS STATUS REFRESHED AGE VERSION IMAGE ID CONTAINER ID 2022-08-19T21:43:36.764 INFO:teuthology.orchestra.run.smithi124.stdout:alertmanager.smithi124 smithi124 *:9093,9094 running (3m) 16s ago 4m 0.20.0 0881eb8f169f 11f1d2920b71 2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:crash.smithi124 smithi124 running (4m) 16s ago 4m 16.2.4 8d91d370c2b8 717b26481548 2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:crash.smithi138 smithi138 running (3m) 19s ago 3m 16.2.4 8d91d370c2b8 979a5d1dac14 2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:grafana.smithi124 smithi124 *:3000 running (3m) 16s ago 4m 6.7.4 557c83e11646 15bd9e7ccae6 2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi124.dvbrra smithi124 running (28s) 16s ago 28s 16.2.4 8d91d370c2b8 3b8d2b3a0d7c 2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi124.vduioy smithi124 running (33s) 16s ago 32s 16.2.4 8d91d370c2b8 f667e53107a1 2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi138.ekgwds smithi138 running (31s) 19s ago 30s 16.2.4 8d91d370c2b8 b601f18ef5b9 2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi138.kxgbaf smithi138 running (35s) 19s ago 35s 16.2.4 8d91d370c2b8 272a9b77d8f9 2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mgr.smithi124.pdzpkv smithi124 *:9283 running (5m) 16s ago 5m 16.2.4 8d91d370c2b8 5e030cba4b77 2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mgr.smithi138.iazeci smithi138 *:8443,9283 running (3m) 19s ago 3m 16.2.4 8d91d370c2b8 b53c1efe18c1 2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:mon.smithi124 smithi124 running (5m) 16s ago 5m 16.2.4 8d91d370c2b8 b6fde5c9302e 2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:mon.smithi138 smithi138 running (3m) 19s ago 3m 16.2.4 8d91d370c2b8 820d1ed4e18a 2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:node-exporter.smithi124 smithi124 *:9100 running (4m) 16s ago 4m 0.18.1 e5a616e4b9cf a45067e8714e 2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:node-exporter.smithi138 smithi138 *:9100 running (3m) 19s ago 3m 0.18.1 e5a616e4b9cf 21ba865f8a0f 2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:osd.0 smithi124 running (3m) 16s ago 3m 16.2.4 8d91d370c2b8 c10bc4caeb1b 2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.1 smithi124 running (2m) 16s ago 2m 16.2.4 8d91d370c2b8 b75a42c340ec 2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.2 smithi124 running (2m) 16s ago 2m 16.2.4 8d91d370c2b8 24f778d4d61b 2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.3 smithi138 running (2m) 19s ago 2m 16.2.4 8d91d370c2b8 749e01c0fd84 2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.4 smithi138 running (2m) 19s ago 2m 16.2.4 8d91d370c2b8 0676f04ba553 2022-08-19T21:43:36.769 INFO:teuthology.orchestra.run.smithi124.stdout:osd.5 smithi138 running (109s) 19s ago 108s 16.2.4 8d91d370c2b8 06ce538c727f 2022-08-19T21:43:36.769 INFO:teuthology.orchestra.run.smithi124.stdout:prometheus.smithi124 smithi124 *:9095 running (3m) 16s ago 4m 2.18.1 de242295e225 02898ed8a2a5 ... ... 2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout:{ 2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout: "mon": { 2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 2 2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout: }, 2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout: "mgr": { 2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 2 2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout: }, 2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout: "osd": { 2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 6 2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout: }, 2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: "mds": { 2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 4 2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: }, 2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: "overall": { 2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout: "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 14 2022-08-19T21:43:37.198 INFO:teuthology.orchestra.run.smithi124.stdout: } 2022-08-19T21:43:37.198 INFO:teuthology.orchestra.run.smithi124.stdout:}
Image is getting pulled from quay :
2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout:{ 2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout: "target_image": "quay.ceph.io/ceph-ci/ceph:67571cd8e74ee004835e626e065891d2f91abffa", 2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout: "in_progress": true, 2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout: "services_complete": [], 2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout: "progress": "", 2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout: "message": "Doing first pull of quay.ceph.io/ceph-ci/ceph:67571cd8e74ee004835e626e065891d2f91abffa image" 2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout:}
Upgrade status at 2022-08-19T21:44:41 :
2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout:{ 2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout: "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:c3abbfab17590664cf1e876d5d9df481f898c57e041e218d4fc5ed50ef246bf2", 2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout: "in_progress": true, 2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout: "services_complete": [], 2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout: "progress": "0/21 ceph daemons upgraded", 2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout: "message": "Currently upgrading mgr daemons" 2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout:}
Mon detected "standby mgr started" at 2022-08-19T21:45:23, Why ?
2022-08-19T21:45:23.494 INFO:journalctl@ceph.mon.smithi124.smithi124.stdout:Aug 19 21:45:23 smithi124 ceph-2641a2e4-2007-11ed-8431-001a4aab830c-mon.smithi124[39035]: cluster 2022-08-19T21:45:22.453271+0000 mon.smithi124 (mon.0) 580 : cluster [DBG] Standby manager daemon smithi138.iazeci started
There was no standby mgr at 2022-08-19T21:45:22 which caused upgrade pause :
2022-08-19T21:45:22.591+0000 7f0703abe700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci container digest correct 2022-08-19T21:45:22.591+0000 7f0703abe700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci not deployed by correct version 2022-08-19T21:45:22.592+0000 7f0703abe700 0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
There was no beacon sent by standby mgr for more than 33 seconds
2022-08-19T21:44:41.334+0000 7fcd105d5700 10 mgr operator() config_callback: osd_pool_default_pg_autoscale_mode : 2022-08-19T21:44:42.369+0000 7fcd0d5cf700 10 mgr tick tick 2022-08-19T21:44:42.369+0000 7fcd0d5cf700 20 mgr send_beacon standby 2022-08-19T21:44:42.369+0000 7fcd0d5cf700 10 mgr send_beacon sending beacon as gid 14198 2022-08-19T21:44:42.369+0000 7fcd0d5cf700 1 -- 172.21.15.138:0/1169015240 --> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] -- mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10 -- 0x564f5b7c5680 con 0x564f53f50c00 ... <no logs till 2022-08-19T21:45:15>
standby mgr closed mon socket at 2022-08-19T21:45:15 :
2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 msgr2=0x559e57769e00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down 2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 0x559e57769e00 secure :-1 s=READY pgs=42 cs=0 l=1 rev1=1 rx=0x559e5771c180 tx=0x559e577d07e0).stop 2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 shutdown_connections 2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] conn(0x559e576dd000 0x559e5776a300 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rev1=1 rx=0 tx=0).stop 2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 0x559e57769e00 unknown :-1 s=CLOSED pgs=42 cs=0 l=1 rev1=1 rx=0 tx=0).stop 2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 >> 172.21.15.138:0/2446933073 conn(0x559e576dc800 msgr2=0x559e583ee000 unknown :-1 s=STATE_NONE l=0).mark_down 2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 shutdown_connections 2022-08-19T21:45:15.317+0000 7fd92d4ff680 1 -- 172.21.15.138:0/2446933073 wait complete
and restarted at 2022-08-19T21:45:15 :
2022-08-19T21:45:15.317+0000 7fd92d4ff680 0 set uid:gid to 167:167 (ceph:ceph) 2022-08-19T21:45:15.317+0000 7fd92d4ff680 0 ceph version 16.2.10-668-g67571cd8 (67571cd8e74ee004835e626e065891d2f91abffa) pacific (stable), process ceph-mgr, pid 7 2022-08-19T21:45:15.317+0000 7fd92d4ff680 0 pidfile_write: ignore empty --pid-file 2022-08-19T21:45:15.318+0000 7fd92d4ff680 1 Processor -- start 2022-08-19T21:45:15.318+0000 7fd92d4ff680 1 -- start start 2022-08-19T21:45:15.318+0000 7fd92d4ff680 4 mgr init Registered monc callback
From mon logs, it looks like standby mgr closed the socket connection 2022-08-19T21:44:42 :
2022-08-19T21:44:42.371+0000 7f5bc2229700 10 mon.smithi124@0(leader).paxosservice(mgr 1..17) dispatch 0x55bf3fba2280 mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10 from mgr.? 172.21.15.138:0/1169015240 con 0x55bf3e73bc00 2022-08-19T21:44:42.371+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 no_reply to mgr.? 172.21.15.138:0/1169015240 mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10 2022-08-19T21:44:42.371+0000 7f5bc2229700 20 is_capable service=mgr command= exec addr 172.21.15.138:0/1169015240 on cap allow profile mgr 2022-08-19T21:44:42.900+0000 7f5bc6a32700 1 -- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 msgr2=0x55bf3d8eb700 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 36 2022-08-19T21:44:42.900+0000 7f5bc6a32700 1 -- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 msgr2=0x55bf3d8eb700 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed 2022-08-19T21:44:42.900+0000 7f5bc6a32700 1 --2- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 0x55bf3d8eb700 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 rx=0x55bf3e8dd680 tx=0x55bf3e9c0900).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted) 2022-08-19T21:44:42.900+0000 7f5bc6a32700 1 --2- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 0x55bf3d8eb700 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 rx=0x55bf3e8dd680 tx=0x55bf3e9c0900).stop 2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 ms_handle_reset 0x55bf3e73bc00 172.21.15.138:0/1169015240 2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 reset/close on session mgr.? 172.21.15.138:0/1169015240 2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 remove_session 0x55bf3efe0000 mgr.? 172.21.15.138:0/1169015240 features 0x3f01cfb9fffdffff
Not sure what could be the reason for standby mgr to close the socket connection with mon here as there are no debug_ms logs for standby mgr.
This is the function (in Pacific) that's checking for an upgrade.
src/pybind/mgr/cephadm/upgrade.py
[...]
We are likely hitting old tracker https://tracker.ceph.com/issues/54419 here for pacific release though the changes related to PR#45361 which fixes tracker#54419 has been backported to pacific in backport PR#45716 (https://github.com/ceph/ceph/pull/45716).
I tried to reproduce the issue 5 times with the same testcase running in --interactive-on-error mode but issue was not reproducible. We will need to revisit this bug again if reproducible on pacific.
Updated by Prashant D over 1 year ago
It looks like corner case to me where standby mgr somehow closing the connection with mon during mgr upgrade which causes this issue. We likely need to increase timeout for the testcase to detect the standby mgr to continue with mgr upgrade.
Updated by Laura Flores over 1 year ago
/a/amathuri-2022-09-08_14:19:23-rados-wip-yuri5-testing-2022-09-06-1334-pacific-distro-default-smithi/7020752
Updated by Laura Flores over 1 year ago
- Translation missing: en.field_tag_list set to test-failure
Updated by Laura Flores over 1 year ago
- Backport set to pacific
/a/yuriw-2022-09-20_17:39:55-rados-wip-yuri5-testing-2022-09-19-1007-pacific-distro-default-smithi/7039447
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-03-13_19:57:13-rados-wip-yuri6-testing-2023-03-12-0918-pacific-distro-default-smithi/7205777
Updated by Laura Flores 12 months ago
/a/lflores-2023-05-03_19:12:23-rados-wip-yuri5-testing-2023-04-25-0837-pacific-distro-default-smithi/7261585
Updated by Venky Shankar 7 months ago
- Related to Bug #54419: `ceph orch upgrade start` seems to never reach completion added