Bug #57255

rados/cephadm/mds_upgrade_sequence, pacific : cephadm [ERR] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon

Added by Prashant D over 1 year ago. Updated 12 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
cephadm
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

mgrmap from mon log:

2022-08-19T21:38:01.213+0000 7f749d751700  0 log_channel(cluster) log [DBG] : mgrmap e1: no daemons active
2022-08-19T21:38:04.885+0000 7f5bd8c0c700  0 log_channel(cluster) log [DBG] : mgrmap e1: no daemons active
2022-08-19T21:38:11.412+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e2: smithi124.pdzpkv(active, starting, since 0.00261735s)
2022-08-19T21:38:12.415+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e3: smithi124.pdzpkv(active, since 1.00527s)
2022-08-19T21:38:14.417+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e4: smithi124.pdzpkv(active, since 3s)
2022-08-19T21:38:19.420+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e5: smithi124.pdzpkv(active, starting, since 0.0531152s)
2022-08-19T21:38:20.430+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e6: smithi124.pdzpkv(active, since 1.0637s)
2022-08-19T21:38:21.433+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e7: smithi124.pdzpkv(active, since 2s)
2022-08-19T21:38:38.154+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e8: smithi124.pdzpkv(active, since 18s)
2022-08-19T21:38:43.104+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e9: smithi124.pdzpkv(active, starting, since 0.0530632s)
2022-08-19T21:38:44.106+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e10: smithi124.pdzpkv(active, since 1.05519s)
2022-08-19T21:38:45.976+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e11: smithi124.pdzpkv(active, since 2s)
2022-08-19T21:38:50.884+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e12: smithi124.pdzpkv(active, since 7s)
2022-08-19T21:38:55.945+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e13: smithi124.pdzpkv(active, starting, since 0.053293s)
2022-08-19T21:38:56.952+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e14: smithi124.pdzpkv(active, since 1.06024s)
2022-08-19T21:38:58.852+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e15: smithi124.pdzpkv(active, since 2s)
2022-08-19T21:39:02.758+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e16: smithi124.pdzpkv(active, since 6s)
2022-08-19T21:40:00.612+0000 7f5bc2229700  0 log_channel(cluster) log [DBG] : mgrmap e16: smithi124.pdzpkv(active, since 64s)
2022-08-19T21:40:00.669+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e17: smithi124.pdzpkv(active, since 64s), standbys: smithi138.iazeci
2022-08-19T21:45:14.976+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e18: smithi124.pdzpkv(active, since 6m)
2022-08-19T21:45:23.103+0000 7f5bc0a26700  0 log_channel(cluster) log [DBG] : mgrmap e19: smithi124.pdzpkv(active, since 6m), standbys: smithi138.iazeci

mon log :

2022-08-19T21:45:22.600+0000 7f5bc2229700 20 mon.smithi124@0(leader).mgrstat health checks:
{
    "UPGRADE_NO_STANDBY_MGR": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Upgrade: Need standby mgr daemon",
            "count": 1
        },
        "detail": [
            {
                "message": "The upgrade process needs to upgrade the mgr, but it needs at least one standby to proceed." 
            }
        ]
    }
}

2022-08-19T21:45:22.600+0000 7f5bc2229700 20 mon.smithi124@0(leader).mgrstat progress events:
{
    "dc77e5f1-4e84-448e-9ca0-a8322772fa2b": {
        "message": "Upgrade to 16.2.10-668-g67571cd8 (33s)\n      [............................] ",
        "progress": 0,
        "add_to_ceph_s": true
    }
}
...
2022-08-19T21:45:22.987+0000 7f5bc2229700 10 mon.smithi124@0(leader).log v297  logging 2022-08-19T21:45:22.592032+0000 mgr.smithi124.pdzpkv (mgr.14162) 288 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.smithi124.pdzpkv)
2022-08-19T21:45:22.987+0000 7f5bc2229700 10 mon.smithi124@0(leader).log v297  logging 2022-08-19T21:45:22.593440+0000 mgr.smithi124.pdzpkv (mgr.14162) 289 : cephadm [ERR] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon

https://pulpito.ceph.com/yuriw-2022-08-19_20:57:42-rados-wip-yuri6-testing-2022-08-19-0940-pacific-distro-default-smithi/6981184


Related issues (1 total: 0 open, 1 closed)

Related to Orchestrator - Bug #54419: `ceph orch upgrade start` seems to never reach completion (Duplicate; assignee: Adam King)

Actions #1

Updated by Prashant D over 1 year ago

  • Description updated (diff)
Actions #3

Updated by Laura Flores over 1 year ago

Based on the error message in the mgr log, it looks like there is a problem with the mgr version:

/a/yuriw-2022-08-19_20:57:42-rados-wip-yuri6-testing-202/log/2641a2e4-2007-11ed-8431-001a4aab830c/ceph-mgr.smithi124.pdzpkv.log.gz

2022-08-19T21:45:22.590+0000 7f070d053700 20 mgr ~Gil Destroying new thread state 0x55dac52f8000
2022-08-19T21:45:22.591+0000 7f06e227b700  0 [dashboard DEBUG notification_queue] processing queue: 1
2022-08-19T21:45:22.591+0000 7f0703abe700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci container digest correct
2022-08-19T21:45:22.591+0000 7f0703abe700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci not deployed by correct version
2022-08-19T21:45:22.592+0000 7f0703abe700  0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-08-19T21:45:22.592+0000 7f0703abe700 -1 log_channel(cephadm) log [ERR] : Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon

This is the function (in Pacific) that checks which daemons still need to be upgraded:
src/pybind/mgr/cephadm/upgrade.py

    def _detect_need_upgrade(self, daemons: List[DaemonDescription], target_digests: Optional[List[str]] = None) -> Tuple[bool, List[Tuple[DaemonDescription, bool]], List[Tuple[DaemonDescription, bool]], int]:
        # Take the list of daemons plus the target container digests and
        # check, for each daemon, whether its current container digests
        # match the targets. A daemon is considered upgraded to a given
        # container image only if its digests match the targets, so this
        # comparison tells us which daemons still need to be upgraded.
        need_upgrade_self = False
        need_upgrade: List[Tuple[DaemonDescription, bool]] = []
        need_upgrade_deployer: List[Tuple[DaemonDescription, bool]] = []
        done = 0
        if target_digests is None:
            target_digests = []
        for d in daemons:
            assert d.daemon_type is not None
            assert d.daemon_id is not None
            assert d.hostname is not None
            correct_digest = False
            if (any(d in target_digests for d in (d.container_image_digests or []))
                    or d.daemon_type in MONITORING_STACK_TYPES):
                logger.debug('daemon %s.%s container digest correct' % (
                    d.daemon_type, d.daemon_id))
                correct_digest = True
                if any(d in target_digests for d in (d.deployed_by or [])):
                    logger.debug('daemon %s.%s deployed by correct version' % (
                        d.daemon_type, d.daemon_id))
                    done += 1
                    continue

            if self.mgr.daemon_is_self(d.daemon_type, d.daemon_id):
                logger.info('Upgrade: Need to upgrade myself (mgr.%s)' %
                            self.mgr.get_mgr_id())
                need_upgrade_self = True
                continue

            if correct_digest:
                logger.debug('daemon %s.%s not deployed by correct version' % (
                    d.daemon_type, d.daemon_id))
                need_upgrade_deployer.append((d, True))
            else:
                logger.debug('daemon %s.%s not correct (%s, %s, %s)' % (
                    d.daemon_type, d.daemon_id,
                    d.container_image_name, d.container_image_digests, d.version))
                need_upgrade.append((d, False))

        return (need_upgrade_self, need_upgrade, need_upgrade_deployer, done)
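
The digest-matching logic above can be sketched in isolation. This is a simplified illustration (the `Daemon` record and `classify` helper are hypothetical, not cephadm code): a daemon only counts as "done" when both its running container digest and the digest of the mgr that deployed it match a target; a matching digest with a stale deployer yields "need_upgrade_deployer", which is exactly the state logged for mgr.smithi138.iazeci ("container digest correct" / "not deployed by correct version").

```python
from typing import List, NamedTuple, Optional


class Daemon(NamedTuple):
    # Hypothetical stand-in for DaemonDescription.
    name: str
    container_image_digests: Optional[List[str]]
    deployed_by: Optional[List[str]]


def classify(daemon: Daemon, target_digests: List[str]) -> str:
    """Mirror the upgrade classification: 'done' requires both the running
    container digest and the deployer's digest to match a target."""
    digest_ok = any(d in target_digests
                    for d in (daemon.container_image_digests or []))
    deployer_ok = any(d in target_digests
                      for d in (daemon.deployed_by or []))
    if digest_ok and deployer_ok:
        return "done"
    if digest_ok:
        # Right image, but deployed by an old mgr: redeploy needed.
        return "need_upgrade_deployer"
    return "need_upgrade"


targets = ["sha256:new"]
print(classify(Daemon("mgr.b", ["sha256:new"], ["sha256:old"]), targets))
# → need_upgrade_deployer
```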

Actions #4

Updated by Prashant D over 1 year ago

Laura Flores wrote:

Based on the error message in the mgr log, it looks like there is a problem with the mgr version:

/a/yuriw-2022-08-19_20:57:42-rados-wip-yuri6-testing-202/log/2641a2e4-2007-11ed-8431-001a4aab830c/ceph-mgr.smithi124.pdzpkv.log.gz
[...]

There is no issue with the mgr version; the problem is that there was no standby mgr in the mgrmap while the active mgr was being upgraded.

mgrmap during job run :

$ zcat ./remote/ubuntu@smithi124.front.sepia.ceph.com/log/2641a2e4-2007-11ed-8431-001a4aab830c/ceph-mgr.smithi124.pdzpkv.log.gz|grep got_mgr_map
2022-08-19T21:38:11.413+0000 7f2849fc7700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.00365193s)
2022-08-19T21:38:12.413+0000 7f2849fc7700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.00326s)
2022-08-19T21:38:19.421+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0544349s)
2022-08-19T21:38:20.428+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.06162s)
2022-08-19T21:38:21.431+0000 7f8fb401e700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s)
2022-08-19T21:38:43.105+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0547095s)
2022-08-19T21:38:44.104+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.05378s)
2022-08-19T21:38:45.975+0000 7f3b4fbdb700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s)
2022-08-19T21:38:55.946+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, starting, since 0.0546631s)
2022-08-19T21:38:56.950+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 1.05845s)
2022-08-19T21:38:58.851+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 2s)
2022-08-19T21:39:02.757+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6s)
2022-08-19T21:40:00.667+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 64s), standbys: smithi138.iazeci
2022-08-19T21:45:14.974+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6m)
2022-08-19T21:45:23.100+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 6m), standbys: smithi138.iazeci
2022-08-19T21:47:21.093+0000 7f0730c20700 10 mgr got_mgr_map smithi124.pdzpkv(active, since 8m), standbys: smithi138.iazeci

Let's track down why there was no standby mgr between 2022-08-19T21:45:14 and 2022-08-19T21:45:23.

Upgrade started at 2022-08-19T21:43:31

2022-08-19T21:43:31.906 DEBUG:teuthology.orchestra.run.smithi124:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v16.2.4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 2641a2e4-2007-11ed-8431-001a4aab830c -e sha1=67571cd8e74ee004835e626e065891d2f91abffa -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'

The mgr upgrade had not yet been initiated; both mgrs were still running at 2022-08-19T21:43:36 :

2022-08-19T21:43:36.763 INFO:teuthology.orchestra.run.smithi124.stderr:2022-08-19T21:43:36.760+0000 7f3443fff700  1 -- 172.21.15.124:0/343422277 <== mgr.14162 v2:172.21.15.124:6800/2654071450 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+2728 (secure 0 0 0) 0x7f345c057f10 con 0x7f344405c300
2022-08-19T21:43:36.764 INFO:teuthology.orchestra.run.smithi124.stdout:NAME                         HOST       PORTS        STATUS          REFRESHED  AGE   VERSION  IMAGE ID      CONTAINER ID
2022-08-19T21:43:36.764 INFO:teuthology.orchestra.run.smithi124.stdout:alertmanager.smithi124       smithi124  *:9093,9094  running (3m)    16s ago    4m    0.20.0   0881eb8f169f  11f1d2920b71
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:crash.smithi124              smithi124               running (4m)    16s ago    4m    16.2.4   8d91d370c2b8  717b26481548
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:crash.smithi138              smithi138               running (3m)    19s ago    3m    16.2.4   8d91d370c2b8  979a5d1dac14
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:grafana.smithi124            smithi124  *:3000       running (3m)    16s ago    4m    6.7.4    557c83e11646  15bd9e7ccae6
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi124.dvbrra  smithi124               running (28s)   16s ago    28s   16.2.4   8d91d370c2b8  3b8d2b3a0d7c
2022-08-19T21:43:36.765 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi124.vduioy  smithi124               running (33s)   16s ago    32s   16.2.4   8d91d370c2b8  f667e53107a1
2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi138.ekgwds  smithi138               running (31s)   19s ago    30s   16.2.4   8d91d370c2b8  b601f18ef5b9
2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mds.cephfs.smithi138.kxgbaf  smithi138               running (35s)   19s ago    35s   16.2.4   8d91d370c2b8  272a9b77d8f9
2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mgr.smithi124.pdzpkv         smithi124  *:9283       running (5m)    16s ago    5m    16.2.4   8d91d370c2b8  5e030cba4b77
2022-08-19T21:43:36.766 INFO:teuthology.orchestra.run.smithi124.stdout:mgr.smithi138.iazeci         smithi138  *:8443,9283  running (3m)    19s ago    3m    16.2.4   8d91d370c2b8  b53c1efe18c1
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:mon.smithi124                smithi124               running (5m)    16s ago    5m    16.2.4   8d91d370c2b8  b6fde5c9302e
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:mon.smithi138                smithi138               running (3m)    19s ago    3m    16.2.4   8d91d370c2b8  820d1ed4e18a
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:node-exporter.smithi124      smithi124  *:9100       running (4m)    16s ago    4m    0.18.1   e5a616e4b9cf  a45067e8714e
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:node-exporter.smithi138      smithi138  *:9100       running (3m)    19s ago    3m    0.18.1   e5a616e4b9cf  21ba865f8a0f
2022-08-19T21:43:36.767 INFO:teuthology.orchestra.run.smithi124.stdout:osd.0                        smithi124               running (3m)    16s ago    3m    16.2.4   8d91d370c2b8  c10bc4caeb1b
2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.1                        smithi124               running (2m)    16s ago    2m    16.2.4   8d91d370c2b8  b75a42c340ec
2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.2                        smithi124               running (2m)    16s ago    2m    16.2.4   8d91d370c2b8  24f778d4d61b
2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.3                        smithi138               running (2m)    19s ago    2m    16.2.4   8d91d370c2b8  749e01c0fd84
2022-08-19T21:43:36.768 INFO:teuthology.orchestra.run.smithi124.stdout:osd.4                        smithi138               running (2m)    19s ago    2m    16.2.4   8d91d370c2b8  0676f04ba553
2022-08-19T21:43:36.769 INFO:teuthology.orchestra.run.smithi124.stdout:osd.5                        smithi138               running (109s)  19s ago    108s  16.2.4   8d91d370c2b8  06ce538c727f
2022-08-19T21:43:36.769 INFO:teuthology.orchestra.run.smithi124.stdout:prometheus.smithi124         smithi124  *:9095       running (3m)    16s ago    4m    2.18.1   de242295e225  02898ed8a2a5
...
...
2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout:{
2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout:    "mon": {
2022-08-19T21:43:37.194 INFO:teuthology.orchestra.run.smithi124.stdout:        "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 2
2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout:    },
2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout:    "mgr": {
2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout:        "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 2
2022-08-19T21:43:37.195 INFO:teuthology.orchestra.run.smithi124.stdout:    },
2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout:    "osd": {
2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout:        "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 6
2022-08-19T21:43:37.196 INFO:teuthology.orchestra.run.smithi124.stdout:    },
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout:    "mds": {
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout:        "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 4
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout:    },
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout:    "overall": {
2022-08-19T21:43:37.197 INFO:teuthology.orchestra.run.smithi124.stdout:        "ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)": 14
2022-08-19T21:43:37.198 INFO:teuthology.orchestra.run.smithi124.stdout:    }
2022-08-19T21:43:37.198 INFO:teuthology.orchestra.run.smithi124.stdout:}

The image is being pulled from quay :

2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout:{
2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph:67571cd8e74ee004835e626e065891d2f91abffa",
2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout:    "in_progress": true,
2022-08-19T21:43:37.896 INFO:teuthology.orchestra.run.smithi124.stdout:    "services_complete": [],
2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout:    "progress": "",
2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout:    "message": "Doing first pull of quay.ceph.io/ceph-ci/ceph:67571cd8e74ee004835e626e065891d2f91abffa image" 
2022-08-19T21:43:37.897 INFO:teuthology.orchestra.run.smithi124.stdout:}

Upgrade status at 2022-08-19T21:44:41 :

2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout:{
2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:c3abbfab17590664cf1e876d5d9df481f898c57e041e218d4fc5ed50ef246bf2",
2022-08-19T21:44:41.590 INFO:teuthology.orchestra.run.smithi124.stdout:    "in_progress": true,
2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout:    "services_complete": [],
2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout:    "progress": "0/21 ceph daemons upgraded",
2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout:    "message": "Currently upgrading mgr daemons" 
2022-08-19T21:44:41.591 INFO:teuthology.orchestra.run.smithi124.stdout:}

The mon only detected "standby mgr started" at 2022-08-19T21:45:23. Why?

2022-08-19T21:45:23.494 INFO:journalctl@ceph.mon.smithi124.smithi124.stdout:Aug 19 21:45:23 smithi124 ceph-2641a2e4-2007-11ed-8431-001a4aab830c-mon.smithi124[39035]: cluster 2022-08-19T21:45:22.453271+0000 mon.smithi124 (mon.0) 580 : cluster [DBG] Standby manager daemon smithi138.iazeci started

There was no standby mgr at 2022-08-19T21:45:22, which caused the upgrade to pause :

2022-08-19T21:45:22.591+0000 7f0703abe700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci container digest correct
2022-08-19T21:45:22.591+0000 7f0703abe700  0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi138.iazeci not deployed by correct version
2022-08-19T21:45:22.592+0000 7f0703abe700  0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon

The standby mgr sent no beacon for more than 33 seconds :

2022-08-19T21:44:41.334+0000 7fcd105d5700 10 mgr operator() config_callback: osd_pool_default_pg_autoscale_mode :
2022-08-19T21:44:42.369+0000 7fcd0d5cf700 10 mgr tick tick
2022-08-19T21:44:42.369+0000 7fcd0d5cf700 20 mgr send_beacon standby
2022-08-19T21:44:42.369+0000 7fcd0d5cf700 10 mgr send_beacon sending beacon as gid 14198
2022-08-19T21:44:42.369+0000 7fcd0d5cf700  1 -- 172.21.15.138:0/1169015240 --> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] -- mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10 -- 0x564f5b7c5680 con 0x564f53f50c00
...
<no logs till 2022-08-19T21:45:15>
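
A 33-second beacon gap matters because the mons drop a mgr from the MgrMap once it misses beacons for longer than the beacon grace (`mon_mgr_beacon_grace`, 30 s by default). A minimal sketch of that liveness check (simplified; not the actual MgrMonitor code):

```python
import time

# Default mon_mgr_beacon_grace, in seconds.
MON_MGR_BEACON_GRACE = 30.0


class StandbyState:
    """Track the last beacon time for one (standby) mgr."""

    def __init__(self) -> None:
        self.last_beacon = time.monotonic()

    def on_beacon(self) -> None:
        self.last_beacon = time.monotonic()

    def is_alive(self, now: float) -> bool:
        # The mon removes a mgr from the MgrMap once it has gone longer
        # than the beacon grace without sending a beacon.
        return (now - self.last_beacon) <= MON_MGR_BEACON_GRACE


s = StandbyState()
print(s.is_alive(s.last_beacon + 33.0))  # a 33 s gap exceeds the grace
# → False
```

So the 33-second silence observed here is enough, on its own, to explain the standby disappearing from the mgrmap between 21:45:14 and 21:45:23.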

The standby mgr closed its mon socket at 2022-08-19T21:45:15 :

2022-08-19T21:45:15.317+0000 7fd92d4ff680  1 -- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 msgr2=0x559e57769e00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2022-08-19T21:45:15.317+0000 7fd92d4ff680  1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 0x559e57769e00 secure :-1 s=READY pgs=42 cs=0 l=1 rev1=1 rx=0x559e5771c180 tx=0x559e577d07e0).stop
2022-08-19T21:45:15.317+0000 7fd92d4ff680  1 -- 172.21.15.138:0/2446933073 shutdown_connections
2022-08-19T21:45:15.317+0000 7fd92d4ff680  1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] conn(0x559e576dd000 0x559e5776a300 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rev1=1 rx=0 tx=0).stop
2022-08-19T21:45:15.317+0000 7fd92d4ff680  1 --2- 172.21.15.138:0/2446933073 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x559e576dcc00 0x559e57769e00 unknown :-1 s=CLOSED pgs=42 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2022-08-19T21:45:15.317+0000 7fd92d4ff680  1 -- 172.21.15.138:0/2446933073 >> 172.21.15.138:0/2446933073 conn(0x559e576dc800 msgr2=0x559e583ee000 unknown :-1 s=STATE_NONE l=0).mark_down
2022-08-19T21:45:15.317+0000 7fd92d4ff680  1 -- 172.21.15.138:0/2446933073 shutdown_connections
2022-08-19T21:45:15.317+0000 7fd92d4ff680  1 -- 172.21.15.138:0/2446933073 wait complete

and restarted at 2022-08-19T21:45:15 :

2022-08-19T21:45:15.317+0000 7fd92d4ff680  0 set uid:gid to 167:167 (ceph:ceph)
2022-08-19T21:45:15.317+0000 7fd92d4ff680  0 ceph version 16.2.10-668-g67571cd8 (67571cd8e74ee004835e626e065891d2f91abffa) pacific (stable), process ceph-mgr, pid 7
2022-08-19T21:45:15.317+0000 7fd92d4ff680  0 pidfile_write: ignore empty --pid-file
2022-08-19T21:45:15.318+0000 7fd92d4ff680  1  Processor -- start
2022-08-19T21:45:15.318+0000 7fd92d4ff680  1 --  start start
2022-08-19T21:45:15.318+0000 7fd92d4ff680  4 mgr init Registered monc callback

From the mon logs, it looks like the standby mgr closed the socket connection at 2022-08-19T21:44:42 :

2022-08-19T21:44:42.371+0000 7f5bc2229700 10 mon.smithi124@0(leader).paxosservice(mgr 1..17) dispatch 0x55bf3fba2280 mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10 from mgr.? 172.21.15.138:0/1169015240 con 0x55bf3e73bc00
2022-08-19T21:44:42.371+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 no_reply to mgr.? 172.21.15.138:0/1169015240 mgrbeacon mgr.smithi138.iazeci(2641a2e4-2007-11ed-8431-001a4aab830c,14198, , 0) v10
2022-08-19T21:44:42.371+0000 7f5bc2229700 20 is_capable service=mgr command= exec addr 172.21.15.138:0/1169015240 on cap allow profile mgr
2022-08-19T21:44:42.900+0000 7f5bc6a32700  1 -- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 msgr2=0x55bf3d8eb700 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 36
2022-08-19T21:44:42.900+0000 7f5bc6a32700  1 -- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 msgr2=0x55bf3d8eb700 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2022-08-19T21:44:42.900+0000 7f5bc6a32700  1 --2- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 0x55bf3d8eb700 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 rx=0x55bf3e8dd680 tx=0x55bf3e9c0900).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2022-08-19T21:44:42.900+0000 7f5bc6a32700  1 --2- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] >> 172.21.15.138:0/1169015240 conn(0x55bf3e73bc00 0x55bf3d8eb700 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 rx=0x55bf3e8dd680 tx=0x55bf3e9c0900).stop
2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 ms_handle_reset 0x55bf3e73bc00 172.21.15.138:0/1169015240
2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 reset/close on session mgr.? 172.21.15.138:0/1169015240
2022-08-19T21:44:42.900+0000 7f5bc2229700 10 mon.smithi124@0(leader) e2 remove_session 0x55bf3efe0000 mgr.? 172.21.15.138:0/1169015240 features 0x3f01cfb9fffdffff

It's not clear why the standby mgr closed its socket connection with the mon here, as there are no debug_ms logs for the standby mgr.

This is the function (in Pacific) that checks which daemons still need to be upgraded.
src/pybind/mgr/cephadm/upgrade.py
[...]

We are likely hitting the old tracker https://tracker.ceph.com/issues/54419 here on the pacific release, even though the changes from PR#45361, which fixes tracker#54419, have been backported to pacific in PR#45716 (https://github.com/ceph/ceph/pull/45716).

I tried to reproduce the issue 5 times with the same testcase running in --interactive-on-error mode, but the issue was not reproducible. We will need to revisit this bug if it reproduces on pacific.

Actions #5

Updated by Prashant D over 1 year ago

This looks like a corner case to me: the standby mgr somehow closes its connection with the mon during the mgr upgrade, which causes this issue. We likely need to increase the timeout with which the upgrade detects the standby mgr so the mgr upgrade can continue.
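
One shape such a timeout could take (a hypothetical sketch, not the cephadm implementation: `wait_for_standby` and `has_standby` are made up for illustration) is to poll the standby check for a grace window before pausing the upgrade, so a standby that is momentarily restarting is not treated as missing:

```python
import time
from typing import Callable


def wait_for_standby(has_standby: Callable[[], bool],
                     timeout: float = 60.0,
                     interval: float = 5.0) -> bool:
    """Poll for a standby mgr instead of pausing the upgrade on the
    first negative check; return True as soon as a standby appears."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if has_standby():
            return True
        time.sleep(interval)
    # One final check at the deadline before giving up.
    return has_standby()


# A standby that reappears after one poll interval (e.g. restarting):
checks = iter([False, True])
print(wait_for_standby(lambda: next(checks), timeout=20.0, interval=0.01))
# → True
```

With something like this, the restart window seen here (beacon lost at 21:44:42, standby back at 21:45:23) would be ridden out instead of triggering UPGRADE_NO_STANDBY_MGR.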

Actions #6

Updated by Laura Flores over 1 year ago

/a/amathuri-2022-09-08_14:19:23-rados-wip-yuri5-testing-2022-09-06-1334-pacific-distro-default-smithi/7020752

Actions #7

Updated by Laura Flores over 1 year ago

  • Tags set to test-failure
Actions #8

Updated by Laura Flores over 1 year ago

  • Backport set to pacific

/a/yuriw-2022-09-20_17:39:55-rados-wip-yuri5-testing-2022-09-19-1007-pacific-distro-default-smithi/7039447

Actions #9

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-13_19:57:13-rados-wip-yuri6-testing-2023-03-12-0918-pacific-distro-default-smithi/7205777

Actions #10

Updated by Laura Flores 12 months ago

/a/lflores-2023-05-03_19:12:23-rados-wip-yuri5-testing-2023-04-25-0837-pacific-distro-default-smithi/7261585

Actions #11

Updated by Venky Shankar 7 months ago

  • Related to Bug #54419: `ceph orch upgrade start` seems to never reach completion added
