Bug #53693 (closed)

ceph orch upgrade start is getting stuck in gibba cluster

Added by Vikhyat Umrao over 2 years ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee: Adam King
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

- The current Ceph version:

# ceph versions
{
    "mon": {
        "ceph version 17.0.0-9475-g8ea352e9 (8ea352e994feffca1bfd357a20c491df01db91a9) quincy (dev)": 5
    },
    "mgr": {
        "ceph version 17.0.0-9475-g8ea352e9 (8ea352e994feffca1bfd357a20c491df01db91a9) quincy (dev)": 2
    },
    "osd": {
        "ceph version 17.0.0-9475-g8ea352e9 (8ea352e994feffca1bfd357a20c491df01db91a9) quincy (dev)": 970
    },
    "mds": {
        "ceph version 17.0.0-9475-g8ea352e9 (8ea352e994feffca1bfd357a20c491df01db91a9) quincy (dev)": 2
    },
    "overall": {
        "ceph version 17.0.0-9475-g8ea352e9 (8ea352e994feffca1bfd357a20c491df01db91a9) quincy (dev)": 979
    }
}

- The version we were trying to upgrade to:

{
    "needs_update": {
        "crash.gibba001": {
            "current_id": "f79fcb826d512859ef4914712095ea7ee02622fc213f5c39ab7b2ec468965efd",
            "current_name": "quay.ceph.io/ceph-ci/ceph@sha256:14b1ea54031bea23a37c589a02be794dca9c5a0807116ffef655bea631f9a62e",
            "current_version": "17.0.0-9475-g8ea352e9" 
        },
        "crash.gibba002": {
            "current_id": "f79fcb826d512859ef4914712095ea7ee02622fc213f5c39ab7b2ec468965efd",
            "current_name": "quay.ceph.io/ceph-ci/ceph@sha256:14b1ea54031bea23a37c589a02be794dca9c5a0807116ffef655bea631f9a62e",
            "current_version": "17.0.0-9475-g8ea352e9" 
        },

........
........

    },
    "target_digest": "quay.ceph.io/ceph-ci/ceph@sha256:465e18548d5a9e1155bd093dfaa894e3cbc8f5b2e5a3d22b22c73a7979664155",
    "target_id": "9081735aa97cbfd10601ab1fc5fcaed6c8b41c2b22517b73c297aab304e5ffdd",
    "target_name": "quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa",
    "target_version": "ceph version 17.0.0-9718-g4ff72306 (4ff723061fc15c803dcf6556d02f56bdf56de5fa) quincy (dev)",
    "up_to_date": []
}
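
For reference, a listing like the one above is what the orchestrator's upgrade pre-check produces; it can be regenerated with (using the same target image as below):

# ceph orch upgrade check --image quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa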

- Upgrade start and status with debug logging enabled:

[root@gibba001 ~]# ceph config set mgr mgr/cephadm/log_level debug

[root@gibba001 ~]# ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa
Initiating upgrade to quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa

[root@gibba001 ~]# ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa",
    "in_progress": true,
    "services_complete": [],
    "progress": "",
}
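
To follow what the cephadm module is doing while the upgrade sits in this state, the cephadm channel of the cluster log can be watched directly (a sketch using the standard cephadm troubleshooting commands):

# ceph config set mgr mgr/cephadm/log_to_cluster_level debug
# ceph -W cephadm --watch-debug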

- Ceph MGR Logs:

2021-12-21T21:34:47.490+0000 7fd28c7cb700  0 log_channel(audit) log [DBG] : from='client.17948814 -' entity='client.admin' cmd=[{"prefix": "orch upgrade start", "image": "quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa", "target": ["mon-mgr", ""]}]: dispatch

2021-12-21T21:34:47.492+0000 7fd28cfcc700  0 [cephadm INFO root] Upgrade: Started with target quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa

2021-12-21T21:34:47.492+0000 7fd28cfcc700  0 log_channel(cephadm) log [INF] : Upgrade: Started with target quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa

2021-12-21T21:34:47.492+0000 7fd28cfcc700  0 [progress INFO root] update: starting ev 668dc33f-3fca-4bd9-9ca7-0b926137fd71 (Upgrade to quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa)

- The debug logs show only the following, nothing that appears related to the upgrade:

2021-12-21T21:34:48.142+0000 7fd286f00700  0 [progress INFO root] Processing OSDMap change 44389..44389
2021-12-21T21:34:50.813+0000 7fd25c16f700  0 [cephadm DEBUG root] Refreshed host gibba026 daemons (28)
2021-12-21T21:34:50.819+0000 7fd25d171700  0 [cephadm DEBUG root] Refreshed host gibba027 daemons (28)
2021-12-21T21:34:50.847+0000 7fd28b7c9700  0 log_channel(cluster) log [DBG] : pgmap v98: 65553 pgs: 1 active+clean+scrubbing+deep, 65552 active+clean; 992 GiB data, 4.1 TiB used, 8.8 TiB / 13 TiB avail
2021-12-21T21:34:51.075+0000 7fd25d171700  0 [cephadm DEBUG root] Received up-to-date metadata from agent on host gibba027.
2021-12-21T21:34:51.077+0000 7fd25c16f700  0 [cephadm DEBUG root] Received up-to-date metadata from agent on host gibba026.
2021-12-21T21:34:51.083+0000 7fd25a96c700  0 [cephadm DEBUG root] Refreshed host gibba023 daemons (28)
2021-12-21T21:34:51.237+0000 7fd25a96c700  0 [cephadm DEBUG root] Received up-to-date metadata from agent on host gibba023.
2021-12-21T21:34:51.483+0000 7fd25996a700  0 [cephadm DEBUG root] Refreshed host gibba030 daemons (28)
2021-12-21T21:34:51.585+0000 7fd25a16b700  0 [cephadm DEBUG root] Refreshed host gibba029 daemons (28)
2021-12-21T21:34:51.604+0000 7fd25996a700  0 [cephadm DEBUG root] Received up-to-date metadata from agent on host gibba030.
2021-12-21T21:34:51.696+0000 7fd25a16b700  0 [cephadm DEBUG root] Received up-to-date metadata from agent on host gibba029.
2021-12-21T21:34:51.841+0000 7fd25d972700  0 [cephadm DEBUG root] Refreshed host gibba031 daemons (28)
2021-12-21T21:34:51.954+0000 7fd259169700  0 [cephadm DEBUG root] Refreshed host gibba032 daemons (28)
2021-12-21T21:34:51.996+0000 7fd25d972700  0 [cephadm DEBUG root] Received up-to-date metadata from agent on host gibba031.
2021-12-21T21:34:52.087+0000 7fd259169700  0 [cephadm DEBUG root] Received up-to-date metadata from agent on host gibba032.
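
A quick way to confirm that nothing upgrade-related is being logged is to filter the recent cephadm cluster-log entries (a sketch; assumes the debug cluster-log level set as above):

# ceph log last 1000 debug cephadm | grep -i upgrade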

- Ceph status:

# ceph -s
  cluster:
    id:     182eef00-53b5-11ec-84d3-3cecef3d8fb8
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum gibba001,gibba002,gibba004,gibba005,gibba006 (age 3h)
    mgr: gibba001.zptzqf(active, since 7m), standbys: gibba002.veobjs
    mds: 1/1 daemons up, 1 standby
    osd: 1073 osds: 970 up (since 14m), 970 in (since 23h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 65553 pgs
    objects: 230.34M objects, 992 GiB
    usage:   4.1 TiB used, 8.8 TiB / 13 TiB avail
    pgs:     65553 active+clean

  progress:
    Upgrade to quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa (0s)
      [............................] 

- The upgrade progress bar in ceph status stays stuck as shown below; we have given the upgrade more than 15 hours to move forward, with no luck:

 progress:
    Upgrade to quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa (0s)
      [............................] 
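
When the progress bar sits at zero like this, the usual next steps are to re-inspect the orchestrator and progress-module state and, if necessary, fail the mgr over so the cephadm serve loop restarts (a sketch of standard commands, listed for reference rather than as steps that resolved this issue):

# ceph orch upgrade status     (confirm in_progress and target_image)
# ceph progress                (list active progress events)
# ceph mgr fail                (fail over to a standby mgr)
# ceph orch upgrade stop       (abort the upgrade entirely, if needed)
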
#1

Updated by Vikhyat Umrao over 2 years ago

We have another tiny (3-node) cluster where the same command was tried, and it worked there.
The main difference is that this tiny cluster was upgraded from Pacific 16.2.0 to the latest 17.0.0-9718-g4ff72306.

- The versions after the upgrade:

[root@dell-per630-8 log]# ceph versions
{
    "mon": {
        "ceph version 17.0.0-9718-g4ff72306 (4ff723061fc15c803dcf6556d02f56bdf56de5fa) quincy (dev)": 3
    },
    "mgr": {
        "ceph version 17.0.0-9718-g4ff72306 (4ff723061fc15c803dcf6556d02f56bdf56de5fa) quincy (dev)": 3
    },
    "osd": {
        "ceph version 17.0.0-9718-g4ff72306 (4ff723061fc15c803dcf6556d02f56bdf56de5fa) quincy (dev)": 3
    },
    "mds": {},
    "rgw": {
        "ceph version 17.0.0-9718-g4ff72306 (4ff723061fc15c803dcf6556d02f56bdf56de5fa) quincy (dev)": 3
    },
    "overall": {
        "ceph version 17.0.0-9718-g4ff72306 (4ff723061fc15c803dcf6556d02f56bdf56de5fa) quincy (dev)": 12
    }
}

- MGR Logs from this cluster:


Dec 20 19:12:27 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: audit 2021-12-21T00:12:26.876320+0000 mgr.dell-per630-12.jxdrni (mgr.44105) 256089 : audit [DBG] from='client.80304 -' entity='client.admin' cmd=[{"prefix": "orch upgrade start", "image": "quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa", "target": ["mon-mgr", ""]}]: dispatch

Dec 20 19:12:27 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: Upgrade: Started with target quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa

 "Here in this cluster as we can see it is moving forward" 

Dec 20 19:12:27 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: 256091 : cephadm [INF] Upgrade: First pull of quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa

Dec 20 19:14:10 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:14:06.303449+0000 mgr.dell-per630-12.jxdrni (mgr.44105) 256142 : cephadm [INF] Upgrade: Target is version 17.0.0-9718-g4ff72306 (unknown)
Dec 20 19:14:10 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:14:06.303558+0000 mgr.dell-per630-12.jxdrni (mgr.44105) 256143 : cephadm [INF] Upgrade: Target container is quay.ceph.io/ceph-ci/ceph@sha256:465e18548d5a9e1155bd093dfaa894e3cbc8f5b2e5a3d22b22c73a7979664155, digests ['quay.ceph.io/ceph-ci/ceph@sha256:465e18548d5a9e1155bd093dfaa894e3cbc8f5b2e5a3d22b22c73a7979664155']
Dec 20 19:14:10 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:14:06.306022+0000 mgr.dell-per630-12.jxdrni (mgr.44105) 256144 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.dell-per630-12.jxdrni)
Dec 20 19:14:10 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:14:06.742211+0000 mgr.dell-per630-12.jxdrni (mgr.44105) 256145 : cephadm [INF] Upgrade: Pulling quay.ceph.io/ceph-ci/ceph@sha256:465e18548d5a9e1155bd093dfaa894e3cbc8f5b2e5a3d22b22c73a7979664155 on dell-per630-10.gsslab.pnq2.redhat.com
Dec 20 19:15:45 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: ) 256194 : cephadm [INF] Upgrade: Updating mgr.dell-per630-10.awdoqx
Dec 20 19:16:30 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:16:28.915524+0000 mgr.dell-per630-12.jxdrni (mgr.44105) 256219 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.dell-per630-12.jxdrni)
Dec 20 19:16:33 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:16:30.885835+0000 mgr.dell-per630-12.jxdrni (mgr.44105) 256221 : cephadm [INF] Upgrade: Updating mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa
Dec 20 19:16:47 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:16:45.308629+0000 mgr.dell-per630-12.jxdrni (mgr.44105) 256230 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.dell-per630-12.jxdrni)
Dec 20 19:17:14 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:17:14.552+0000 7f77890c2700  0 [cephadm INFO cephadm.upgrade] Upgrade: Need to upgrade myself (mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa)
Dec 20 19:17:14 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:17:14.552+0000 7f77890c2700  0 log_channel(cephadm) log [INF] : Upgrade: Need to upgrade myself (mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa)
Dec 20 19:17:14 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:17:14.553+0000 7f77890c2700  0 [progress INFO root] update: starting ev 6791834e-60e2-4fcf-a2e7-d722750f5ef9 (Upgrade to 17.0.0-9718-g4ff72306)
Dec 20 19:17:15 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:17:15.201+0000 7f77890c2700  0 [cephadm INFO cephadm.upgrade] Upgrade: Pulling quay.ceph.io/ceph-ci/ceph@sha256:465e18548d5a9e1155bd093dfaa894e3cbc8f5b2e5a3d22b22c73a7979664155 on dell-per630-12.gsslab.pnq2.redhat.com
Dec 20 19:17:15 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:17:15.201+0000 7f77890c2700  0 log_channel(cephadm) log [INF] : Upgrade: Pulling quay.ceph.io/ceph-ci/ceph@sha256:465e18548d5a9e1155bd093dfaa894e3cbc8f5b2e5a3d22b22c73a7979664155 on dell-per630-12.gsslab.pnq2.redhat.com
Dec 20 19:17:17 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: Upgrade: Need to upgrade myself (mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa)
Dec 20 19:17:17 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]:  mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa (mgr.80994) 25 : cephadm [INF] Upgrade: Pulling quay.ceph.io/ceph-ci/ceph@sha256:465e18548d5a9e1155bd093dfaa894e3cbc8f5b2e5a3d22b22c73a7979664155 on dell-per630-12.gsslab.pnq2.redhat.com
Dec 20 19:18:49 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:18:49.194+0000 7f77890c2700  0 [cephadm INFO cephadm.upgrade] Upgrade: Updating mgr.dell-per630-12.jxdrni
Dec 20 19:18:49 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:18:49.194+0000 7f77890c2700  0 log_channel(cephadm) log [INF] : Upgrade: Updating mgr.dell-per630-12.jxdrni
Dec 20 19:18:54 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa (mgr.80994) 73 : cephadm [INF] Upgrade: Updating mgr.dell-per630-12.jxdrni
Dec 20 19:20:04 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:20:04.449+0000 7f77890c2700  0 [cephadm INFO cephadm.upgrade] Upgrade: Need to upgrade myself (mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa)
Dec 20 19:20:04 dell-per630-8.gsslab.pnq2.redhat.com conmon[1580233]: debug 2021-12-21T00:20:04.449+0000 7f77890c2700  0 log_channel(cephadm) log [INF] : Upgrade: Need to upgrade myself (mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa)
Dec 20 19:20:05 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]:  Upgrade: Need to upgrade myself (mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa)
Dec 20 19:20:37 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:20:35.606995+0000 mgr.dell-per630-10.awdoqx (mgr.55702) 27 : cephadm [INF] Upgrade: Need to upgrade myself (mgr.dell-per630-10.awdoqx)
Dec 20 19:21:04 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: +0000 mgr.dell-per630-12.jxdrni (mgr.56399) 24 : cephadm [INF] Upgrade: Updating mgr.dell-per630-10.awdoqx
Dec 20 19:21:22 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:21:22.713656+0000 mgr.dell-per630-12.jxdrni (mgr.56399) 35 : cephadm [INF] Upgrade: Updating mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa
Dec 20 19:21:44 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: mgr.dell-per630-12.jxdrni (mgr.56399) 47 : cephadm [INF] Upgrade: Setting container_image for all mgr
Dec 20 19:21:44 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]:  : cephadm [INF] Upgrade: It appears safe to stop mon.dell-per630-10
Dec 20 19:21:46 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: [INF] Upgrade: Updating mon.dell-per630-10
Dec 20 19:22:05 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: [INF] Upgrade: It appears safe to stop mon.dell-per630-12
Dec 20 19:22:07 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: 56399) 64 : cephadm [INF] Upgrade: Updating mon.dell-per630-12
Dec 20 19:22:40 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: cephadm 2021-12-21T00:22:39.109370+0000 mgr.dell-per630-12.jxdrni (mgr.56399) 82 : cephadm [INF] Upgrade: It appears safe to stop mon.dell-per630-8.gsslab.pnq2.redhat.com
Dec 20 19:22:43 dell-per630-8.gsslab.pnq2.redhat.com conmon[10762]: 84 : cephadm [INF] Upgrade: Updating mon.dell-per630-8.gsslab.pnq2.redhat.com
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Need to upgrade myself (mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa)
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Pulling quay.ceph.io/ceph-ci/ceph@sha256:465e18548d5a9e1155bd093dfaa894e3cbc8f5b2e5a3d22b22c73a7979664155 on dell-per630-12.gsslab.pnq2.redhat.com
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating mgr.dell-per630-12.jxdrni
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Need to upgrade myself (mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa)
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Need to upgrade myself (mgr.dell-per630-10.awdoqx)
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating mgr.dell-per630-10.awdoqx
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating mgr.dell-per630-8.gsslab.pnq2.redhat.com.aitioa
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all mgr
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: It appears safe to stop mon.dell-per630-10
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating mon.dell-per630-10
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: It appears safe to stop mon.dell-per630-12
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating mon.dell-per630-12
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: It appears safe to stop mon.dell-per630-8.gsslab.pnq2.redhat.com
Dec 20 19:22:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating mon.dell-per630-8.gsslab.pnq2.redhat.com
Dec 20 19:24:48 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all mon
Dec 20 19:24:51 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating crash.dell-per630-10
Dec 20 19:25:09 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating crash.dell-per630-12
Dec 20 19:25:32 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating crash.dell-per630-8
Dec 20 19:25:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all crash
Dec 20 19:25:55 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: osd.1 is safe to restart
Dec 20 19:25:58 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating osd.1
Dec 20 19:26:24 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: unsafe to stop osd(s) at this time (145 PGs are or would become offline)
Dec 20 19:26:39 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: unsafe to stop osd(s) at this time (145 PGs are or would become offline)
Dec 20 19:26:54 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: unsafe to stop osd(s) at this time (98 PGs are or would become offline)
Dec 20 19:27:09 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: unsafe to stop osd(s) at this time (4 PGs are or would become offline)
Dec 20 19:27:26 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: osd.0 is safe to restart
Dec 20 19:27:28 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating osd.0
Dec 20 19:27:56 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: unsafe to stop osd(s) at this time (145 PGs are or would become offline)
Dec 20 19:28:11 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: unsafe to stop osd(s) at this time (145 PGs are or would become offline)
Dec 20 19:28:26 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: unsafe to stop osd(s) at this time (45 PGs are or would become offline)
Dec 20 19:28:41 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: osd.2 is safe to restart
Dec 20 19:28:43 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating osd.2
Dec 20 19:29:18 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all osd
Dec 20 19:29:19 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting require_osd_release to 17 quincy
Dec 20 19:29:19 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all mds
Dec 20 19:29:23 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating rgw.rgws.dell-per630-10.yrlfkq
Dec 20 19:29:52 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating rgw.rgws.dell-per630-12.xlodjk
Dec 20 19:30:17 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating rgw.rgws.dell-per630-8.uhlueb
Dec 20 19:30:41 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all rgw
Dec 20 19:30:43 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all rbd-mirror
Dec 20 19:30:43 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all iscsi
Dec 20 19:30:43 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Setting container_image for all nfs
Dec 20 19:30:46 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating node-exporter.dell-per630-10
Dec 20 19:31:17 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating node-exporter.dell-per630-12
Dec 20 19:31:48 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating node-exporter.dell-per630-8
Dec 20 19:32:24 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating prometheus.dell-per630-8
Dec 20 19:33:41 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating alertmanager.dell-per630-8
Dec 20 19:34:35 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Updating grafana.dell-per630-8
Dec 20 19:36:31 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Finalizing container_image settings
Dec 20 19:36:37 dell-per630-8.gsslab.pnq2.redhat.com ceph-mon[1597242]: Upgrade: Complete!
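
The "unsafe to stop osd(s) at this time" messages above appear to come from cephadm consulting the monitors' ok-to-stop check before restarting each OSD, waiting until PGs have recovered enough; the same check can be run by hand (a sketch):

# ceph osd ok-to-stop osd.1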

#2

Updated by Vikhyat Umrao over 2 years ago

  • Status changed from New to Closed

Discussion from the #ceph-gibba channel:


<AdKing> vikhyat message from before while you were not in channel: "<AdKing> I had a look last night. It seems like the serve loop runs a little bit and then just stops. I tried turning the agent off in case it was the cause but it didn't help. I think the next thing to try is removing gibba045 from the cluster (there seemed to be issues happening on that
<AdKing> host) and doing a mgr failover and see what happens but I haven't gotten to it yet." 
<vikhyat> thank you for the update 
<vikhyat> sure let us know if anything needed from core team side
<AdKing> vikhyat can't pull the image the gibba cluster was supposed to be upgrading too anymore. Do you have a new candidate image?
<AdKing> podman pull quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa
<AdKing> Trying to pull quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa...
<AdKing> WARN[0000] failed, retrying in 1s ... (1/3). Error: initializing source docker://quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa: reading manifest 4ff723061fc15c803dcf6556d02f56bdf56de5fa in quay.ceph.io/ceph-ci/ceph: unknown: Tag 4ff723061fc15c803dcf6556d02f56bdf56de5fa was deleted or has expired. To pull, revive via time
<AdKing> machine 
<AdKing> WARN[0001] failed, retrying in 1s ... (2/3). Error: initializing source docker://quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa: reading manifest 4ff723061fc15c803dcf6556d02f56bdf56de5fa in quay.ceph.io/ceph-ci/ceph: unknown: Tag 4ff723061fc15c803dcf6556d02f56bdf56de5fa was deleted or has expired. To pull, revive via time
<AdKing> machine 
<AdKing> WARN[0002] failed, retrying in 1s ... (3/3). Error: initializing source docker://quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa: reading manifest 4ff723061fc15c803dcf6556d02f56bdf56de5fa in quay.ceph.io/ceph-ci/ceph: unknown: Tag 4ff723061fc15c803dcf6556d02f56bdf56de5fa was deleted or has expired. To pull, revive via time
<AdKing> machine 
<AdKing> Error: initializing source docker://quay.ceph.io/ceph-ci/ceph:4ff723061fc15c803dcf6556d02f56bdf56de5fa: reading manifest 4ff723061fc15c803dcf6556d02f56bdf56de5fa in quay.ceph.io/ceph-ci/ceph: unknown: Tag 4ff723061fc15c803dcf6556d02f56bdf56de5fa was deleted or has expired. To pull, revive via time machine
<vikhyat> AdKing: let me check with neha 
<vikhyat> neha: any new image which we want to use to upgrade gibba
<neha> any latest image based on master should be fine
<neha> how about https://quay.ceph.io/repository/ceph-ci/ceph?tag=24df4a2e83c57788f5638aad0e88a738d80878ef&tab=tags
<vikhyat> sure
<vikhyat> AdKing: quay.ceph.io/ceph-ci/ceph:24df4a2e83c57788f5638aad0e88a738d80878ef

<neha> can we try to upgrade the cluster now? or has somebody already tried? 
<vikhyat> neha: looks like AdKing knows the some of it why it is happening but not getting time to look into it so I was asking him if we can take help from Paul Cruzner - Adam is saying he will check with him
<vikhyat> issue is still not fixed 
<vikhyat> I hope you saw yesterday note from Adam
<neha> ok cool, thanks vikhyat and AdKing!
<neha> awesome, the mgr is now running f2313edc67106699e6ab91f50fa91928e579f7ac, fyi yaarit
<yaarit> neha: cool, the report is generated well now :-) 
<AdKing> I removed gibba045 from the cluster and the upgrade seems to be going along. All the mgr and mons are done and it's going through the crash daemons (which is slow as it repeatedly has to pull the image as it upgrades the first daemon on each host). If you ignore the stray daemon/host warnings from gibba045 it looks okay for now.
<neha> AdKing: sounds great, thanks again!
<neha> yaarit: yay!
<vikhyat> woot! thanks AdKing 

<vikhyat> looks like now it is working on OSD's
<vikhyat>     "osd": {
<vikhyat>         "ceph version 17.0.0-9475-g8ea352e9 (8ea352e994feffca1bfd357a20c491df01db91a9) quincy (dev)": 783,
<vikhyat>         "ceph version 17.0.0-9964-gf2313edc (f2313edc67106699e6ab91f50fa91928e579f7ac) quincy (dev)": 216

<vikhyat> AdKing: https://tracker.ceph.com/issues/53693 maybe you want to update this one and close out?
<AdKing> vikhyat should we fully close it? The upgrade itself was able to go but we still don't fully understand what the root cause was that was blocking it (other than that removing gibba045 allowed it to work)
<vikhyat> AdKing: let me do one thing add all the chat discussion in the tracker and for now close it and we can reopen if we hit the issue again? as this gibba045 has been problematic node - not sure now it will help anything to troubleshoot from upgrade point of view as the upgrade is already completed?
<AdKing> vikhyat that works for me. I expect you will hit it again if you re add gibba045 and then try another upgrade
<vikhyat> AdKing: I think redeploying the node will help?
<AdKing> possibly, worth a try
<vikhyat> yep this is what we were thinking 
<vikhyat> because that OSD node has 90+ OSD's which are not being used
<vikhyat> so could be some stale stuff causing issue
<vikhyat> thanks again let me update the tracker 
#3

Updated by Vikhyat Umrao over 2 years ago

  • Assignee set to Adam King
#4

Updated by Laura Flores 3 months ago

Could be similar. Something to keep an eye on...

/a/lflores-2024-01-10_23:43:40-rados-wip-yuri11-testing-2024-01-10-1124-pacific-distro-default-smithi/7512978

2024-01-11T01:08:50.699 INFO:journalctl@ceph.mon.a.smithi053.stdout:Jan 11 01:08:50 smithi053 ceph-7c4bee00-b01c-11ee-95ab-87774f69a715-mon-a[62314]: cephadm 2024-01-11T01:08:49.245290+0000 mgr.y (mgr.24856) 72 : cephadm [INF] Upgrade: unsafe to stop osd(s) at this time (17 PGs are or would become offline)
...
2024-01-11T01:09:15.548 INFO:journalctl@ceph.mgr.y.smithi053.stdout:Jan 11 01:09:15 smithi053 ceph-7c4bee00-b01c-11ee-95ab-87774f69a715-mgr-y[60279]: debug 2024-01-11T01:09:15.456+0000 7ff653bfc700 -1 mgr.server reply reply (16) Device or resource busy unsafe to stop osd(s) at this time (10 PGs are or would become offline)
...
2024-01-11T01:09:15.978 INFO:journalctl@ceph.mon.b.smithi107.stdout:Jan 11 01:09:15 smithi107 ceph-7c4bee00-b01c-11ee-95ab-87774f69a715-mon-b[55805]: cephadm 2024-01-11T01:09:15.457956+0000 mgr.y (mgr.24856) 92 : cephadm [INF] Upgrade: unsafe to stop osd(s) at this time (10 PGs are or would become offline)
2024-01-11T01:09:16.244 DEBUG:teuthology.orchestra.run:got remote process result: 1
2024-01-11T01:09:16.245 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_cd45576300487d997e5a85abed65500b9f5d143b/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_cd45576300487d997e5a85abed65500b9f5d143b/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_0a3eac1a626f3d1b1c90d29d6e12b1470fcdd190/qa/tasks/cephadm.py", line 1058, in shell
    _shell(ctx, cluster_name, remote,
  File "/home/teuthworker/src/git.ceph.com_ceph-c_0a3eac1a626f3d1b1c90d29d6e12b1470fcdd190/qa/tasks/cephadm.py", line 34, in _shell
    return remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_cd45576300487d997e5a85abed65500b9f5d143b/teuthology/orchestra/remote.py", line 523, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_cd45576300487d997e5a85abed65500b9f5d143b/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_cd45576300487d997e5a85abed65500b9f5d143b/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_cd45576300487d997e5a85abed65500b9f5d143b/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on smithi053 with status 1: 'sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v15.2.0 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 7c4bee00-b01c-11ee-95ab-87774f69a715 -e sha1=0a3eac1a626f3d1b1c90d29d6e12b1470fcdd190 -- bash -c \'ceph versions | jq -e \'"\'"\'.overall | length == 1\'"\'"\'\''
2024-01-11T01:09:16.468 ERROR:teuthology.util.sentry: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=c6f2842c55fe4a99a46f7d511792899c
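
The failing command at the end of the traceback is the test's convergence assertion: after the upgrade, every daemon must report the same version, i.e. effectively:

# ceph versions | jq -e '.overall | length == 1'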
