Bug #48442

cephadm: upgrade loops on mixed x86_64/arm64 cluster

Added by Bryan Stillwell over 3 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Low
Assignee:
-
Category:
cephadm
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I tried to use 'ceph orch upgrade start --ceph-version 15.2.7' to upgrade my home cluster from 15.2.5 to 15.2.7, it got stuck in a loop because I have a mixture of x86_64 and arm64 nodes:

2020-12-01T16:47:26.761950-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:26.769581-0700 mgr.aladdin.liknom [INF] Upgrade: All mgr daemons are up to date.
2020-12-01T16:47:26.770096-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mon daemons...
2020-12-01T16:47:28.800426-0700 mgr.aladdin.liknom [INF] Upgrade: All mon daemons are up to date.
2020-12-01T16:47:28.800878-0700 mgr.aladdin.liknom [INF] Upgrade: Checking crash daemons...
2020-12-01T16:47:28.851819-0700 mgr.aladdin.liknom [INF] Upgrade: Setting container_image for all crash...
2020-12-01T16:47:28.855595-0700 mgr.aladdin.liknom [INF] Upgrade: All crash daemons are up to date.
2020-12-01T16:47:28.856283-0700 mgr.aladdin.liknom [INF] Upgrade: Checking osd daemons...
2020-12-01T16:47:31.348345-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on mandalaybay
2020-12-01T16:47:35.311065-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on mandalaybay got new image 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959 (not 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e), restarting
2020-12-01T16:47:35.534893-0700 mgr.aladdin.liknom [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.7 with id 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959
2020-12-01T16:47:35.546444-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:35.547185-0700 mgr.aladdin.liknom [INF] Upgrade: Need to upgrade myself (mgr.aladdin.liknom)
2020-12-01T16:47:37.506337-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on ether
2020-12-01T16:47:40.770290-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on ether got new image 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e (not 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959), restarting
2020-12-01T16:47:41.172402-0700 mgr.aladdin.liknom [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.7 with id 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e
2020-12-01T16:47:41.226550-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:41.230932-0700 mgr.aladdin.liknom [INF] Upgrade: All mgr daemons are up to date.
2020-12-01T16:47:41.231887-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mon daemons...
2020-12-01T16:47:43.179844-0700 mgr.aladdin.liknom [INF] Upgrade: All mon daemons are up to date.
2020-12-01T16:47:43.180305-0700 mgr.aladdin.liknom [INF] Upgrade: Checking crash daemons...
2020-12-01T16:47:43.187481-0700 mgr.aladdin.liknom [INF] Upgrade: Setting container_image for all crash...
2020-12-01T16:47:43.191821-0700 mgr.aladdin.liknom [INF] Upgrade: All crash daemons are up to date.
2020-12-01T16:47:43.192290-0700 mgr.aladdin.liknom [INF] Upgrade: Checking osd daemons...
2020-12-01T16:47:45.692126-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on mandalaybay
2020-12-01T16:47:50.679789-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on mandalaybay got new image 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959 (not 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e), restarting

There are only 14 OSDs in this cluster: 12 of them on x86_64 nodes and 2 on an 8GB Raspberry Pi 4 (named mandalaybay).

This appears to be where the problem is:

https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/upgrade.py#L297

The image_id and target_id won't ever match because they are different on each architecture.
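
To illustrate the failure mode, here is a minimal, self-contained sketch of the comparison described above (not the actual upgrade.py code); the host names and image IDs are taken from the log above. Each architecture resolves the v15.2.7 tag to a different image ID, so a single target_id can never match every host's pulled image and the target keeps flipping back and forth:

X86_64_ID = "2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e"
ARM64_ID = "9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959"

def pull_image_id(host: str) -> str:
    # Stand-in for pulling docker.io/ceph/ceph:v15.2.7 on a host: the same tag
    # resolves to a different image ID on each architecture.
    return ARM64_ID if host == "mandalaybay" else X86_64_ID

target_id = pull_image_id("ether")  # target recorded from an x86_64 host
for host in ["mandalaybay", "ether", "mandalaybay"]:
    image_id = pull_image_id(host)
    if image_id != target_id:
        # The check near upgrade.py#L297: the freshly pulled ID differs from the
        # recorded target, so the upgrade restarts with the new ID as target,
        # and the next host of the other architecture flips it back again.
        print(f"pull on {host} got {image_id[:12]} (not {target_id[:12]}), restarting")
        target_id = image_id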

Actions #1

Updated by Sebastian Wagner over 3 years ago

Hm. Not sure about this. Even if we fix this, how are we supposed to make sure we're not introducing any regressions without actually having mixed-architecture clusters in Sepia?

The workaround is to manually upgrade the daemons, circumventing upgrade.py.

Actions #2

Updated by Bryan Stillwell over 3 years ago

How often do the images for the same release change, though? Couldn't checking that all the images are v15.2.7 be good enough?

Actions #3

Updated by Sebastian Wagner over 3 years ago

Bryan Stillwell wrote:

How often do the images for the same release change, though? Couldn't checking that all the images are v15.2.7 be good enough?

That's not going to work. We'll have to deal with many container images per Ceph version, e.g. if there is a security update for any library installed in the container image.

Actions #4

Updated by Sebastian Wagner over 3 years ago

  • Project changed from Ceph to Orchestrator
  • Category set to cephadm
Actions #5

Updated by Bryan Stillwell over 3 years ago

How would I manually upgrade the remaining daemons? I'm not finding anything in the documentation about how to do this. I tried reading through the code and it seems like redeploying the service is the way to do it, but when I try the following:

ceph orch redeploy osd.0

it doesn't appear to do anything.

Actions #6

Updated by Bryan Stillwell over 3 years ago

I figured out how to do the upgrade manually. In case anyone else runs into this problem (and finds this bug), you can upgrade individual OSDs with:

ceph orch daemon redeploy osd.0 ceph/ceph:v15.2.8
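
For clusters with many OSDs, the same workaround can be scripted; here is a rough sketch that shells out to the same commands, assuming 'ceph orch ps --format json' reports daemon_type and daemon_id fields (the field names are an assumption; verify the JSON output on your release first):

import json
import subprocess

TARGET_IMAGE = "ceph/ceph:v15.2.8"

# Ask the orchestrator for all known daemons and redeploy just the OSDs.
daemons = json.loads(subprocess.check_output(["ceph", "orch", "ps", "--format", "json"]))
for d in daemons:
    if d.get("daemon_type") == "osd":
        name = "osd.{}".format(d["daemon_id"])  # field names are an assumption
        subprocess.check_call(["ceph", "orch", "daemon", "redeploy", name, TARGET_IMAGE])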

Actions #7

Updated by Sebastian Wagner about 3 years ago

  • Description updated (diff)
  • Priority changed from Normal to Low

Right now, this is somewhat low on our priority list. But in Pacific, this should be improved by using repo_digest for the upgrade instead of container IDs.
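
Roughly, the idea is that the repo digest of a multi-arch tag refers to the manifest list, so it is the same string on x86_64 and arm64 hosts even though the per-architecture image IDs differ. A small illustrative sketch (assumed behavior, not the actual Pacific implementation; the digest value is a placeholder):

TARGET_DIGEST = "docker.io/ceph/ceph@sha256:<manifest-list-digest>"  # placeholder value

pulled = {
    # host: (per-architecture image ID, repo digest); the IDs differ, the digests match.
    "ether": ("2bc420ddb175", TARGET_DIGEST),
    "mandalaybay": ("9a0677fecc08", TARGET_DIGEST),
}

for host, (image_id, repo_digest) in pulled.items():
    # Comparing the digest instead of the image ID converges on mixed clusters.
    assert repo_digest == TARGET_DIGEST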

Actions #8

Updated by Sebastian Wagner almost 3 years ago

  • Status changed from New to Need More Info

This might work now. We're now using repo_digest, which should work across architectures.

Actions #9

Updated by Bryan Stillwell almost 3 years ago

Which versions should this work in? Octopus v15.2.12 and Pacific v16.2.4? Or just Pacific?

Actions #10

Updated by Bryan Stillwell almost 3 years ago

This appears to be fixed, but I need to wait for v16.2.5 to come out to confirm completely because of this bug, which is causing illegal instructions:

https://tracker.ceph.com/issues/50579

Actions #11

Updated by Redouane Kachach Elhichou about 2 years ago

Using repo_digest seems to fix this (old) issue. Closing due to long-standing inactivity.

Actions #12

Updated by Redouane Kachach Elhichou about 2 years ago

  • Status changed from Need More Info to Closed