Bug #48442

cephadm: upgrade loops on mixed x86_64/arm64 cluster

Added by Bryan Stillwell over 3 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Low
Assignee:
-
Category:
cephadm
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I tried to use 'ceph orch upgrade start --ceph-version 15.2.7' to upgrade my home cluster from 15.2.5 to 15.2.7, it got stuck in a loop because I have a mixture of x86_64 and arm64 nodes:

2020-12-01T16:47:26.761950-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:26.769581-0700 mgr.aladdin.liknom [INF] Upgrade: All mgr daemons are up to date.
2020-12-01T16:47:26.770096-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mon daemons...
2020-12-01T16:47:28.800426-0700 mgr.aladdin.liknom [INF] Upgrade: All mon daemons are up to date.
2020-12-01T16:47:28.800878-0700 mgr.aladdin.liknom [INF] Upgrade: Checking crash daemons...
2020-12-01T16:47:28.851819-0700 mgr.aladdin.liknom [INF] Upgrade: Setting container_image for all crash...
2020-12-01T16:47:28.855595-0700 mgr.aladdin.liknom [INF] Upgrade: All crash daemons are up to date.
2020-12-01T16:47:28.856283-0700 mgr.aladdin.liknom [INF] Upgrade: Checking osd daemons...
2020-12-01T16:47:31.348345-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on mandalaybay
2020-12-01T16:47:35.311065-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on mandalaybay got new image 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959 (not 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e), restarting
2020-12-01T16:47:35.534893-0700 mgr.aladdin.liknom [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.7 with id 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959
2020-12-01T16:47:35.546444-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:35.547185-0700 mgr.aladdin.liknom [INF] Upgrade: Need to upgrade myself (mgr.aladdin.liknom)
2020-12-01T16:47:37.506337-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on ether
2020-12-01T16:47:40.770290-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on ether got new image 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e (not 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959), restarting
2020-12-01T16:47:41.172402-0700 mgr.aladdin.liknom [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.7 with id 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e
2020-12-01T16:47:41.226550-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:41.230932-0700 mgr.aladdin.liknom [INF] Upgrade: All mgr daemons are up to date.
2020-12-01T16:47:41.231887-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mon daemons...
2020-12-01T16:47:43.179844-0700 mgr.aladdin.liknom [INF] Upgrade: All mon daemons are up to date.
2020-12-01T16:47:43.180305-0700 mgr.aladdin.liknom [INF] Upgrade: Checking crash daemons...
2020-12-01T16:47:43.187481-0700 mgr.aladdin.liknom [INF] Upgrade: Setting container_image for all crash...
2020-12-01T16:47:43.191821-0700 mgr.aladdin.liknom [INF] Upgrade: All crash daemons are up to date.
2020-12-01T16:47:43.192290-0700 mgr.aladdin.liknom [INF] Upgrade: Checking osd daemons...
2020-12-01T16:47:45.692126-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on mandalaybay
2020-12-01T16:47:50.679789-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on mandalaybay got new image 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959 (not 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e), restarting

There are only 14 OSDs in this cluster: 12 of them on x86_64 nodes and 2 on an 8GB Raspberry Pi 4 (named mandalaybay).

This appears to be where the problem is:

https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/upgrade.py#L297

The image_id and target_id won't ever match because they are different on each architecture.
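
To illustrate the failure mode, here is a minimal, self-contained sketch of the comparison described above (not the actual upgrade.py code); the host names and image IDs are taken from the log above. Each architecture resolves the v15.2.7 tag to a different image ID, so a single target_id can never match every host's pulled image and the target keeps flipping back and forth:

X86_64_ID = "2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e"
ARM64_ID = "9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959"

def pull_image_id(host: str) -> str:
    # Stand-in for pulling docker.io/ceph/ceph:v15.2.7 on a host: the same tag
    # resolves to a different image ID on each architecture.
    return ARM64_ID if host == "mandalaybay" else X86_64_ID

target_id = pull_image_id("ether")  # target recorded from an x86_64 host
for host in ["mandalaybay", "ether", "mandalaybay"]:
    image_id = pull_image_id(host)
    if image_id != target_id:
        # The check near upgrade.py#L297: the freshly pulled ID differs from the
        # recorded target, so the upgrade restarts with the new ID as target,
        # and the next host of the other architecture flips it back again.
        print(f"pull on {host} got {image_id[:12]} (not {target_id[:12]}), restarting")
        target_id = image_id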

Actions #1

Updated by Sebastian Wagner over 3 years ago

Hm. Not sure about this. Even if we fix this, how are we supposed to make sure we're not introducing any regressions without actually having mixed-architecture clusters in Sepia?

The workaround is to manually upgrade the daemons, circumventing upgrade.py.

Actions #2

Updated by Bryan Stillwell over 3 years ago

How often do the images for the same release change, though? Couldn't checking that all the images are v15.2.7 be good enough?

Actions #3

Updated by Sebastian Wagner over 3 years ago

Bryan Stillwell wrote:

How often do the images for the same release change, though? Couldn't checking that all the images are v15.2.7 be good enough?

That's not going to work. We'll have to deal with many container images per Ceph version, e.g. if there is a security update for any library installed in the container image.

Actions #4

Updated by Sebastian Wagner over 3 years ago

  • Project changed from Ceph to Orchestrator
  • Category set to cephadm
Actions #5

Updated by Bryan Stillwell over 3 years ago

How would I manually upgrade the remaining daemons? I'm not finding anything in the documentation about how to do this. I tried reading through the code and it seems like redeploying the service is the way to do it, but when I try the following:

ceph orch redeploy osd.0

it doesn't appear to do anything.

Actions #6

Updated by Bryan Stillwell over 3 years ago

I figured out how to do the upgrade manually. In case anyone else runs into this problem (and finds this bug), you can upgrade individual OSDs with:

ceph orch daemon redeploy osd.0 ceph/ceph:v15.2.8
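
For clusters with many OSDs, the same workaround can be scripted; here is a rough sketch that shells out to the same commands, assuming 'ceph orch ps --format json' reports daemon_type and daemon_id fields (the field names are an assumption; verify the JSON output on your release first):

import json
import subprocess

TARGET_IMAGE = "ceph/ceph:v15.2.8"

# Ask the orchestrator for all known daemons and redeploy just the OSDs.
daemons = json.loads(subprocess.check_output(["ceph", "orch", "ps", "--format", "json"]))
for d in daemons:
    if d.get("daemon_type") == "osd":
        name = "osd.{}".format(d["daemon_id"])  # field names are an assumption
        subprocess.check_call(["ceph", "orch", "daemon", "redeploy", name, TARGET_IMAGE])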

Actions #7

Updated by Sebastian Wagner about 3 years ago

  • Description updated (diff)
  • Priority changed from Normal to Low

Right now, this is somewhat low on our priority list. But in Pacific, this should be improved by using repo_digest for the upgrade instead of container IDs.
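
Roughly, the idea is that the repo digest of a multi-arch tag refers to the manifest list, so it is the same string on x86_64 and arm64 hosts even though the per-architecture image IDs differ. A small illustrative sketch (assumed behavior, not the actual Pacific implementation; the digest value is a placeholder):

TARGET_DIGEST = "docker.io/ceph/ceph@sha256:<manifest-list-digest>"  # placeholder value

pulled = {
    # host: (per-architecture image ID, repo digest); the IDs differ, the digests match.
    "ether": ("2bc420ddb175", TARGET_DIGEST),
    "mandalaybay": ("9a0677fecc08", TARGET_DIGEST),
}

for host, (image_id, repo_digest) in pulled.items():
    # Comparing the digest instead of the image ID converges on mixed clusters.
    assert repo_digest == TARGET_DIGEST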

Actions #8

Updated by Sebastian Wagner almost 3 years ago

  • Status changed from New to Need More Info

This might work now. We're now using repo_digest, which should work across architectures.

Actions #9

Updated by Bryan Stillwell almost 3 years ago

Which versions should this work in? Octopus v15.2.12 and Pacific v16.2.4? Or just Pacific?

Actions #10

Updated by Bryan Stillwell almost 3 years ago

This appears to be fixed, but I need to wait for v16.2.5 to come out to confirm completely because of this bug, which is causing illegal instructions:

https://tracker.ceph.com/issues/50579

Actions #11

Updated by Redouane Kachach Elhichou about 2 years ago

Using repo_digest seems to fix this (old) issue. Closing due to long-standing inactivity.

Actions #12

Updated by Redouane Kachach Elhichou about 2 years ago

  • Status changed from Need More Info to Closed