Project

General

Profile

Actions

Bug #48442

closed

cephadm: upgrade loops on mixed x86_64/arm64 cluster

Added by Bryan Stillwell over 3 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Low
Assignee:
-
Category:
cephadm
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I tried to use 'ceph orch upgrade start --ceph-version 15.2.7' to upgrade my home cluster from 15.2.5 to 15.2.7, it got stuck in a loop because I have a mixture of x86_64 and arm64 nodes:

2020-12-01T16:47:26.761950-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:26.769581-0700 mgr.aladdin.liknom [INF] Upgrade: All mgr daemons are up to date.
2020-12-01T16:47:26.770096-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mon daemons...
2020-12-01T16:47:28.800426-0700 mgr.aladdin.liknom [INF] Upgrade: All mon daemons are up to date.
2020-12-01T16:47:28.800878-0700 mgr.aladdin.liknom [INF] Upgrade: Checking crash daemons...
2020-12-01T16:47:28.851819-0700 mgr.aladdin.liknom [INF] Upgrade: Setting container_image for all crash...
2020-12-01T16:47:28.855595-0700 mgr.aladdin.liknom [INF] Upgrade: All crash daemons are up to date.
2020-12-01T16:47:28.856283-0700 mgr.aladdin.liknom [INF] Upgrade: Checking osd daemons...
2020-12-01T16:47:31.348345-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on mandalaybay
2020-12-01T16:47:35.311065-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on mandalaybay got new image 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959 (not 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e), restarting
2020-12-01T16:47:35.534893-0700 mgr.aladdin.liknom [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.7 with id 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959
2020-12-01T16:47:35.546444-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:35.547185-0700 mgr.aladdin.liknom [INF] Upgrade: Need to upgrade myself (mgr.aladdin.liknom)
2020-12-01T16:47:37.506337-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on ether
2020-12-01T16:47:40.770290-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on ether got new image 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e (not 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959), restarting
2020-12-01T16:47:41.172402-0700 mgr.aladdin.liknom [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.7 with id 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e
2020-12-01T16:47:41.226550-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons...
2020-12-01T16:47:41.230932-0700 mgr.aladdin.liknom [INF] Upgrade: All mgr daemons are up to date.
2020-12-01T16:47:41.231887-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mon daemons...
2020-12-01T16:47:43.179844-0700 mgr.aladdin.liknom [INF] Upgrade: All mon daemons are up to date.
2020-12-01T16:47:43.180305-0700 mgr.aladdin.liknom [INF] Upgrade: Checking crash daemons...
2020-12-01T16:47:43.187481-0700 mgr.aladdin.liknom [INF] Upgrade: Setting container_image for all crash...
2020-12-01T16:47:43.191821-0700 mgr.aladdin.liknom [INF] Upgrade: All crash daemons are up to date.
2020-12-01T16:47:43.192290-0700 mgr.aladdin.liknom [INF] Upgrade: Checking osd daemons...
2020-12-01T16:47:45.692126-0700 mgr.aladdin.liknom [INF] Upgrade: Pulling docker.io/ceph/ceph:v15.2.7 on mandalaybay
2020-12-01T16:47:50.679789-0700 mgr.aladdin.liknom [INF] Upgrade: image docker.io/ceph/ceph:v15.2.7 pull on mandalaybay got new image 9a0677fecc08d155a8e643b37c6e97d45c04747d9cb9455cafe0a7590d00b959 (not 2bc420ddb175bd1cf9031387948a8812d1bda9ef1180e429b4704e3c06bb943e), restarting

There are only 14 OSDs in this cluster, with 12 of them on x86_64 nodes, and 2 on a 8GB raspberry pi 4 (which is named mandalaybay).

This appears to be where the problem is:

https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/upgrade.py#L297

The image_id and target_id won't ever match because they are different on each architecture.

Actions

Also available in: Atom PDF