Bug #62625
[cephadm] sends wrong platform image to new host in mixed cluster
Status: open
Description
I have a cephadm-managed cluster of 8 Raspberry Pi arm64 hosts and fail to add a new amd64 host: the new host gets the arm64 Ceph image from the cephadm cluster.
Am I doing something wrong, or is this a bug?
The starting point is a cluster running the current quay.io/ceph/ceph:v17 image:
root@rpi-230:/# ceph orch ps
NAME                   HOST     PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
alertmanager.rpi-230   rpi-230  *:9093,9094  running (20h)  15s ago    4w   21.6M    -        0.23.0   44a71f29f42b  6d8ae44964a3
crash.rpi-230          rpi-230               running (20h)  15s ago    4w   6597k    -        17.2.6   57fbebda1a97  a5c535dfabfc
crash.rpi-231          rpi-231               running (3d)   6m ago     4w   1404k    -        17.2.6   57fbebda1a97  9c728442c29a
crash.rpi-232          rpi-232               running (3d)   11s ago    3w   1396k    -        17.2.6   57fbebda1a97  7ea3704373dc
crash.rpi-233          rpi-233               running (3d)   96s ago    3w   1409k    -        17.2.6   57fbebda1a97  0a576e5231cf
crash.rpi-234          rpi-234               running (20h)  6m ago     2w   6597k    -        17.2.6   57fbebda1a97  759c0af92220
crash.rpi-235          rpi-235               running (3d)   7s ago     2w   1422k    -        17.2.6   57fbebda1a97  1a52ea801ebc
crash.rpi-236          rpi-236               running (3d)   10m ago    12d  1404k    -        17.2.6   57fbebda1a97  02c7a2aa3c83
crash.rpi-237          rpi-237               running (4h)   14s ago    5d   1413k    -        17.2.6   57fbebda1a97  d4a4b722597c
grafana.rpi-230        rpi-230  *:3000       running (20h)  15s ago    4w   51.2M    -        8.3.5    046209d1c628  29d121a4384a
mgr.rpi-230.zbxkes     rpi-230  *:8443,9283  running (20h)  15s ago    4w   624M     -        17.2.6   57fbebda1a97  bfc0fe15a129
mgr.rpi-234.pdstqj     rpi-234  *:8443,9283  running (20h)  6m ago     4d   435M     -        17.2.6   57fbebda1a97  350e2a6e8311
mgr.rpi-237.hpwnva     rpi-237  *:8443,9283  running (4h)   14s ago    3d   44.9M    -        17.2.6   57fbebda1a97  7f7694a403d5
mon.rpi-230            rpi-230               running (20h)  15s ago    4w   465M     2048M    17.2.6   57fbebda1a97  070c50abc00c
mon.rpi-234            rpi-234               running (20h)  6m ago     6d   454M     2048M    17.2.6   57fbebda1a97  218c88023f19
mon.rpi-236            rpi-236               running (3d)   10m ago    3d   79.2M    2048M    17.2.6   57fbebda1a97  52e89a42bdcf
node-exporter.rpi-230  rpi-230  *:9100       running (20h)  15s ago    4w   16.4M    -        1.3.1    bb203ba967a8  67665e0d1566
node-exporter.rpi-231  rpi-231  *:9100       running (2d)   6m ago     4w   13.0M    -        1.3.1    bb203ba967a8  99bb60f5ba98
node-exporter.rpi-232  rpi-232  *:9100       running (2d)   11s ago    3w   15.7M    -        1.3.1    bb203ba967a8  29f2fa4e05c6
node-exporter.rpi-233  rpi-233  *:9100       running (2d)   96s ago    3w   13.9M    -        1.3.1    bb203ba967a8  6d076336d074
node-exporter.rpi-234  rpi-234  *:9100       running (20h)  6m ago     2w   16.8M    -        1.3.1    bb203ba967a8  707cf4bd09c9
node-exporter.rpi-235  rpi-235  *:9100       running (2d)   7s ago     2w   13.2M    -        1.3.1    bb203ba967a8  5445a2b2da70
node-exporter.rpi-236  rpi-236  *:9100       running (2d)   10m ago    12d  13.2M    -        1.3.1    bb203ba967a8  58f3d098817a
node-exporter.rpi-237  rpi-237  *:9100       running (4h)   14s ago    5d   14.1M    -        1.3.1    bb203ba967a8  55535407d4b8
osd.0                  rpi-230               running (20h)  15s ago    3w   1687M    4096M    17.2.6   57fbebda1a97  5bf81cc9d2b7
osd.1                  rpi-235               running (8h)   7s ago     3d   502M     4096M    17.2.6   57fbebda1a97  fafbcba1db58
osd.2                  rpi-231               running (18h)  6m ago     3w   237M     4096M    17.2.6   57fbebda1a97  87574a61f57a
osd.3                  rpi-237               running (4h)   14s ago    3d   499M     4096M    17.2.6   57fbebda1a97  f93631975dda
osd.4                  rpi-231               running (18h)  6m ago     3w   283M     4096M    17.2.6   57fbebda1a97  108d54376319
osd.5                  rpi-233               running (22h)  96s ago    2w   271M     4096M    17.2.6   57fbebda1a97  89c6dd95e87b
osd.8                  rpi-233               running (8h)   96s ago    7d   289M     4096M    17.2.6   57fbebda1a97  95f187a1eea9
osd.9                  rpi-234               running (20h)  6m ago     2w   1928M    4096M    17.2.6   57fbebda1a97  53b3a296e6b5
osd.11                 rpi-236               running (51m)  10m ago    7d   432M     4096M    17.2.6   57fbebda1a97  d57e1233c7a2
osd.12                 rpi-232               running (25m)  11s ago    11d  459M     4096M    17.2.6   57fbebda1a97  b59afb9ed360
prometheus.rpi-230     rpi-230  *:9095       running (20h)  15s ago    4w   133M     -        2.33.4   49058af74c32  2bb58e1efaf7
The new host xeon-238 has the following local images cached:
root@xeon-238:~# podman images
REPOSITORY                        TAG     IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph                 v17     99cefc773578  2 days ago     1.29 GB
quay.io/prometheus/node-exporter  v1.3.1  1dbe0e931976  21 months ago  22.3 MB
Now I try to add the amd64 host xeon-238 with:
root@rpi-230:/# ceph orch host add xeon-238 192.168.3.238
Added host 'xeon-238' with addr '192.168.3.238'
The cluster log shows the following message:
8/29/23 12:35:13 PM [ERR] Failed while placing crash.xeon-238 on xeon-238: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-5e689148-2f27-11ee-9d7e-b827ebbe7ce9-crash-xeon-238
/usr/bin/podman: stderr Error: inspecting object: no such container ceph-5e689148-2f27-11ee-9d7e-b827ebbe7ce9-crash-xeon-238
Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-5e689148-2f27-11ee-9d7e-b827ebbe7ce9-crash.xeon-238
/usr/bin/podman: stderr Error: inspecting object: no such container ceph-5e689148-2f27-11ee-9d7e-b827ebbe7ce9-crash.xeon-238
Deploy daemon crash.xeon-238 ...
Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c -e NODE_NAME=xeon-238 -e CEPH_USE_RANDOM_NONCE=1 quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c -c %u %g /var/lib/ceph
stat: stderr WARNING: image platform ({arm64 linux [] }) does not match the expected platform ({amd64 linux [] })
stat: stderr ERROR (catatonit:2): failed to exec pid1: Exec format error
ERROR: Failed to extract uid/gid for path /var/lib/ceph: Failed command: /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c -e NODE_NAME=xeon-238 -e CEPH_USE_RANDOM_NONCE=1 quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c -c %u %g /var/lib/ceph: WARNING: image platform ({arm64 linux [] }) does not match the expected platform ({amd64 linux [] })
ERROR (catatonit:2): failed to exec pid1: Exec format error
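The digest in that podman run command is the arm64 entry of the v17 manifest list (see the manifest inspect output further down), and a digest reference pins exactly one platform-specific manifest, so podman on the amd64 host pulls the arm64 image and only warns about the mismatch. A minimal way to reproduce this by hand, assuming direct access to quay.io (digest copied from the log above):

# pulling by digest pins this specific (arm64) manifest entry, even on an
# amd64 host; podman only prints a platform-mismatch warning
podman pull quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c

# pulling by tag lets podman resolve the manifest list to the host platform
podman pull quay.io/ceph/ceph:v17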
After this, xeon-238 has the following images cached:
root@xeon-238:~# podman images
REPOSITORY                        TAG     IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph                 v17     99cefc773578  2 days ago     1.29 GB
quay.io/ceph/ceph                 <none>  57fbebda1a97  6 weeks ago    1.2 GB
quay.io/prometheus/node-exporter  v1.3.1  1dbe0e931976  21 months ago  22.3 MB
The 57fb image seems to be the arm64 image instead of the 99ce amd64 image:
root@xeon-238:~# podman manifest inspect quay.io/ceph/ceph:v17
{
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "size": 743,
            "digest": "sha256:7aca73f4708ae6898efdf06dcde8cbea01a3a551a5772293f0d0f32f8da1fccb",
            "platform": {
                "architecture": "amd64",
                "os": "linux"
            }
        },
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "size": 743,
            "digest": "sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c",
            "platform": {
                "architecture": "arm64",
                "os": "linux",
                "variant": "v8"
            }
        }
    ]
}
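The digest cephadm passed to podman on xeon-238 matches the arm64 entry above. To double-check, the architecture recorded in each cached image can be read back with standard podman inspect templating; the first command should report linux/arm64 and the second linux/amd64:

# OS/architecture baked into each cached image's config (IDs from 'podman images')
podman image inspect --format '{{.Os}}/{{.Architecture}}' 57fbebda1a97
podman image inspect --format '{{.Os}}/{{.Architecture}}' 99cefc773578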
Updated by John Mulligan 8 months ago
I don't know if Ceph supports mixed-architecture clusters; I'm pretty certain it's not tested. Assuming this isn't a production cluster, you may want to try:
ceph config set mgr mgr/cephadm/use_repo_digest false
Then make sure the incorrect images are purged from the x86_64 node, and retry adding it to the cluster. If this helps, then the problem is probably due to how cephadm tries to use a consistent image across the cluster by converting the image and tag to a digest reference.
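Roughly, the whole sequence could look like this. This is only an untested sketch that uses the image ID from your listing; removing and re-adding the host is just my assumption about the simplest way to retry:

# on a cluster node: stop converting image tags to digest references
ceph config set mgr mgr/cephadm/use_repo_digest false

# on xeon-238: purge the wrongly pulled arm64 image
podman rmi 57fbebda1a97

# on a cluster node: remove and re-add the host so cephadm deploys again
# (ceph orch host rm may need --force if daemons were partially deployed)
ceph orch host rm xeon-238
ceph orch host add xeon-238 192.168.3.238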
Updated by Steffen Schulze 8 months ago
Thank you, that was the hint that helped me.
After changing the global container_image from a digest to a tag, I was able to add the amd64 host without any problems:
$ ceph config set mgr mgr/cephadm/use_repo_digest false
$ ceph config set global container_image quay.io/ceph/ceph:v17
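For the record, the effective values can be read back to verify that the cluster now hands out a tag instead of a digest:

$ ceph config get mgr mgr/cephadm/use_repo_digest
$ ceph config get global container_image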
It will be interesting to see how the next upgrade goes, but at least I understand the underlying mechanism better now.
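If I understand the mechanism correctly, an upgrade by tag should again let each host resolve its own platform from the manifest list. An untested sketch; the target version is only an example:

# start a tag-based upgrade and watch its progress
$ ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.7
$ ceph orch upgrade status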