Bug #62625
[cephadm] sends wrong platform image to new host in mixed cluster
Status: open
Description
I have a cephadm-managed cluster of 8 Raspberry Pi arm64 hosts and fail to add a new amd64 host: the new host gets the arm64 Ceph image from the cephadm cluster.
Am I doing something wrong, or is this a bug?
The starting point is a cluster running the current quay.io/ceph/ceph:v17 image:
root@rpi-230:/# ceph orch ps
NAME                   HOST     PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
alertmanager.rpi-230   rpi-230  *:9093,9094  running (20h)  15s ago    4w   21.6M    -        0.23.0   44a71f29f42b  6d8ae44964a3
crash.rpi-230          rpi-230               running (20h)  15s ago    4w   6597k    -        17.2.6   57fbebda1a97  a5c535dfabfc
crash.rpi-231          rpi-231               running (3d)   6m ago     4w   1404k    -        17.2.6   57fbebda1a97  9c728442c29a
crash.rpi-232          rpi-232               running (3d)   11s ago    3w   1396k    -        17.2.6   57fbebda1a97  7ea3704373dc
crash.rpi-233          rpi-233               running (3d)   96s ago    3w   1409k    -        17.2.6   57fbebda1a97  0a576e5231cf
crash.rpi-234          rpi-234               running (20h)  6m ago     2w   6597k    -        17.2.6   57fbebda1a97  759c0af92220
crash.rpi-235          rpi-235               running (3d)   7s ago     2w   1422k    -        17.2.6   57fbebda1a97  1a52ea801ebc
crash.rpi-236          rpi-236               running (3d)   10m ago    12d  1404k    -        17.2.6   57fbebda1a97  02c7a2aa3c83
crash.rpi-237          rpi-237               running (4h)   14s ago    5d   1413k    -        17.2.6   57fbebda1a97  d4a4b722597c
grafana.rpi-230        rpi-230  *:3000       running (20h)  15s ago    4w   51.2M    -        8.3.5    046209d1c628  29d121a4384a
mgr.rpi-230.zbxkes     rpi-230  *:8443,9283  running (20h)  15s ago    4w   624M     -        17.2.6   57fbebda1a97  bfc0fe15a129
mgr.rpi-234.pdstqj     rpi-234  *:8443,9283  running (20h)  6m ago     4d   435M     -        17.2.6   57fbebda1a97  350e2a6e8311
mgr.rpi-237.hpwnva     rpi-237  *:8443,9283  running (4h)   14s ago    3d   44.9M    -        17.2.6   57fbebda1a97  7f7694a403d5
mon.rpi-230            rpi-230               running (20h)  15s ago    4w   465M     2048M    17.2.6   57fbebda1a97  070c50abc00c
mon.rpi-234            rpi-234               running (20h)  6m ago     6d   454M     2048M    17.2.6   57fbebda1a97  218c88023f19
mon.rpi-236            rpi-236               running (3d)   10m ago    3d   79.2M    2048M    17.2.6   57fbebda1a97  52e89a42bdcf
node-exporter.rpi-230  rpi-230  *:9100       running (20h)  15s ago    4w   16.4M    -        1.3.1    bb203ba967a8  67665e0d1566
node-exporter.rpi-231  rpi-231  *:9100       running (2d)   6m ago     4w   13.0M    -        1.3.1    bb203ba967a8  99bb60f5ba98
node-exporter.rpi-232  rpi-232  *:9100       running (2d)   11s ago    3w   15.7M    -        1.3.1    bb203ba967a8  29f2fa4e05c6
node-exporter.rpi-233  rpi-233  *:9100       running (2d)   96s ago    3w   13.9M    -        1.3.1    bb203ba967a8  6d076336d074
node-exporter.rpi-234  rpi-234  *:9100       running (20h)  6m ago     2w   16.8M    -        1.3.1    bb203ba967a8  707cf4bd09c9
node-exporter.rpi-235  rpi-235  *:9100       running (2d)   7s ago     2w   13.2M    -        1.3.1    bb203ba967a8  5445a2b2da70
node-exporter.rpi-236  rpi-236  *:9100       running (2d)   10m ago    12d  13.2M    -        1.3.1    bb203ba967a8  58f3d098817a
node-exporter.rpi-237  rpi-237  *:9100       running (4h)   14s ago    5d   14.1M    -        1.3.1    bb203ba967a8  55535407d4b8
osd.0                  rpi-230               running (20h)  15s ago    3w   1687M    4096M    17.2.6   57fbebda1a97  5bf81cc9d2b7
osd.1                  rpi-235               running (8h)   7s ago     3d   502M     4096M    17.2.6   57fbebda1a97  fafbcba1db58
osd.2                  rpi-231               running (18h)  6m ago     3w   237M     4096M    17.2.6   57fbebda1a97  87574a61f57a
osd.3                  rpi-237               running (4h)   14s ago    3d   499M     4096M    17.2.6   57fbebda1a97  f93631975dda
osd.4                  rpi-231               running (18h)  6m ago     3w   283M     4096M    17.2.6   57fbebda1a97  108d54376319
osd.5                  rpi-233               running (22h)  96s ago    2w   271M     4096M    17.2.6   57fbebda1a97  89c6dd95e87b
osd.8                  rpi-233               running (8h)   96s ago    7d   289M     4096M    17.2.6   57fbebda1a97  95f187a1eea9
osd.9                  rpi-234               running (20h)  6m ago     2w   1928M    4096M    17.2.6   57fbebda1a97  53b3a296e6b5
osd.11                 rpi-236               running (51m)  10m ago    7d   432M     4096M    17.2.6   57fbebda1a97  d57e1233c7a2
osd.12                 rpi-232               running (25m)  11s ago    11d  459M     4096M    17.2.6   57fbebda1a97  b59afb9ed360
prometheus.rpi-230     rpi-230  *:9095       running (20h)  15s ago    4w   133M     -        2.33.4   49058af74c32  2bb58e1efaf7
The new host xeon-238 has the following local images cached:
root@xeon-238:~# podman images
REPOSITORY                        TAG     IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph                 v17     99cefc773578  2 days ago     1.29 GB
quay.io/prometheus/node-exporter  v1.3.1  1dbe0e931976  21 months ago  22.3 MB
Now I try to add the amd64 host xeon-238 with:
root@rpi-230:/# ceph orch host add xeon-238 192.168.3.238
Added host 'xeon-238' with addr '192.168.3.238'
The cluster log shows the following message:
8/29/23 12:35:13 PM [ERR] Failed while placing crash.xeon-238 on xeon-238: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-5e689148-2f27-11ee-9d7e-b827ebbe7ce9-crash-xeon-238
/usr/bin/podman: stderr Error: inspecting object: no such container ceph-5e689148-2f27-11ee-9d7e-b827ebbe7ce9-crash-xeon-238
Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-5e689148-2f27-11ee-9d7e-b827ebbe7ce9-crash.xeon-238
/usr/bin/podman: stderr Error: inspecting object: no such container ceph-5e689148-2f27-11ee-9d7e-b827ebbe7ce9-crash.xeon-238
Deploy daemon crash.xeon-238 ...
Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c -e NODE_NAME=xeon-238 -e CEPH_USE_RANDOM_NONCE=1 quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c -c %u %g /var/lib/ceph
stat: stderr WARNING: image platform ({arm64 linux [] }) does not match the expected platform ({amd64 linux [] })
stat: stderr ERROR (catatonit:2): failed to exec pid1: Exec format error
ERROR: Failed to extract uid/gid for path /var/lib/ceph: Failed command: /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c -e NODE_NAME=xeon-238 -e CEPH_USE_RANDOM_NONCE=1 quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c -c %u %g /var/lib/ceph: WARNING: image platform ({arm64 linux [] }) does not match the expected platform ({amd64 linux [] })
ERROR (catatonit:2): failed to exec pid1: Exec format error
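The digest in that podman run command is the arm64 entry of the v17 manifest list (see the manifest inspect output further down), and a digest reference pins exactly one platform-specific manifest, so podman on the amd64 host pulls the arm64 image and only warns about the mismatch. A minimal way to reproduce this by hand, assuming direct access to quay.io (digest copied from the log above):

# pulling by digest pins this specific (arm64) manifest entry, even on an
# amd64 host; podman only prints a platform-mismatch warning
podman pull quay.io/ceph/ceph@sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c

# pulling by tag lets podman resolve the manifest list to the host platform
podman pull quay.io/ceph/ceph:v17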
After this, xeon-238 has the following images cached:
root@xeon-238:~# podman images
REPOSITORY                        TAG     IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph                 v17     99cefc773578  2 days ago     1.29 GB
quay.io/ceph/ceph                 <none>  57fbebda1a97  6 weeks ago    1.2 GB
quay.io/prometheus/node-exporter  v1.3.1  1dbe0e931976  21 months ago  22.3 MB
The 57fb image seems to be the arm64 image instead of the 99ce amd64 image:
root@xeon-238:~# podman manifest inspect quay.io/ceph/ceph:v17
{
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "size": 743,
            "digest": "sha256:7aca73f4708ae6898efdf06dcde8cbea01a3a551a5772293f0d0f32f8da1fccb",
            "platform": {
                "architecture": "amd64",
                "os": "linux"
            }
        },
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "size": 743,
            "digest": "sha256:16a9765c36d13ff68da32894854da19c9f93148029be31468e1e2097466dea2c",
            "platform": {
                "architecture": "arm64",
                "os": "linux",
                "variant": "v8"
            }
        }
    ]
}
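The digest cephadm passed to podman on xeon-238 matches the arm64 entry above. To double-check, the architecture recorded in each cached image can be read back with standard podman inspect templating; the first command should report linux/arm64 and the second linux/amd64:

# OS/architecture baked into each cached image's config (IDs from 'podman images')
podman image inspect --format '{{.Os}}/{{.Architecture}}' 57fbebda1a97
podman image inspect --format '{{.Os}}/{{.Architecture}}' 99cefc773578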
Updated by John Mulligan 8 months ago
I don't know if Ceph supports mixed-architecture clusters; I'm pretty certain it's not tested. Assuming this isn't a production cluster, you may want to try:
ceph config set mgr mgr/cephadm/use_repo_digest false
Then make sure the incorrect images are purged from the x86_64 node, and retry adding it to the cluster. If this helps, then the problem is probably due to how cephadm tries to use a consistent image across the cluster by converting the image and tag to a digest reference.
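Roughly, the whole sequence could look like this. This is only an untested sketch that uses the image ID from your listing; removing and re-adding the host is just my assumption about the simplest way to retry:

# on a cluster node: stop converting image tags to digest references
ceph config set mgr mgr/cephadm/use_repo_digest false

# on xeon-238: purge the wrongly pulled arm64 image
podman rmi 57fbebda1a97

# on a cluster node: remove and re-add the host so cephadm deploys again
# (ceph orch host rm may need --force if daemons were partially deployed)
ceph orch host rm xeon-238
ceph orch host add xeon-238 192.168.3.238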
Updated by Steffen Schulze 8 months ago
Thank you, that was the hint that helped me.
After changing the global container_image from a digest to a tag, I was able to add the amd64 host without any problems:
$ ceph config set mgr mgr/cephadm/use_repo_digest false
$ ceph config set global container_image quay.io/ceph/ceph:v17
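For the record, the effective values can be read back to verify that the cluster now hands out a tag instead of a digest:

$ ceph config get mgr mgr/cephadm/use_repo_digest
$ ceph config get global container_image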
It will be interesting to see how the next upgrade goes, but at least I understand the underlying mechanism better now.
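If I understand the mechanism correctly, an upgrade by tag should again let each host resolve its own platform from the manifest list. An untested sketch; the target version is only an example:

# start a tag-based upgrade and watch its progress
$ ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.7
$ ceph orch upgrade status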