Bug #50441
closed
cephadm bootstrap on arm64 fails to start ceph/ceph-grafana service
% Done: 100%
Description
Hello,
I installed a new Ceph 15.2.10 cluster on Ubuntu 20.04 arm64 bare metal, starting with a first monitor/manager node set up via the new "cephadm bootstrap" tool with the following command:
cephadm bootstrap --mon-ip 192.168.1.11
but unfortunately the grafana service is not working at all. It tries to restart the ceph/ceph-grafana container every 10 minutes but fails each time; as the logs below show, it looks like there is no arm64 version of this container image:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1021, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1168, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon grafana.ceph1a ...
Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host --net=host --entrypoint stat -e CONTAINER_IMAGE=docker.io/ceph/ceph-grafana:6.7.4 -e NODE_NAME=ceph1a docker.io/ceph/ceph-grafana:6.7.4 -c %u %g /var/lib/grafana
stat: stderr {"msg":"exec container process `/usr/bin/stat`: Exec format error","level":"error","time":"2021-04-09T06:17:54.000910863Z"}
Traceback (most recent call last):
  File "<stdin>", line 6153, in <module>
  File "<stdin>", line 1412, in _default_image
  File "<stdin>", line 3431, in command_deploy
  File "<stdin>", line 3362, in extract_uid_gid_monitoring
  File "<stdin>", line 2099, in extract_uid_gid
RuntimeError: uid/gid not found
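For context: "Exec format error" is what the kernel returns when a binary built for one architecture (here amd64) is run on another (arm64), and cephadm hits it because it runs `stat` inside the image to discover the uid/gid for /var/lib/grafana. One way to confirm which architectures an image actually ships is to inspect its manifest list, e.g. with `skopeo inspect --raw docker://docker.io/ceph/ceph-grafana:6.7.4`. The sketch below parses such a manifest list in Python; the embedded JSON is hypothetical sample data, not the real ceph-grafana manifest:

```python
import json

# Illustrative manifest list in the Docker/OCI "manifest list" format,
# roughly what `skopeo inspect --raw` returns for a multi-arch image.
# The entries below are hypothetical sample data, not the real manifest.
manifest_list = json.loads("""
{
  "schemaVersion": 2,
  "manifests": [
    {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}}
  ]
}
""")

def supported_architectures(manifest):
    """Return the set of (os, arch) pairs a multi-arch image offers."""
    return {
        (m["platform"]["os"], m["platform"]["architecture"])
        for m in manifest.get("manifests", [])
    }

archs = supported_architectures(manifest_list)
print(archs)
# An image whose manifest lists only amd64 will fail with
# "Exec format error" when podman runs it on an arm64 host.
print(("linux", "arm64") in archs)
```

If ("linux", "arm64") is missing from the set, option 1 above (publishing an arm64 image) is the only real fix; option 2 merely avoids the symptom.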
So I see two options here:
1) provide an arm64 docker image for the ceph/ceph-grafana container (preferred)
2) check for arm64 arch and do not deploy the grafana service on this architecture until 1) is fixed
I think it is a real win for Ceph to fully work on the arm64 architecture, so it would be great if this could be taken care of. If you need more details or more log data, do not hesitate to contact me.
Thank you very much in advance.
Updated by Sebastian Wagner almost 3 years ago
- Category changed from cephadm to cephadm/monitoring
Updated by Sebastian Wagner almost 3 years ago
- Status changed from New to Fix Under Review
- Assignee set to Dan Mick
- Pull request ID set to 41559
Updated by Kefu Chai almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to octopus, pacific
Updated by Kefu Chai almost 3 years ago
- Copied to Backport #51549: pacific: cephadm bootstrap on arm64 fails to start ceph/ceph-grafana service added
Updated by Kefu Chai almost 3 years ago
- Copied to Backport #51551: octopus: cephadm bootstrap on arm64 fails to start ceph/ceph-grafana service added
Updated by Sebastian Wagner almost 3 years ago
- Status changed from Pending Backport to Resolved
Updated by Deepika Upadhyay almost 3 years ago
- Status changed from Resolved to Pending Backport
Updated by Deepika Upadhyay almost 3 years ago
- Project changed from Orchestrator to RADOS
- Category deleted (cephadm/monitoring)
moved temporarily to RADOS so that we can use the backport scripts
Updated by Deepika Upadhyay almost 3 years ago
- Status changed from Pending Backport to Resolved
Updated by M B over 2 years ago
Unfortunately this issue does not seem to be resolved, or at least not with Pacific 16.2.5. I installed a fresh new cluster with "cephadm bootstrap --mon-ip <IP>" and it is stuck at "Updating prometheus deployment", as you can see below in the "ceph -s" output:
  cluster:
    id:     fb48d256-f43d-11eb-9f74-7fd39d4b232f
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum ceph1a (age 76m)
    mgr: no daemons active (since 64m)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

  progress:
    Updating prometheus deployment (+1 -> 1) (0s)
      [............................]
The web admin interface was working for the first few minutes after bootstrapping, but then it stopped, and commands such as "ceph orch host ls" just stall and never return any output.
This is Ubuntu 20.04 LTS as host on aarch64.
Let me know if you need any more details.
Updated by Dan Mick over 2 years ago
Can't reproduce the failure; I just started a mon-and-mgr bootstrapped cluster with no incident:
# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager   ?:9093,9094      1/1  40s ago    6m   count:1
crash                           1/1  40s ago    7m   *
grafana        ?:3000           1/1  40s ago    6m   count:1
mgr                             1/2  40s ago    7m   count:2
mon                             1/5  40s ago    7m   count:5
node-exporter  ?:9100           1/1  40s ago    6m   *
prometheus     ?:9095           1/1  40s ago    6m   count:1
The base OS was CentOS 8, but that shouldn't matter. I guess we need to know why the prometheus update was failing. Are there any hints in /var/log/ceph/cephadm.log?
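When a deployment stalls like this, /var/log/ceph/cephadm.log on the affected host is usually the first place to look, since it records every podman/docker command cephadm runs and any traceback. A minimal sketch of filtering it for the usual culprits; the log excerpt below is hypothetical sample data, and on a real node you would read the actual file:

```python
import re

# Hypothetical excerpt of /var/log/ceph/cephadm.log (sample data only;
# on a real node, read the actual file instead).
sample_log = """\
2021-08-03 10:00:01,123 DEBUG Running command: /usr/bin/podman pull docker.io/prom/prometheus:v2.18.1
2021-08-03 10:00:05,456 ERROR cephadm exited with an error code: 1
2021-08-03 10:00:05,457 DEBUG Traceback (most recent call last):
"""

def suspicious_lines(text):
    """Return log lines that typically explain a stalled deployment."""
    pattern = re.compile(r"ERROR|Traceback|Non-zero exit code", re.IGNORECASE)
    return [line for line in text.splitlines() if pattern.search(line)]

for line in suspicious_lines(sample_log):
    print(line)
```

The equivalent one-liner on the host would be a grep for those patterns over the log file; recent releases also surface similar messages through the cluster log.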
Updated by Loïc Dachary over 2 years ago
- Status changed from Resolved to Pending Backport
Updated by Loïc Dachary over 2 years ago
Deepika, you marked this issue resolved but I can't figure out why. Would you be so kind as to explain? Thanks in advance!
Updated by Deepika Upadhyay over 2 years ago
@Loïc Dachary, sure: the PR addressing this issue was backported to pacific, and I spoke to Dan, who said an octopus backport is not necessary. So I marked it as resolved after the pacific and master merges.
Updated by Deepika Upadhyay over 2 years ago
- Status changed from Pending Backport to Need More Info
M B wrote:
Unfortunately this issue does not seem to be resolved, or at least not with Pacific 16.2.5. I installed a fresh new cluster
Updated by M B over 2 years ago
@Deepika: I now think the issue I mentioned last week, regarding the prometheus deployment after a new cluster installation with Pacific, is unrelated, because I simply rebooted the node and prometheus finally got deployed. So getting stuck during bootstrap is a problem, but it is not related to this specific issue. I have been reading on ceph-users that some other users are seeing similar issues where services that should be deployed simply get stuck; hopefully there is another tracker issue open for that. From my side all good.
Updated by Neha Ojha over 2 years ago
Deepika: why is this issue in need-more-info? Looks like the original fix and pacific backport https://github.com/ceph/ceph/pull/42211 have merged?
Updated by Dan Mick over 2 years ago
I assume because of M B's comment, but that now seems to be historical.
Updated by Deepika Upadhyay over 2 years ago
- Status changed from Need More Info to Resolved
Dan Mick wrote:
Deepika, was that the reason why?
Yep Dan, Neha marked it need-more-info because of M B's comment; marking it as resolved since that's no longer valid. Feel free to reopen if otherwise.
Updated by Konstantin Shalygin almost 1 year ago
- Status changed from Resolved to Rejected
- % Done changed from 0 to 100