Bug #50441
closed
cephadm bootstrap on arm64 fails to start ceph/ceph-grafana service
% Done: 100%
Description
Hello,
I installed a new Ceph 15.2.10 cluster on Ubuntu 20.04 arm64 bare metal, starting with a first monitor/manager node set up via the new "cephadm bootstrap" tool with the following command:
cephadm bootstrap --mon-ip 192.168.1.11
but unfortunately the grafana service is not working at all. It tries to restart the ceph/ceph-grafana container every 10 minutes but fails each time; as the logs below show, it looks like there is no arm64 version of this container image:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1021, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1168, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon grafana.ceph1a ...
Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host --net=host --entrypoint stat -e CONTAINER_IMAGE=docker.io/ceph/ceph-grafana:6.7.4 -e NODE_NAME=ceph1a docker.io/ceph/ceph-grafana:6.7.4 -c %u %g /var/lib/grafana
stat: stderr {"msg":"exec container process `/usr/bin/stat`: Exec format error","level":"error","time":"2021-04-09T06:17:54.000910863Z"}
Traceback (most recent call last):
  File "<stdin>", line 6153, in <module>
  File "<stdin>", line 1412, in _default_image
  File "<stdin>", line 3431, in command_deploy
  File "<stdin>", line 3362, in extract_uid_gid_monitoring
  File "<stdin>", line 2099, in extract_uid_gid
RuntimeError: uid/gid not found
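For context: "Exec format error" is what the kernel returns when a binary built for one architecture (here amd64) is run on another (arm64), and cephadm hits it because it runs `stat` inside the image to discover the uid/gid for /var/lib/grafana. One way to confirm which architectures an image actually ships is to inspect its manifest list, e.g. with `skopeo inspect --raw docker://docker.io/ceph/ceph-grafana:6.7.4`. The sketch below parses such a manifest list in Python; the embedded JSON is hypothetical sample data, not the real ceph-grafana manifest:

```python
import json

# Illustrative manifest list in the Docker/OCI "manifest list" format,
# roughly what `skopeo inspect --raw` returns for a multi-arch image.
# The entries below are hypothetical sample data, not the real manifest.
manifest_list = json.loads("""
{
  "schemaVersion": 2,
  "manifests": [
    {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}}
  ]
}
""")

def supported_architectures(manifest):
    """Return the set of (os, arch) pairs a multi-arch image offers."""
    return {
        (m["platform"]["os"], m["platform"]["architecture"])
        for m in manifest.get("manifests", [])
    }

archs = supported_architectures(manifest_list)
print(archs)
# An image whose manifest lists only amd64 will fail with
# "Exec format error" when podman runs it on an arm64 host.
print(("linux", "arm64") in archs)
```

If ("linux", "arm64") is missing from the set, option 1 above (publishing an arm64 image) is the only real fix; option 2 merely avoids the symptom.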
So I see two options here:
1) provide an arm64 docker image for the ceph/ceph-grafana container (preferred)
2) check for arm64 arch and do not deploy the grafana service on this architecture until 1) is fixed
I think it is a real win for Ceph to fully work on the arm64 architecture, so it would be great if this could be taken care of. If you need more details or more log data, do not hesitate to contact me.
Thank you very much in advance.
Updated by Sebastian Wagner almost 3 years ago
- Category changed from cephadm to cephadm/monitoring
Updated by Sebastian Wagner almost 3 years ago
- Status changed from New to Fix Under Review
- Assignee set to Dan Mick
- Pull request ID set to 41559
Updated by Kefu Chai almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to octopus, pacific
Updated by Kefu Chai almost 3 years ago
- Copied to Backport #51549: pacific: cephadm bootstrap on arm64 fails to start ceph/ceph-grafana service added
Updated by Kefu Chai almost 3 years ago
- Copied to Backport #51551: octopus: cephadm bootstrap on arm64 fails to start ceph/ceph-grafana service added
Updated by Sebastian Wagner almost 3 years ago
- Status changed from Pending Backport to Resolved
Updated by Deepika Upadhyay almost 3 years ago
- Status changed from Resolved to Pending Backport
Updated by Deepika Upadhyay almost 3 years ago
- Project changed from Orchestrator to RADOS
- Category deleted (cephadm/monitoring)
moved temporarily to RADOS so that we can use the backport scripts
Updated by Deepika Upadhyay almost 3 years ago
- Status changed from Pending Backport to Resolved
Updated by M B over 2 years ago
Unfortunately this issue does not seem to be resolved, or at least not with Pacific 16.2.5. I installed a fresh new cluster with "cephadm bootstrap --mon-ip <IP>" and it is stuck at "Updating prometheus deployment", as you can see below in the "ceph -s" output:
  cluster:
    id:     fb48d256-f43d-11eb-9f74-7fd39d4b232f
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum ceph1a (age 76m)
    mgr: no daemons active (since 64m)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

  progress:
    Updating prometheus deployment (+1 -> 1) (0s)
      [............................]
The web admin interface was working for the first few minutes after bootstrapping, but then it stopped, and commands such as "ceph orch host ls" just stall and never return any output.
This is Ubuntu 20.04 LTS as host on aarch64.
Let me know if you need any more details.
Updated by Dan Mick over 2 years ago
Can't reproduce the failure; I just started a mon-and-mgr bootstrapped cluster with no incident:
# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager   ?:9093,9094      1/1  40s ago    6m   count:1
crash                           1/1  40s ago    7m   *
grafana        ?:3000           1/1  40s ago    6m   count:1
mgr                             1/2  40s ago    7m   count:2
mon                             1/5  40s ago    7m   count:5
node-exporter  ?:9100           1/1  40s ago    6m   *
prometheus     ?:9095           1/1  40s ago    6m   count:1
The base OS was CentOS 8, but that shouldn't matter. I guess we need to know why the prometheus update was failing. Are there any hints in /var/log/ceph/cephadm.log?
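When a deployment stalls like this, /var/log/ceph/cephadm.log on the affected host is usually the first place to look, since it records every podman/docker command cephadm runs and any traceback. A minimal sketch of filtering it for the usual culprits; the log excerpt below is hypothetical sample data, and on a real node you would read the actual file:

```python
import re

# Hypothetical excerpt of /var/log/ceph/cephadm.log (sample data only;
# on a real node, read the actual file instead).
sample_log = """\
2021-08-03 10:00:01,123 DEBUG Running command: /usr/bin/podman pull docker.io/prom/prometheus:v2.18.1
2021-08-03 10:00:05,456 ERROR cephadm exited with an error code: 1
2021-08-03 10:00:05,457 DEBUG Traceback (most recent call last):
"""

def suspicious_lines(text):
    """Return log lines that typically explain a stalled deployment."""
    pattern = re.compile(r"ERROR|Traceback|Non-zero exit code", re.IGNORECASE)
    return [line for line in text.splitlines() if pattern.search(line)]

for line in suspicious_lines(sample_log):
    print(line)
```

The equivalent one-liner on the host would be a grep for those patterns over the log file; recent releases also surface similar messages through the cluster log.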
Updated by Loïc Dachary over 2 years ago
- Status changed from Resolved to Pending Backport
Updated by Loïc Dachary over 2 years ago
Deepika, you marked this issue resolved but I can't figure out why. Would you be so kind as to explain? Thanks in advance!
Updated by Deepika Upadhyay over 2 years ago
@Loïc Dachary, sure: the PR addressing this issue was backported to pacific, and I spoke to Dan, who said an octopus backport is not necessary. So I marked it as resolved after the pacific and master merges.
Updated by Deepika Upadhyay over 2 years ago
- Status changed from Pending Backport to Need More Info
M B wrote:
Unfortunately this issue does not seem to be resolved, or at least not with Pacific 16.2.5. I installed a fresh new cluster
Updated by M B over 2 years ago
@Deepika: I now think the issue I mentioned last week, regarding the prometheus deployment after a new cluster installation with Pacific, is unrelated, because I simply rebooted the node and prometheus finally got deployed. So getting stuck during bootstrap is a problem, but it is not related to this specific issue. I have been reading on ceph-users that some other users are seeing similar issues where services that should be deployed simply get stuck; hopefully there is another tracker issue open for that. From my side all good.
Updated by Neha Ojha over 2 years ago
Deepika: why is this issue in need-more-info? Looks like the original fix and pacific backport https://github.com/ceph/ceph/pull/42211 have merged?
Updated by Dan Mick over 2 years ago
I assume because of M B's comment, but that now seems to be historical.
Updated by Deepika Upadhyay over 2 years ago
- Status changed from Need More Info to Resolved
Dan Mick wrote:
Deepika, was that the reason why?
Yep Dan, Neha marked it need-more-info because of M B's comment; marking it as resolved since that's no longer valid. Feel free to reopen if otherwise.
Updated by Konstantin Shalygin almost 1 year ago
- Status changed from Resolved to Rejected
- % Done changed from 0 to 100