Bug #51642: cephadm/rgw : RGW server is not coming up: Initialization timeout, failed to initialize - Orchestrator - Ceph

Bug #51642

closed

cephadm/rgw : RGW server is not coming up: Initialization timeout, failed to initialize

Added by Jiffin Tony Thottan almost 3 years ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Category: cephadm/rgw
Severity: 2 - major

Description

I created the Ceph cluster via cephadm and it looks fine, but deploying RGW fails and the logs give little information about the failure (they only say "failed to initialize"). Any help with debugging would be much appreciated.

1. Bootstrap the Ceph cluster with cephadm bootstrap --mon-ip <IP> on a Fedora 34 VM, which uses podman.
2. Add the VM as a host to the Ceph cluster.
3. Add three storage disks and create the OSDs with ceph orch daemon add osd <host>:<device-path>
4. Deploy RGW with ceph orch apply rgw newstore1 --port=7080 --placement=1

ceph status
  cluster:
    id:     bd183ab8-dfd8-11eb-83cf-525400bf929c
    health: HEALTH_WARN
            1 failed cephadm daemon(s)
            Reduced data availability: 33 pgs inactive
            Degraded data redundancy: 33 pgs undersized

  services:
    mon: 1 daemons, quorum fedora (age 85m)
    mgr: fedora.vrlaif(active, since 85m)
    osd: 3 osds: 3 up (since 85m), 3 in (since 5d)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   17 MiB used, 30 GiB / 30 GiB avail
    pgs:     100.000% pgs not active
             33 undersized+peered

ceph osd pool ls
device_health_metrics
.rgw.root

ceph orch ps
NAME                         HOST    PORTS        STATUS         REFRESHED  AGE  VERSION    IMAGE ID      CONTAINER ID
alertmanager.fedora          fedora  *:9093,9094  running (99m)  7m ago     5d   0.20.0     0881eb8f169f  97a9a4130594
crash.fedora                 fedora               running (99m)  7m ago     5d   16.2.4     8d91d370c2b8  7eab2bef9692
grafana.fedora               fedora  *:3000       running (99m)  7m ago     5d   6.7.4      ae5c36c3d3cd  5e98deb7ac7a
mgr.fedora.vrlaif            fedora  *:9283       running (99m)  7m ago     5d   16.2.4     8d91d370c2b8  f33d98b6edd7
mon.fedora                   fedora               running (99m)  7m ago     5d   16.2.4     8d91d370c2b8  c7e1168011c6
node-exporter.fedora         fedora  *:9100       running (99m)  7m ago     5d   0.18.1     e5a616e4b9cf  a397fbac7bb0
osd.0                        fedora               running (99m)  7m ago     5d   16.2.4     8d91d370c2b8  fc1c1f9ac6dc
osd.1                        fedora               running (99m)  7m ago     5d   16.2.4     8d91d370c2b8  5379d795c1fb
osd.2                        fedora               running (99m)  7m ago     5d   16.2.4     8d91d370c2b8  99c5d3447c44
prometheus.fedora            fedora  *:9095       running (99m)  7m ago     5d   2.18.1     de242295e225  e2e71c2f9bcf
rgw.newstore1.fedora.mkhhgr  fedora  *:7080       error          7m ago     4d   <unknown>  <unknown>     <unknown>

I am attaching the logs, the RGW pod spec, and the systemctl status of RGW as files.

Files

systemcctl_status (1.89 KB) - systemctl status - Jiffin Tony Thottan, 07/13/2021 12:20 PM
latest_podspec (16.7 KB) - podspec - Jiffin Tony Thottan, 07/13/2021 12:20 PM
cephadm.log (38.8 KB) - cephadmlog - Jiffin Tony Thottan, 07/13/2021 12:20 PM
ceph-client.rgw.newstore1.fedora.mkhhgr.log (5.33 KB) - rgwlog - Jiffin Tony Thottan, 07/13/2021 12:21 PM
ceph.tar.xz (704 KB) - full logs - Jiffin Tony Thottan, 07/13/2021 12:24 PM

Updated by Sebastian Wagner almost 3 years ago

Description updated

Updated by Sebastian Wagner almost 3 years ago

Subject changed from cephadm/rgw : RGW server is not coming up to cephadm/rgw : RGW server is not coming up: Initialization timeout, failed to initialize

Updated by Sebastian Wagner almost 3 years ago

The RGW log looks like this:

2021-07-13T10:39:08.749+0000 7f8f5770b440  0 deferred set uid:gid to 167:167 (ceph:ceph)
2021-07-13T10:39:08.749+0000 7f8f5770b440  0 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable), process radosgw, pid 2
2021-07-13T10:39:08.749+0000 7f8f5770b440  0 framework: beast
2021-07-13T10:39:08.749+0000 7f8f5770b440  0 framework conf key: port, val: 7080
2021-07-13T10:39:08.749+0000 7f8f5770b440  1 radosgw_Main not setting numa affinity
2021-07-13T10:44:08.750+0000 7f8f4399b700 -1 Initialization timeout, failed to initialize
(repeats 5 times)
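
The two timestamps in the log are almost exactly five minutes apart, which suggests radosgw sat waiting for the whole initialization window and did not crash early. A minimal sketch of that arithmetic, using only the timestamps quoted above (the helper name seconds_between is made up for illustration). My understanding is that this window is governed by the rgw_init_timeout option with a default of 300 seconds, but treat that option name and default as an assumption to verify against the documentation for your release.

```python
from datetime import datetime

# Timestamps copied from the first and last log lines quoted above.
STARTUP = "2021-07-13T10:39:08.749+0000"
TIMEOUT = "2021-07-13T10:44:08.750+0000"


def seconds_between(first: str, second: str) -> float:
    """Return the number of seconds from `first` to `second`."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f%z"  # the timestamp layout used in the rgw log
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return delta.total_seconds()


elapsed = seconds_between(STARTUP, TIMEOUT)
print(f"{elapsed:.3f} s between startup and the timeout message")
```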

Updated by Sebastian Wagner almost 3 years ago

Try running

ceph orch daemon rm rgw.newstore1.fedora.mkhhgr

If this doesn't help, I will probably need some help from the RGW team.

Updated by Dimitri Savineau almost 3 years ago

    pgs:     100.000% pgs not active
             33 undersized+peered

I'm pretty sure the issue occurred because the PGs aren't active+clean.

The default RGW pools (.rgw.root, default.rgw.control, default.rgw.meta and default.rgw.log) should be created during the first RGW instance start.

But if the first RGW pool (.rgw.root) isn't in a correct state then the others aren't created.
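
One way to catch this before deploying RGW is to count the PGs that are not active. The sketch below works on a hand-written sample that mirrors the ceph status output in the description. The JSON field names (pgmap, num_pgs, pgs_by_state, state_name, count) are what I would expect from ceph status --format json, but they are an assumption here and should be checked against real output. The function name inactive_pg_count is hypothetical.

```python
import json

# Hand-written sample mirroring the "ceph status" output in the description.
# The field names are assumed to match "ceph status --format json".
SAMPLE = """
{"pgmap": {"num_pgs": 33,
           "pgs_by_state": [{"state_name": "undersized+peered", "count": 33}]}}
"""


def inactive_pg_count(status: dict) -> int:
    """Count PGs whose state string does not contain the 'active' flag."""
    states = status.get("pgmap", {}).get("pgs_by_state", [])
    return sum(
        entry["count"]
        for entry in states
        if "active" not in entry["state_name"].split("+")
    )


status = json.loads(SAMPLE)
print(inactive_pg_count(status), "of", status["pgmap"]["num_pgs"], "PGs are not active")
```

With real data, the same function could be fed the parsed output of the status command. A non-zero result would be a hint to fix the pools first.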

@Jiffin Tony Thottan: Since you have an all-in-one setup, would you be able to retest this with the osd_pool_default_size variable set to 1 before deploying the OSDs?

ceph config set global osd_pool_default_size 1
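
To illustrate why the pool size matters on a single host, here is a toy model, not Ceph code. It assumes the default CRUSH rule places at most one replica per host, and that the default min_size is size - size // 2 when none is configured. Both are my understanding of the defaults and worth verifying for your release. The function names are made up for illustration.

```python
def default_min_size(size: int) -> int:
    """Assumed default min_size when none is configured: size - size // 2."""
    return size - size // 2


def pg_can_go_active(hosts: int, size: int) -> bool:
    """Toy check: can a PG reach min_size replicas with one replica per host?"""
    placed = min(hosts, size)  # a host-level failure domain caps replicas at one per host
    return placed >= default_min_size(size)


# One host, as in the all-in-one setup from the description.
for size in (3, 1):
    print(f"size={size}, hosts=1 -> active: {pg_can_go_active(1, size)}")
```

Under these assumptions a size-3 pool on one host gets a single replica, which is below min_size 2, so the PGs stay undersized+peered. A size-1 pool needs only that one replica.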

Updated by Dimitri Savineau almost 3 years ago

Answering my own question...

# cephadm --image quay.io/ceph/ceph:v16.2.5 bootstrap --mon-ip x.x.x.x --skip-pull --skip-dashboard --skip-monitoring-stack
# cephadm shell ceph config set global mon_warn_on_pool_no_redundancy false
# cephadm shell ceph config set global osd_pool_default_size 1
# cephadm shell ceph orch apply osd --all-available-devices
Scheduled osd.all-available-devices update...
# cephadm shell ceph orch apply rgw newstore1 --port=7080 --placement=1
Scheduled rgw.newstore1 update...
# cephadm shell ceph orch ls --service_type rgw
NAME           PORTS   RUNNING  REFRESHED  AGE  PLACEMENT  
rgw.newstore1  ?:7080      1/1  5m ago     5m   count:1
# cephadm shell ceph orch ps --service_name rgw.newstore1
NAME                          HOST     PORTS   STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
rgw.newstore1.cephaio.xcyqan  cephaio  *:7080  running (5m)     5m ago   5m    15.5M        -  16.2.5   6933c2a0b7dd  c3cf0c0aa3ad

So I think we can close this issue.
