
Bug #57305

bootstrap mgr timeout is too short

Added by Andreas Elvers 3 months ago. Updated 2 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
cephadm (binary)
Target version:
% Done:

0%

Source:
Community (user)
Tags:
cephadm arm
Backport:
Regression:
No
Severity:
5 - suggestion
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While doing a test installation of Ceph Quincy on our rock64 ARM64 machines, I spotted a problem with the timeout while waiting for the manager to come alive. I'd like to suggest making it wait a little longer for slower machines. The mgr came up eventually with only 2 ticks left. Since the rock64 is running with an attached SSD, I would suggest raising the ticks to 30.

History

#1 Updated by Andreas Elvers 3 months ago

This concerns the bootstrap mgr wait timeout.

#2 Updated by Andreas Elvers 3 months ago

root@rock64:~# cephadm bootstrap --mon-ip 192.168.50.36
Creating directory /etc/ceph for ceph.conf
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chrony.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Host looks OK
Cluster fsid: cb682766-252f-11ed-a9f6-2e18419c566b
Verifying IP 192.168.50.36 port 3300 ...
Verifying IP 192.168.50.36 port 6789 ...
Mon IP `192.168.50.36` is in CIDR network `192.168.50.0/23`
Mon IP `192.168.50.36` is in CIDR network `192.168.50.0/23`
Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.io/ceph/ceph:v17...
Ceph version: ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.50.0/23
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr not available, waiting (4/15)...
mgr not available, waiting (5/15)...
mgr not available, waiting (6/15)...
mgr not available, waiting (7/15)...
mgr not available, waiting (8/15)...
mgr not available, waiting (9/15)...
mgr not available, waiting (10/15)...
mgr not available, waiting (11/15)...
mgr not available, waiting (12/15)...
mgr not available, waiting (13/15)...
mgr is available
[ ... ]

#3 Updated by Redouane Kachach Elhichou 3 months ago

You can adjust the timeout (to use a higher value) by providing the --timeout argument (in seconds, 60s by default), and you can also increase the --retry counter (15 by default).
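The --retry counter corresponds to the bounded polling loop visible in the bootstrap log above ("mgr not available, waiting (n/15)..."). A minimal Python sketch of that pattern follows; the function and parameter names are hypothetical and this is not the actual cephadm source:

```python
import time


def wait_with_retries(check, retries=15, interval=1.0):
    """Poll `check` up to `retries` times, sleeping `interval`
    seconds between attempts. Returns True as soon as `check`
    succeeds, False if all attempts are exhausted.

    Sketch of the retry pattern behind cephadm's
    'mgr not available, waiting (n/15)...' messages."""
    for attempt in range(1, retries + 1):
        if check():
            return True
        print(f'mgr not available, waiting ({attempt}/{retries})...')
        time.sleep(interval)
    return False
```

With the log above, the mgr became reachable on the 13th poll, so raising the retry count (or the per-attempt timeout) simply widens this loop's budget for slow hardware.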

#4 Updated by Redouane Kachach Elhichou 3 months ago

  • Category changed from orchestrator to cephadm (binary)

#5 Updated by Andreas Elvers 3 months ago

To be more specific about the machine setup: it is a 4GB rock64 running Ubuntu 20.04 off a USB-attached SSD.

#6 Updated by Redouane Kachach Elhichou 3 months ago

Thanks. Did you try using the --timeout / --retry arguments to adapt the timeout to your specific case, and did they help solve your issue?

#7 Updated by Andreas Elvers 3 months ago

Redouane Kachach Elhichou wrote:

Thanks. Did you try using the --timeout / --retry arguments to adapt the timeout to your specific case, and did they help solve your issue?

It worked for me anyway; the mgr was created just in time. I am currently testing, so I will re-do the setup and set the retry count to 30. But I think raising the default timeout a bit could be helpful. Thanks for pointing out the --timeout option.

#8 Updated by Redouane Kachach Elhichou 3 months ago

Raising the default timeout may also lead to a slower reaction in case of a real issue. IMHO no changes are needed, since the default timeout is already high enough for most use cases and can be adjusted with the arguments I mentioned earlier. Since there's no issue with the current behavior, I'd suggest closing this tracker.

#9 Updated by Redouane Kachach Elhichou 2 months ago

  • Status changed from New to Closed

Closing because the timeout is configurable and the default values are good enough.
