Bug #57305
bootstrap mgr timeout is too short
Status: Closed
Description
While doing a test installation of Ceph Quincy on our rock64 ARM64 machines, I spotted a problem with the timeout for waiting for the manager to come alive. I'd like to propose making it wait a little longer for slower machines. The mgr came up eventually with only 2 ticks left. Since the rock64 is running with an attached SSD, I would suggest raising the ticks to 30.
Updated by Andreas Elvers over 1 year ago
This concerns the bootstrap mgr wait timeout.
Updated by Andreas Elvers over 1 year ago
root@rock64:~# cephadm bootstrap --mon-ip 192.168.50.36
Creating directory /etc/ceph for ceph.conf
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chrony.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Host looks OK
Cluster fsid: cb682766-252f-11ed-a9f6-2e18419c566b
Verifying IP 192.168.50.36 port 3300 ...
Verifying IP 192.168.50.36 port 6789 ...
Mon IP `192.168.50.36` is in CIDR network `192.168.50.0/23`
Mon IP `192.168.50.36` is in CIDR network `192.168.50.0/23`
Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.io/ceph/ceph:v17...
Ceph version: ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.50.0/23
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr not available, waiting (4/15)...
mgr not available, waiting (5/15)...
mgr not available, waiting (6/15)...
mgr not available, waiting (7/15)...
mgr not available, waiting (8/15)...
mgr not available, waiting (9/15)...
mgr not available, waiting (10/15)...
mgr not available, waiting (11/15)...
mgr not available, waiting (12/15)...
mgr not available, waiting (13/15)...
mgr is available
[ ... ]
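For readers unfamiliar with the wait loop behind the "(n/15)" messages above, here is a minimal shell sketch of that behaviour. This is not cephadm's actual implementation; the ceph mgr stat probe and the sleep spacing are assumptions for illustration only.

# Minimal sketch of the retry behaviour seen in the log; not cephadm's code.
# The probe command (ceph mgr stat) and the sleep interval are assumptions.
RETRY=15     # number of ticks (cephadm default per the comment below)
TIMEOUT=60   # overall budget in seconds (cephadm default per the comment below)
for n in $(seq 1 "$RETRY"); do
    if ceph mgr stat >/dev/null 2>&1; then
        echo "mgr is available"
        break
    fi
    echo "mgr not available, waiting ($n/$RETRY)..."
    sleep $(( TIMEOUT / RETRY ))
done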
Updated by Redouane Kachach Elhichou over 1 year ago
You can adjust the timeout by providing the --timeout argument (in seconds; default 60s), and you can also increase the --retry counter (default 15).
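For example, a re-run on a slow board might look like the sketch below. The values 120 and 30 are illustrative, and the placement of the options relative to the bootstrap subcommand is assumed, so verify with cephadm --help on your version.

# Illustrative values for a slow ARM board; adjust to taste.
# Option placement relative to the subcommand is assumed; verify with --help.
cephadm --timeout 120 --retry 30 bootstrap --mon-ip 192.168.50.36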
Updated by Redouane Kachach Elhichou over 1 year ago
- Category changed from orchestrator to cephadm (binary)
Updated by Andreas Elvers over 1 year ago
To be more specific on the machine setup: it is a 4 GB rock64 running Ubuntu 20.04 off a USB-attached SSD.
Updated by Redouane Kachach Elhichou over 1 year ago
Thanks. Did you try using the --timeout / --retry arguments to adapt the timeout to your specific case, and did these args help to solve your issue?
Updated by Andreas Elvers over 1 year ago
Redouane Kachach Elhichou wrote:
Thanks. Did you try using the --timeout / --retry arguments to adapt the timeout to your specific case, and did these args help to solve your issue?
It worked for me anyway; the mgr was created just in time. I am currently testing, so I will re-do the setup and set the retry count to 30. But I think raising the default timeout a bit could be helpful. Thanks for pointing out the --timeout option.
Updated by Redouane Kachach Elhichou over 1 year ago
Raising the default timeout may also lead to a slow reaction in case of a real issue. IMHO no changes are needed: the default timeout is already high enough for most use cases and is adjustable with the arguments I mentioned earlier. Since there's no issue with the current behavior, I'd suggest closing this tracker.
Updated by Redouane Kachach Elhichou over 1 year ago
- Status changed from New to Closed
Closing because the timeout is configurable and the default values are good enough.