Bug #57305
bootstrap mgr timeout is too short
Status: Closed
Description
While doing a test installation of Ceph Quincy on our rock64 ARM64 machines, I spotted a problem with the timeout for waiting for the manager to come alive. I'd like to propose making it wait a little longer for slower machines. The mgr came up eventually with only 2 ticks left. Since the rock64 is running with an attached SSD, I would suggest raising the ticks to 30.
Updated by Andreas Elvers over 1 year ago
This concerns the bootstrap mgr wait timeout.
Updated by Andreas Elvers over 1 year ago
root@rock64:~# cephadm bootstrap --mon-ip 192.168.50.36
Creating directory /etc/ceph for ceph.conf
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chrony.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Host looks OK
Cluster fsid: cb682766-252f-11ed-a9f6-2e18419c566b
Verifying IP 192.168.50.36 port 3300 ...
Verifying IP 192.168.50.36 port 6789 ...
Mon IP `192.168.50.36` is in CIDR network `192.168.50.0/23`
Mon IP `192.168.50.36` is in CIDR network `192.168.50.0/23`
Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.io/ceph/ceph:v17...
Ceph version: ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.50.0/23
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr not available, waiting (4/15)...
mgr not available, waiting (5/15)...
mgr not available, waiting (6/15)...
mgr not available, waiting (7/15)...
mgr not available, waiting (8/15)...
mgr not available, waiting (9/15)...
mgr not available, waiting (10/15)...
mgr not available, waiting (11/15)...
mgr not available, waiting (12/15)...
mgr not available, waiting (13/15)...
mgr is available
[ ... ]
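For readers unfamiliar with the wait loop behind the "(n/15)" messages above, here is a minimal shell sketch of that behaviour. This is not cephadm's actual implementation; the ceph mgr stat probe and the sleep spacing are assumptions for illustration only.

# Minimal sketch of the retry behaviour seen in the log; not cephadm's code.
# The probe command (ceph mgr stat) and the sleep interval are assumptions.
RETRY=15     # number of ticks (cephadm default per the comment below)
TIMEOUT=60   # overall budget in seconds (cephadm default per the comment below)
for n in $(seq 1 "$RETRY"); do
    if ceph mgr stat >/dev/null 2>&1; then
        echo "mgr is available"
        break
    fi
    echo "mgr not available, waiting ($n/$RETRY)..."
    sleep $(( TIMEOUT / RETRY ))
done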
Updated by Redouane Kachach Elhichou over 1 year ago
You can adjust the timeout by providing the --timeout argument (in seconds; default 60s), and you can also increase the --retry counter (default 15).
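For example, a re-run on a slow board might look like the sketch below. The values 120 and 30 are illustrative, and the placement of the options relative to the bootstrap subcommand is assumed, so verify with cephadm --help on your version.

# Illustrative values for a slow ARM board; adjust to taste.
# Option placement relative to the subcommand is assumed; verify with --help.
cephadm --timeout 120 --retry 30 bootstrap --mon-ip 192.168.50.36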
Updated by Redouane Kachach Elhichou over 1 year ago
- Category changed from orchestrator to cephadm (binary)
Updated by Andreas Elvers over 1 year ago
To be more specific on the machine setup: it is a 4 GB rock64 running Ubuntu 20.04 off a USB-attached SSD.
Updated by Redouane Kachach Elhichou over 1 year ago
Thanks. Did you try using the --timeout / --retry arguments to adapt the timeout to your specific case, and did these args help to solve your issue?
Updated by Andreas Elvers over 1 year ago
Redouane Kachach Elhichou wrote:
Thanks. Did you try using the --timeout / --retry arguments to adapt the timeout to your specific case, and did these args help to solve your issue?
It worked for me anyway; the mgr was created just in time. I am currently testing, so I will re-do the setup and set the retry count to 30. But I think raising the default timeout a bit could be helpful. Thanks for pointing out the --timeout option.
Updated by Redouane Kachach Elhichou over 1 year ago
Raising the default timeout may also lead to a slow reaction in case of a real issue. IMHO no changes are needed: the default timeout is already high enough for most use cases and is adjustable with the arguments I mentioned earlier. Since there's no issue with the current behavior, I'd suggest closing this tracker.
Updated by Redouane Kachach Elhichou over 1 year ago
- Status changed from New to Closed
Closing because the timeout is configurable and the default values are good enough.