Bug #58956
Status: open
Error while deploying Ceph Quincy via ceph-ansible-stable-7 on Rocky 9
Description
Hi,
I am trying to deploy Ceph Quincy using ceph-ansible-stable-7 on Rocky 9.
I am using a containerized deployment on bare metal. My setup consists of 3 controller nodes and 27 storage nodes, each with 4 NVMe disks as OSDs.
I am using a 10 Gb/s network.
PS: the same hardware was used to deploy Ceph Pacific via ceph-ansible without any problems.
The deployment starts without issues until the task that starts the OSD containers. At that point the monitors lose quorum and enter a vicious circle of electing and re-electing a monitor leader. The deployment then slows down and eventually exits with errors.
The monitor containers on two of my controllers show 100% of CPU usage.
I thought it was a resource problem, so I increased the number of CPUs allocated to the monitor containers from 1 to 4, but this didn't change anything. I also tried different Docker versions, and then tried deploying Pacific instead; the same behavior occurs.
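One way to watch the re-election loop described above is to poll `ceph quorum_status --format json` and compare successive election epochs. A minimal sketch of parsing that output (the mon names and epoch below are hypothetical sample values, trimmed to the relevant fields):

```python
import json

# Hypothetical, trimmed sample of `ceph quorum_status --format json` output.
sample = '''
{"election_epoch": 847, "quorum": [0, 2],
 "quorum_names": ["ctrl1", "ctrl3"],
 "quorum_leader_name": "ctrl1"}
'''

def summarize(status_json: str) -> str:
    """Condense quorum status into one line for easy diffing between polls."""
    s = json.loads(status_json)
    return (f"epoch={s['election_epoch']} "
            f"leader={s['quorum_leader_name']} "
            f"quorum={','.join(s['quorum_names'])}")

print(summarize(sample))
# A rapidly climbing election_epoch between successive polls, with the
# leader name flapping, matches the re-election loop described above.
```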
Regards.
Updated by Patrick Preisler about 1 year ago
I deployed Ceph Quincy v17.2.5 with cephadm on Rocky 9 and I can see the same behaviour.
The initial deployment works, but as soon as you start adding OSDs, the leader MON container goes to 100% CPU and stops responding. As soon as the election process starts, the new leader MON goes to 100% CPU and stops responding as well. After some time the first MON becomes responsive again and a new election starts, but the newly elected leader immediately goes to 100% CPU again, making the cluster unusable.
Resources of the MON containers are unrestricted, but it seems that only one core gets utilized.
Updated by Wodel Youchi about 1 year ago
Hi,
After switching from Docker to Podman, the problem no longer manifested itself, and I was able to deploy the cluster.
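For anyone wanting to try the same workaround: a minimal sketch of how the container engine can be selected in ceph-ansible, assuming the `containerized_deployment` and `container_binary` variables available in the stable branches (check your branch's group_vars/all.yml.sample for the exact names, as they may differ):

```yaml
# group_vars/all.yml (hypothetical excerpt)
containerized_deployment: true
# Force podman instead of letting ceph-ansible auto-detect docker:
container_binary: podman
```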
Regards.
Updated by Patrick Preisler about 1 month ago
Hi,
We deployed a second cluster (v17.2.7) a few weeks ago with Docker on Rocky 9, and the problem did not occur again. I guess this has been fixed.
We also upgraded the first cluster to Rocky 9, and there are no problems there either.