Bug #58956


Error while deploying Ceph Quincy via ceph-ansible-stable-7 on Rocky 9

Added by Wodel Youchi about 1 year ago. Updated 29 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-ansible
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I am trying to deploy Ceph Quincy using Ceph-ansible-stable-7 on Rocky 9.
I am using a containerized deployment on bare metal. My setup comprises 3 controller nodes and 27 storage nodes, each storage node having 4 NVMe disks as OSDs.
I am using a 10 Gb/s network.

PS: the same hardware was used to deploy Ceph Pacific via ceph-ansible without any problems.

The deployment starts without issues until the task that starts the OSD containers; at that point the monitors lose quorum and enter a vicious circle of electing and re-electing a leader. The deployment then slows down and eventually exits with errors.

The monitor containers on two of my controllers show 100% CPU usage.
I thought it was a resource problem, so I increased the number of CPUs allocated to the monitor containers from 1 to 4, but that didn't change anything. I also tried different Docker versions, then tried deploying Pacific instead; the same behavior occurred.
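For reference, a CPU change like the one described above is normally made through the container resource variables in ceph-ansible's group_vars rather than on the containers directly. A minimal sketch, assuming the variable names from ceph-ansible's `group_vars/all.yml.sample` defaults (verify against your stable-7 checkout):

```yaml
# group_vars/all.yml — hedged sketch; variable names assumed from
# ceph-ansible's group_vars/all.yml.sample defaults
ceph_mon_docker_cpu_limit: 4        # CPUs for each mon container (default is 1)
ceph_mon_docker_memory_limit: "3g"  # memory cap for each mon container
```

Re-running the playbook after changing these values recreates the mon containers with the new limits.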

Regards.


Files

ceph-mon.txt.tgz (550 KB) Wodel Youchi, 03/11/2023 11:55 PM
Actions #1

Updated by Patrick Preisler about 1 year ago

I deployed Ceph Quincy v17.2.5 with cephadm on Rocky 9 and I can see the same behaviour.

The initial deployment works, but as soon as you start adding OSDs the leader MON container goes to 100% CPU and stops responding. As soon as the election process starts, the new leader MON goes to 100% CPU as well and stops responding too. After some time the first MON becomes responsive again and a new election starts, but the newly elected leader immediately goes to 100% CPU again, making the cluster unusable.
Resources of the MON containers are unrestricted, but it seems that only one core gets utilized.

Actions #2

Updated by Wodel Youchi about 1 year ago

Hi,

Using Podman instead of Docker, the problem did not manifest itself, and I was able to deploy the cluster.

Regards.

Actions #3

Updated by Patrick Preisler 29 days ago

Hi,

we deployed a second cluster (v17.2.7) a few weeks ago with Docker + Rocky 9 and the problem did not occur again. I guess this is fixed then.
We also upgraded the first cluster to Rocky 9, and there are no problems either.

