Bug #58956
Status: open
Error while deploying Ceph Quincy via ceph-ansible-stable-7 on Rocky 9
Description
Hi,
I am trying to deploy Ceph Quincy using ceph-ansible-stable-7 on Rocky 9.
I am using a containerized deployment on bare metal. My setup consists of 3 controller nodes and 27 storage nodes, each with 4 NVMe disks as OSDs.
I am using a 10 Gb/s network.
PS: the same hardware was used to deploy Ceph Pacific via ceph-ansible without any problems.
The deployment starts without issues until the task that starts the OSD containers. At that point the monitors lose quorum and enter a vicious circle of electing and re-electing a monitor leader. The deployment then slows down and eventually exits with errors.
The monitor containers on two of my controllers show 100% of CPU usage.
I thought it was a resource problem, so I increased the number of CPUs allocated to the monitor containers from 1 to 4, but this didn't change anything. I also tried different Docker versions, and then tried deploying Pacific instead; the same behavior occurs.
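One way to watch the re-election loop described above is to poll `ceph quorum_status --format json` and compare successive election epochs. A minimal sketch of parsing that output (the mon names and epoch below are hypothetical sample values, trimmed to the relevant fields):

```python
import json

# Hypothetical, trimmed sample of `ceph quorum_status --format json` output.
sample = '''
{"election_epoch": 847, "quorum": [0, 2],
 "quorum_names": ["ctrl1", "ctrl3"],
 "quorum_leader_name": "ctrl1"}
'''

def summarize(status_json: str) -> str:
    """Condense quorum status into one line for easy diffing between polls."""
    s = json.loads(status_json)
    return (f"epoch={s['election_epoch']} "
            f"leader={s['quorum_leader_name']} "
            f"quorum={','.join(s['quorum_names'])}")

print(summarize(sample))
# A rapidly climbing election_epoch between successive polls, with the
# leader name flapping, matches the re-election loop described above.
```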
Regards.
Updated by Patrick Preisler about 1 year ago
I deployed Ceph Quincy v17.2.5 with cephadm on Rocky 9 and I can see the same behaviour.
The initial deployment works, but as soon as you start adding OSDs, the leader MON container goes to 100% CPU and stops responding. As soon as the election process starts, the new leader MON goes to 100% CPU and stops responding as well. After some time the first MON becomes responsive again and a new election starts, but the newly elected leader immediately goes to 100% CPU again, making the cluster unusable.
Resources of the MON containers are unrestricted, but it seems that only one core gets utilized.
Updated by Wodel Youchi about 1 year ago
Hi,
After switching from Docker to Podman, the problem no longer manifested itself, and I was able to deploy the cluster.
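For anyone wanting to try the same workaround: a minimal sketch of how the container engine can be selected in ceph-ansible, assuming the `containerized_deployment` and `container_binary` variables available in the stable branches (check your branch's group_vars/all.yml.sample for the exact names, as they may differ):

```yaml
# group_vars/all.yml (hypothetical excerpt)
containerized_deployment: true
# Force podman instead of letting ceph-ansible auto-detect docker:
container_binary: podman
```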
Regards.
Updated by Patrick Preisler about 1 month ago
Hi,
We deployed a second cluster (v17.2.7) a few weeks ago with Docker on Rocky 9, and the problem did not occur again. I guess this has been fixed.
We also upgraded the first cluster to Rocky 9, and there are no problems there either.