Bug #45439
High CPU utilization for large clusters in ceph-mgr in 14.2.8
Status: open
Description
We recently upgraded our largest cluster, about 3500 OSDs, from Mimic to Nautilus (14.2.8). Since then ceph-mgr constantly sits at 100-200% CPU (1-2 cores) and becomes unresponsive after a few minutes. The finisher-Mgr queue length keeps growing (I've seen it at over 100k), similar to the symptoms many people reported with earlier Nautilus releases. This is what it looks like after an hour of running:
    "finisher-Mgr": {
        "queue_len": 66078,
        "complete_latency": {
            "avgcount": 21,
            "sum": 2098.408767721,
            "avgtime": 99.924227034
        }
    },
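These counters presumably come from the mgr admin socket (`ceph daemon mgr.<id> perf dump`). A small sketch of how one might watch the finisher-Mgr backlog from that JSON output (the sample payload below just reproduces the fragment quoted above; the daemon name is a placeholder):

```python
import json

# Fragment of a perf dump matching the counters quoted above.
# In practice this JSON would come from:
#   ceph daemon mgr.<id> perf dump
sample = """
{
  "finisher-Mgr": {
    "queue_len": 66078,
    "complete_latency": {
      "avgcount": 21,
      "sum": 2098.408767721,
      "avgtime": 99.924227034
    }
  }
}
"""

def finisher_mgr_stats(perf_dump_json: str) -> tuple[int, float]:
    """Return (queue_len, avg completion latency in seconds)."""
    perf = json.loads(perf_dump_json)
    fin = perf["finisher-Mgr"]
    return fin["queue_len"], fin["complete_latency"]["avgtime"]

queue_len, avg_latency = finisher_mgr_stats(sample)
print(f"finisher-Mgr queue_len={queue_len}, avg completion latency={avg_latency:.1f}s")
```

A steadily growing `queue_len` combined with a near-100-second average completion latency is what made the backlog obvious here.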
We have a pretty vanilla manager config; here are the active modules:
"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator_cli",
"progress",
"rbd_support",
"status",
"volumes"
],
"enabled_modules": [
"restful"
],
After some investigation, it looks like on our large cluster ceph-mgr cannot keep up with the status updates from roughly 3500 OSDs. By default each OSD sends an update to ceph-mgr every 5 seconds, which in our case works out to about 700 messages/s. From gdb traces it appears that ceph-mgr runs some Python code for each of them, so 700 Python snippets/s might be too much. Increasing mgr_stats_period to 15 seconds reduces the load and makes ceph-mgr responsive again.
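The arithmetic behind those numbers is simple: the aggregate report rate is the OSD count divided by the reporting period. A quick sketch (the helper name is mine, not a Ceph API):

```python
# Rough per-second OSD -> ceph-mgr stats traffic, assuming every OSD
# sends exactly one report per mgr_stats_period seconds.
def mgr_report_rate(num_osds: int, mgr_stats_period: float) -> float:
    return num_osds / mgr_stats_period

print(mgr_report_rate(3500, 5))   # default 5 s period -> 700.0 messages/s
print(mgr_report_rate(3500, 15))  # raised period -> ~233 messages/s
```

Raising the period from 5 s to 15 s cuts the message rate (and the per-message Python work in ceph-mgr) to a third.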
I also checked our other clusters: their ceph-mgr load is roughly proportionally lower, in line with their smaller OSD counts.
Attached are the gdbpmp trace of ceph-mgr and an osd map for the cluster. Thread 38 seems to be a pretty busy one.