Bug #45439 (open): High CPU utilization for large clusters in ceph-mgr in 14.2.8

Added by Andras Pataki almost 4 years ago. Updated almost 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We upgraded our largest cluster from Mimic to Nautilus (14.2.8) recently - it has about 3500 OSDs. Now ceph-mgr is constantly at 100-200% CPU (1-2 cores) and becomes unresponsive after a few minutes. The finisher-Mgr queue length grows (I've seen it at over 100k) - similar symptoms to what many users saw with earlier Nautilus releases. This is what it looks like after an hour of running:

"finisher-Mgr": {
"queue_len": 66078,
"complete_latency": {
"avgcount": 21,
"sum": 2098.408767721,
"avgtime": 99.924227034
}
},
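
For reference, counters like the ones above can be pulled from the mgr admin socket; a minimal sketch, assuming the active daemon id is "a" and jq is available:

    # Dump all mgr perf counters via the admin socket and keep only the
    # finisher-Mgr section ("mgr.a" is a placeholder daemon id).
    ceph daemon mgr.a perf dump | jq '."finisher-Mgr"'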

We have a pretty vanilla manager config; here are the enabled modules:

"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator_cli",
"progress",
"rbd_support",
"status",
"volumes"
],
"enabled_modules": [
"restful"
],
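
The module lists above can be confirmed from the CLI with the standard mgr command:

    # Show mgr modules; the JSON output includes "always_on_modules" and
    # "enabled_modules" sections like the ones quoted above.
    ceph mgr module ls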

After some investigation, it looks like on our large cluster ceph-mgr is not able to keep up with the status updates from about 3500 OSDs. By default, OSDs send updates to ceph-mgr every 5 seconds, which in our case works out to about 700 messages/s to ceph-mgr. From the gdb traces it looks like ceph-mgr runs some Python code for each of them, so 700 Python snippets/s might be too much. Increasing mgr_stats_period to 15 seconds reduces the load and brings ceph-mgr back to being responsive.
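
A rough sketch of that arithmetic and the workaround, assuming the option is applied via the Nautilus central config (it could equally go into ceph.conf), with the value as an example:

    # ~3500 OSDs reporting every 5 s is roughly 3500 / 5 = 700 reports/s;
    # at a 15 s period that drops to roughly 3500 / 15 ≈ 233 reports/s.
    # Raise the reporting period for all OSDs:
    ceph config set osd mgr_stats_period 15
    # Confirm the value stored in the central config:
    ceph config get osd mgr_stats_period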

I also checked our other clusters, and they show proportionately lower load on ceph-mgr in line with their OSD counts.

Attached are the gdbpmp trace of ceph-mgr and an osd map for the cluster. Thread 38 seems to be a pretty busy one.
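
For anyone trying to reproduce the trace, a rough sketch of a gdbpmp capture (https://github.com/markhpc/gdbpmp); the flags and sample count here are assumptions to check against the tool's README:

    # Attach to the running ceph-mgr, take 100 samples, and save the profile.
    ./gdbpmp.py -p $(pidof ceph-mgr) -n 100 -o gdbpmp-ceph-mgr.data
    # Load the saved profile and print the sampled call tree.
    ./gdbpmp.py -i gdbpmp-ceph-mgr.data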


Files

gdbpmp-ceph-mgr.txt.gz (12.1 KB) - gdbpmp trace of ceph-mgr - Andras Pataki, 05/08/2020 01:42 AM
osdmap.gz (194 KB) - osd map - Andras Pataki, 05/08/2020 01:45 AM

Related issues (2: 0 open, 2 closed)

Related to mgr - Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5 (Duplicate)
Related to mgr - Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors (Resolved)
#1 - Updated by Josh Durgin almost 4 years ago

From the gdbpmp output it looks like dump_pg_stats is causing a bottleneck in the memory allocator, with other threads waiting on tcmalloc locks. This is likely coming from the balancer's inefficient use of the pg map, which will be improved by https://github.com/ceph/ceph/pull/34356; that change should land in the next Nautilus point release.

The other issue you've already identified on the mailing list: your osdmap has no upmap entries, and it looks like the multi-step choose rule is not being accounted for properly by the balancer, so all the mappings it comes up with are cancelled as invalid for not matching the crush rule. Let's use this issue to track that problem.
For the record, the crush rule here is:

        step take root-disk
        step choose indep 3 type pod
        step choose indep 3 type rack
        step chooseleaf indep 1 type osd
        step emit
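
One way to reproduce the balancer's proposals offline against this rule is to run the attached map through osdmaptool; a sketch, with the pool name and optimization limit as placeholders:

    # Ask the offline upmap optimizer for proposals against the attached map
    # and write them out as CLI commands for inspection.
    gunzip -k osdmap.gz
    osdmaptool osdmap --upmap proposed-upmaps.sh --upmap-pool <pool> --upmap-max 10
    # Compare the proposed pg-upmap-items against the pod/rack/osd steps of
    # the rule above to see which mappings would violate it.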

In the meantime you can disable the balancer module for this cluster, since it's just eating CPU right now. If you still see high CPU usage with the balancer disabled, the progress module could also be a culprit - it also consumes each pg map inefficiently (less so once https://github.com/ceph/ceph/pull/34356 is merged).

#2 - Updated by Andras Pataki almost 4 years ago

The balancer did create about 2900 upmap entries successfully before it got stuck in a loop proposing invalid ones. To get back to a sane state, I removed all the upmaps it created and disabled it (ceph balancer off). The osdmap and gdbpmp trace were taken after the upmap entries were removed and the balancer was turned off. It looks like 'ceph mgr module disable balancer' doesn't work in Nautilus (balancer is an always-on module) - same for the 'progress' module. Is there anything more to be done to disable it? Unfortunately, the high CPU utilization persists.
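
For the record, the upmap cleanup described above amounts to something like the following, with the pg id as a placeholder:

    # List the upmap exceptions currently in the osdmap.
    ceph osd dump | grep pg_upmap_items
    # Remove one entry per PG the balancer created; repeat for each listed PG.
    ceph osd rm-pg-upmap-items <pgid>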

#3 - Updated by Josh Durgin almost 4 years ago

You can turn off the balancer via 'ceph balancer off' - the module will still be loaded, but not active.
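
A quick way to confirm that state with the standard balancer CLI:

    # The status output should show the balancer as not active while disabled.
    ceph balancer status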

#4 - Updated by Josh Durgin almost 4 years ago

If you've already done that, reducing the stats reporting rate (i.e. increasing mgr_stats_period, as you did) is probably the next best thing.

#5 - Updated by Lenz Grimmer almost 4 years ago

  • Related to Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5 added

#6 - Updated by Lenz Grimmer almost 4 years ago

  • Related to Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added