Bug #45439

open

High CPU utilization for large clusters in ceph-mgr in 14.2.8

Added by Andras Pataki about 4 years ago. Updated about 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We recently upgraded our largest cluster, which has about 3500 OSDs, from Mimic to Nautilus (14.2.8). Since then ceph-mgr is constantly at 100-200% CPU (1-2 cores) and becomes unresponsive after a few minutes. The finisher-Mgr queue length keeps growing (I've seen it at over 100k), similar to the symptoms many people saw with earlier Nautilus releases. This is what it looks like after an hour of running:

    "finisher-Mgr": {
        "queue_len": 66078,
        "complete_latency": {
            "avgcount": 21,
            "sum": 2098.408767721,
            "avgtime": 99.924227034
        }
    },
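To make the counters above easier to read, here is a minimal sketch that recomputes the mean completion latency as sum / avgcount. The values are copied from the output above and wrapped in braces so they parse as JSON. The helper name `summarize_finisher` is made up for illustration, and treating the latency as seconds is an assumption about the counter's unit.

```python
import json

# Perf counter snippet from this report, wrapped in braces to form valid JSON.
raw = """
{
    "finisher-Mgr": {
        "queue_len": 66078,
        "complete_latency": {
            "avgcount": 21,
            "sum": 2098.408767721,
            "avgtime": 99.924227034
        }
    }
}
"""


def summarize_finisher(counters, name="finisher-Mgr"):
    """Return the queue length and mean completion latency for one finisher.

    Hypothetical helper, not part of Ceph. The mean is sum / avgcount and is
    assumed to be in seconds.
    """
    finisher = counters[name]
    latency = finisher["complete_latency"]
    count = latency["avgcount"]
    mean = latency["sum"] / count if count else 0.0
    return {"queue_len": finisher["queue_len"], "mean_latency_s": mean}


summary = summarize_finisher(json.loads(raw))
print(summary)
```

If the unit assumption holds, only 21 completions were recorded in this sample, each taking roughly 100 seconds on average, while about 66k items were still waiting in the queue.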

We have a pretty vanilla manager config. Here are the enabled modules:

    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator_cli",
        "progress",
        "rbd_support",
        "status",
        "volumes"
    ],
    "enabled_modules": [
        "restful"
    ],

After some investigation, it looks like ceph-mgr on our large cluster is not able to keep up with the status updates from about 3500 OSDs. By default, OSDs send updates to ceph-mgr every 5 seconds, which in our case works out to about 700 messages/s. From the gdb traces, it looks like ceph-mgr runs some Python code for each message, so 700 Python snippets/s might be too much. Increasing mgr_stats_period to 15 seconds reduces the load and makes ceph-mgr responsive again.
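The message-rate estimate can be sketched as simple arithmetic. This assumes every OSD sends exactly one report per stats period, which is a simplification. The function name `osd_report_rate` is made up for illustration.

```python
def osd_report_rate(num_osds, stats_period_s):
    """Estimate the reports per second arriving at ceph-mgr.

    Assumes each OSD sends one report every stats_period_s seconds.
    """
    if stats_period_s <= 0:
        raise ValueError("stats_period_s must be positive")
    return num_osds / stats_period_s


# 3500 OSDs at the default 5 s period, and at the 15 s period tried here.
for period in (5, 15):
    rate = osd_report_rate(3500, period)
    print(f"mgr_stats_period={period}s -> about {rate:.0f} messages/s")
```

Under that assumption, going from 5 to 15 seconds cuts the incoming rate from about 700 to about 233 messages/s, which is consistent with the reduced load observed after the change.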

I also checked our other clusters: their ceph-mgr load is lower, roughly in proportion to their OSD counts.

Attached are the gdbpmp trace of ceph-mgr and an osd map for the cluster. Thread 38 seems to be a pretty busy one.


Files

gdbpmp-ceph-mgr.txt.gz (12.1 KB), gdbpmp trace of ceph-mgr, Andras Pataki, 05/08/2020 01:42 AM
osdmap.gz (194 KB), osd map, Andras Pataki, 05/08/2020 01:45 AM

Related issues 2 (0 open, 2 closed)

Related to mgr - Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5 (Duplicate)
Related to mgr - Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors (Resolved)