Bug #45439
High CPU utilization for large clusters in ceph-mgr in 14.2.8
Status: open
Description
We recently upgraded our largest cluster, about 3500 OSDs, from Mimic to Nautilus (14.2.8). Since then ceph-mgr constantly sits at 100-200% CPU (1-2 cores) and becomes unresponsive after a few minutes. The finisher-Mgr queue length keeps growing (I've seen it at over 100k), similar to the symptoms many people reported with earlier Nautilus releases. This is what it looks like after an hour of running:
    "finisher-Mgr": {
        "queue_len": 66078,
        "complete_latency": {
            "avgcount": 21,
            "sum": 2098.408767721,
            "avgtime": 99.924227034
        }
    },
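These counters presumably come from the mgr admin socket (`ceph daemon mgr.<id> perf dump`). A small sketch of how one might watch the finisher-Mgr backlog from that JSON output (the sample payload below just reproduces the fragment quoted above; the daemon name is a placeholder):

```python
import json

# Fragment of a perf dump matching the counters quoted above.
# In practice this JSON would come from:
#   ceph daemon mgr.<id> perf dump
sample = """
{
  "finisher-Mgr": {
    "queue_len": 66078,
    "complete_latency": {
      "avgcount": 21,
      "sum": 2098.408767721,
      "avgtime": 99.924227034
    }
  }
}
"""

def finisher_mgr_stats(perf_dump_json: str) -> tuple[int, float]:
    """Return (queue_len, avg completion latency in seconds)."""
    perf = json.loads(perf_dump_json)
    fin = perf["finisher-Mgr"]
    return fin["queue_len"], fin["complete_latency"]["avgtime"]

queue_len, avg_latency = finisher_mgr_stats(sample)
print(f"finisher-Mgr queue_len={queue_len}, avg completion latency={avg_latency:.1f}s")
```

A steadily growing `queue_len` combined with a near-100-second average completion latency is what made the backlog obvious here.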
We have a pretty vanilla manager config; here are the active modules:
"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator_cli",
"progress",
"rbd_support",
"status",
"volumes"
],
"enabled_modules": [
"restful"
],
After some investigation, it looks like on our large cluster ceph-mgr cannot keep up with the status updates from roughly 3500 OSDs. By default each OSD sends an update to ceph-mgr every 5 seconds, which in our case works out to about 700 messages/s. From gdb traces it appears that ceph-mgr runs some Python code for each of them, so 700 Python snippets/s might be too much. Increasing mgr_stats_period to 15 seconds reduces the load and makes ceph-mgr responsive again.
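The arithmetic behind those numbers is simple: the aggregate report rate is the OSD count divided by the reporting period. A quick sketch (the helper name is mine, not a Ceph API):

```python
# Rough per-second OSD -> ceph-mgr stats traffic, assuming every OSD
# sends exactly one report per mgr_stats_period seconds.
def mgr_report_rate(num_osds: int, mgr_stats_period: float) -> float:
    return num_osds / mgr_stats_period

print(mgr_report_rate(3500, 5))   # default 5 s period -> 700.0 messages/s
print(mgr_report_rate(3500, 15))  # raised period -> ~233 messages/s
```

Raising the period from 5 s to 15 s cuts the message rate (and the per-message Python work in ceph-mgr) to a third.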
I also checked our other clusters: their ceph-mgr load is roughly proportionally lower, in line with their smaller OSD counts.
Attached are the gdbpmp trace of ceph-mgr and an osd map for the cluster. Thread 38 seems to be a pretty busy one.