Bug #59580

open

memory leak (RESTful module, maybe others?)

Added by Greg Farnum about 1 year ago. Updated 17 days ago.

Status: Pending Backport
Priority: Urgent
Category: restful module
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: pacific, quincy, reef
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Two separate reports on the ceph-users mailing list describe memory leaks in mgr modules:

[ceph-users] Memory leak in MGR after upgrading to pacific

After upgrading from Octopus (15.2.17) to Pacific (16.2.12) two days 
ago, I'm noticing that the MGR daemons keep failing over to standby and 
then back every 24 hours. Watching the output of 'ceph orch ps', I can see 
that the memory consumption of the mgr is steadily growing until it 
becomes unresponsive.

When the mgr becomes unresponsive, tasks such as RESTful calls start to 
fail, and the standby eventually takes over after ~20 minutes. I've 
included a log of memory consumption (in 10 minute intervals) at the end 
of this message. While the cluster recovers from this issue, the loss 
of usage data during the outage, and the fact that it is occurring at 
all, are problematic. Any assistance would be appreciated.

Note, this is a cluster that has been upgraded from an original 
Jewel-based Ceph using filestore, through bluestore conversion, container 
conversion, and now to Pacific. The data below shows memory use with 
three mgr modules enabled: cephadm, restful, iostat. By disabling 
iostat, I can reduce the rate of memory growth to about 
200 MB/hr.
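
For anyone trying to compare leak rates between clusters, a growth rate like the one quoted above can be estimated from periodic memory samples. The following is only a minimal sketch: the function names are hypothetical, the sample values in the usage note are made up for illustration, and the memory log mentioned in the report is not attached to this ticket.

```python
from typing import Sequence


def growth_rate_mb_per_hour(samples_mb: Sequence[float],
                            interval_minutes: float = 10.0) -> float:
    """Least-squares slope of evenly spaced memory samples, in MB per hour."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    # Sample times in hours: 0, interval, 2 * interval, ...
    times = [i * interval_minutes / 60.0 for i in range(n)]
    mean_t = sum(times) / n
    mean_m = sum(samples_mb) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in zip(times, samples_mb))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def hours_until(limit_mb: float, current_mb: float, rate_mb_per_hour: float) -> float:
    """Hours until current_mb reaches limit_mb at a constant growth rate."""
    if rate_mb_per_hour <= 0:
        return float("inf")  # not growing, so the limit is never reached
    return max(0.0, (limit_mb - current_mb) / rate_mb_per_hour)
```

With hypothetical samples of 1000, 1050, 1100 and 1150 MB taken 10 minutes apart, `growth_rate_mb_per_hour` gives 300 MB/hr, and `hours_until(4000, 1000, 300)` gives 10 hours before a 4000 MB limit would be hit.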

[ceph-users] MGR Memory Leak in Restful

We've hit a memory leak in the Manager Restful interface, in versions 
17.2.5 & 17.2.6. On our main production cluster the active MGR grew to 
about 60G until the oom_reaper killed it, causing a successful failover 
and restart of the failed one. We can then see that the problem is 
recurring, actually on all 3 of our clusters.

We've traced this to when we enabled full Ceph monitoring by Zabbix last 
week. The leak is about 20GB per day, and seems to be proportional to 
the number of PGs. For some time we just had the default settings, and 
no memory leak, but had not got around to finding why many of the Zabbix 
items were showing as Access Denied. We traced this to the MGR's MON 
CAPS which were "mon 'profile mgr'".

The MON logs showed recurring:

log_channel(audit) log [DBG] : from='mgr.284576436 192.168.xxx.xxx:0/2356365' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]:  access denied

Changing the MGR CAPS to "mon 'allow *'" and restarting the MGR 
immediately allowed that to work, and all the follow-on REST calls worked.

log_channel(audit) log [DBG] : from='mgr.283590200 192.168.xxx.xxx:0/1779' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]: dispatch

However it has also caused the memory leak to start.

We've reverted the CAPS and are back to how we were.
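
The two audit log lines quoted above differ only in their trailing outcome ("access denied" vs "dispatch"). A rough sketch for pulling the entity, command prefix and outcome out of such lines is shown here. It assumes the line layout is exactly as quoted in this report; the regular expression and the function name are illustrative, not part of Ceph.

```python
import json
import re
from typing import Optional

# Matches the tail of an audit line as quoted in the report:
#   from='...' entity='...' cmd=[{...}]: <outcome>
AUDIT_RE = re.compile(
    r"from='(?P<source>[^']*)'\s+entity='(?P<entity>[^']*)'\s+"
    r"cmd=(?P<cmd>\[.*\]):\s*(?P<outcome>.+?)\s*$"
)


def parse_audit_line(line: str) -> Optional[dict]:
    """Return entity, command prefixes and outcome, or None if the line does not match."""
    m = AUDIT_RE.search(line)
    if m is None:
        return None
    try:
        commands = json.loads(m.group("cmd"))  # cmd is a JSON list of objects
    except json.JSONDecodeError:
        return None
    prefixes = [c.get("prefix") for c in commands if isinstance(c, dict)]
    return {
        "entity": m.group("entity"),
        "prefixes": prefixes,
        "outcome": m.group("outcome"),
    }
```

Filtering MON log lines through this and counting results whose outcome is "access denied" per entity would show whether the mgr is still being refused "pg dump" after a caps change.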


Files

0001-mgr-restful-trim-reslts-finished-and-failed-lists-to.patch (1.69 KB), Nitzan Mordechai, 09/21/2023 11:37 AM
0001-mgr-restful-trim-reslts-finished-and-failed-lists-to.patch (2.06 KB), Nitzan Mordechai, 09/26/2023 11:25 AM
massif.out.3376365.gz (96.8 KB): mgr handling rest calls, Chris Palmer, 10/17/2023 04:11 PM
20231227-150450.jpg (61.7 KB): node exporter shows memory, xiaobao wen, 12/27/2023 07:05 AM
mgr_rgw_log.tar.gz (962 KB): log for mgr and rgw, xiaobao wen, 12/27/2023 07:42 AM
ceph-mgr-oomcrash-16-2-15.txt (34.5 KB), A. Saber Shenouda, 04/14/2024 02:42 PM
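
Judging only from the attachment name, the proposed patch trims the lists of finished and failed results kept by the restful module. The patch contents are not reproduced in this ticket, so the following is just a generic sketch of bounding such lists. The class name, method name and the cap of 100 are hypothetical and are not taken from the patch or from the module's code.

```python
from collections import deque


class BoundedResults:
    """Keep only the most recent results so memory use stays bounded."""

    def __init__(self, max_kept: int = 100):  # cap is an arbitrary, illustrative value
        # A deque with maxlen silently drops the oldest entry when full.
        self.finished = deque(maxlen=max_kept)
        self.failed = deque(maxlen=max_kept)

    def record(self, result, ok: bool) -> None:
        """Store a completed result in the finished or failed list."""
        (self.finished if ok else self.failed).append(result)
```

Without a cap of this kind, every completed request would leave an entry behind for the lifetime of the daemon, which would match steady growth under frequent polling such as the Zabbix setup described above.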

Related issues: 3 (2 open, 1 closed)

Copied to mgr - Backport #63977: reef: memory leak (RESTful module, maybe others?), In Progress, Konstantin Shalygin
Copied to mgr - Backport #63978: pacific: memory leak (RESTful module, maybe others?), Resolved, Konstantin Shalygin
Copied to mgr - Backport #63979: quincy: memory leak (RESTful module, maybe others?), New, Nitzan Mordechai