mgr/insights: mgr consumes excessive amounts of memory
Description of problem
After replacing a failed osd, and during the ongoing recovery of the degraded pgs (roughly 4 TB, ongoing for about 2 days), there have been a couple of OOM kills, and the mgr continues to consume lots of memory (15 GB, increasing rather quickly).
Furthermore, the logfile is growing at a similar pace, since it keeps logging all placement groups along with an error that it can't dump the pg state into mgr/insights/health_history/2021-07-11_19:
mgr set_store mon returned -27: error: entry size limited to 65536 bytes. Use 'mon config key max entry size' to manually adjust
The thing it tries to dump is a 22k-line JSON (when formatted), which contains a history of all placement groups since some point in time (I couldn't pin that down, but the same message is included multiple times; see attached JSON).
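For reference, the entry the module last managed to store can be inspected directly from the mon config-key store; the key path is taken from the log line above, and the 65536-byte cap corresponds to the mon_config_key_max_entry_size option the error message refers to. A minimal sketch (raising the cap is only a stopgap, since the entry keeps growing):

# inspect what was last stored under that key, and count its formatted lines
ceph config-key get mgr/insights/health_history/2021-07-11_19 | python3 -m json.tool | wc -l

# check the current cap and (as a stopgap only) raise it, e.g. to 256 KiB
ceph config get mon mon_config_key_max_entry_size
ceph config set mon mon_config_key_max_entry_size 262144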
ceph version string: ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
- Platform (OS/distro/release): CentOS Linux release 8.4.2105
- Cluster details (nodes, monitors, OSDs): 6 nodes, 5 mons, 42 osds
Seems to be ongoing. Restarting the mgr does not fix it (see the bumps in the graph; those were me restarting the mgr).
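As a workaround while this is unresolved, disabling the insights module (or pruning its health history, if this build ships a prune command) should stop both the memory growth and the failing dumps; a sketch, assuming the standard mgr module commands:

# stop the module entirely
ceph mgr module disable insights

# or, if available in this build, drop stored health history older than N hours (0 = all of it)
ceph insights prune-health 0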
#2 Updated by Thore K over 2 years ago
I've been able to reproduce this through the following operations:
systemctl stop ceph-osd@10
ceph osd out 10
ceph osd destroy 10 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sda --destroy
ceph-volume lvm create --osd-id 10 --bluestore --crush-device-class=hdd --dmcrypt --data /dev/sda --block.db /dev/sdh4
And again, while the backfilling takes place, the mgr's memory consumption grows rapidly.
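For anyone trying to confirm the reproduction, the growth is visible with nothing more than plain ps on the mgr host; a minimal sketch (no Ceph-specific tooling assumed):

# sample the active mgr's resident set size once a minute while backfill runs
watch -n 60 'ps -o pid,rss,etime,cmd -C ceph-mgr'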