Bug #35998

closed

ceph-mgr active daemon memory leak since mimic

Added by Tomasz Sętkowski over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am fairly sure I have been seeing a memory-leak-like pattern on the active ceph-mgr daemon since I upgraded the cluster from luminous to mimic v13.2.0.

I have attached a screenshot of the memory graph for the host running the active mgr. After a few hours the daemon can use a few GB of RAM. Usage grows linearly until the daemon is restarted.
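One way to put a number on the linear growth is to sample the daemon's resident set size at intervals and fit a straight line through the samples. The following is a minimal sketch, assuming a Linux host where the text of `/proc/<pid>/status` for the ceph-mgr process can be read; the function names `parse_vmrss_kib` and `growth_rate` are hypothetical helpers and not part of Ceph.

```python
import re


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status.

    Returns None if no VmRSS line is present.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_rate(samples):
    """Least-squares slope of (seconds, KiB) samples, in KiB per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den
```

Feeding it one sample per minute for an hour and multiplying the slope by 3600 would give an approximate growth figure in KiB per hour, which is easier to compare across versions than a screenshot.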

I build Ceph from source, using the official v13.2.0 release with an additional CMake parameter to use jemalloc (-DALLOCATOR=jemalloc).
The deployment uses the openstack-kolla project. Each daemon runs in Docker with --net=host.
The cluster is operational, although I am seeing exactly 240 messages per minute like this one from the host with the active mgr:

-- 192.168.200.49:6802/7 >> 192.168.200.41:0/7 conn(0x7f77f7b21d00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

There was definitely no such leak and no such log messages in luminous.

I have found a few sources saying this message is normal, but I wonder whether it is still normal at a rate of 240/min, considering that the cluster is small: it has only 8 OSDs, 3 MONs, and 3 MGRs.
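The per-minute rate of these messages can be checked directly from the log. Below is a minimal sketch, assuming each log line starts with a timestamp of the form `YYYY-MM-DD HH:MM:SS...` (the excerpt above omits the timestamp, so this prefix is an assumption); `messages_per_minute` is a hypothetical helper, not a Ceph tool.

```python
from collections import Counter

# Substring taken from the log message quoted in this report.
MARKER = "accept replacing existing (lossy) channel"


def messages_per_minute(lines, marker=MARKER):
    """Count lines containing `marker`, grouped by their minute prefix.

    Assumes each line begins with "YYYY-MM-DD HH:MM", so the first
    16 characters identify the minute.
    """
    counts = Counter()
    for line in lines:
        if marker in line:
            counts[line[:16]] += 1
    return counts
```

Passing an open log file object as `lines` yields a mapping from minute to message count, which shows whether the rate is constant or comes in bursts.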

I can debug this further, but please provide some tips on how to approach it.

Can I safely run ceph-mgr under valgrind in production, and would it help at all?
Or should I reproduce this on a development build instead?
I can also deploy a small cluster on VMs, using the same tools as production.


Files

ceph_mgr_leak.png (30.4 KB), Tomasz Sętkowski, 09/15/2018 02:51 PM

Related issues 2 (1 open, 1 closed)

Related to mgr - Bug #36471: connection resetting tcp errors between mgr daemons (New, 10/16/2018)
Copied to mgr - Backport #36342: luminous: ceph-mgr active daemon memory leak since mimic (Rejected, Nathan Cutler)
