Bug #40260

open

Memory leak in ceph-mgr

Added by Carlos Valiente almost 5 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've just set up a Ceph cluster using Rook version 1.0.2, and I noticed that the memory usage of the ceph-mgr process was growing linearly with time, at a rate of about 70 MB per hour on an otherwise idle Ceph cluster.

My cluster is using the container image ceph/ceph:v14.2.1-20190430. A similar issue has been reported at https://github.com/ceph/ceph-container/issues/1320 for the Ceph container image tagged as v13.2.4-20190109.

I set up a Kubernetes v1.14.3 cluster (controller node and 4 worker nodes) running stock Ubuntu 18.04.2 on OpenStack VMs using kubeadm. I'm using the Weave CNI plugin, and my Kubernetes kube-proxy is using IPVS (I mention these two networking details because the symptoms seem to be network-related).
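
For reference, the kube-proxy mode and the CNI plugin can be confirmed with something like the following (this assumes the kubeadm-created kube-proxy ConfigMap in kube-system and the default name=weave-net label of the Weave DaemonSet):

$ kubectl -n kube-system get pods -l name=weave-net
$ kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'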

I then set up the Rook v1.0.2 Kubernetes operator, following the steps described in the documentation:

$ kubectl apply -f cluster/examples/kubernetes/ceph/common.yaml
$ kubectl apply -f cluster/examples/kubernetes/ceph/operator.yaml

I took common.yaml and operator.yaml from Git tag v1.0.2 and applied them unmodified.

I then deployed a Rook Ceph cluster by running the recommended step:

$ kubectl apply -f cluster/examples/kubernetes/ceph/cluster.yaml

I also took cluster.yaml from Git tag v1.0.2, but I made the following two changes:

  • I set network.hostNetwork to true, because in a previous attempt the OSD pods would not start with the default setting, as described in https://github.com/rook/rook/issues/3140 (in my case, the OSD pods did start once hostNetwork: true was set).
  • I explicitly set the storage device I want to use.
$ git diff
diff --git a/cluster/examples/kubernetes/ceph/cluster.yaml b/cluster/examples/kubernetes/ceph/cluster.yaml
index cb130e88..e14a51a4 100644
--- a/cluster/examples/kubernetes/ceph/cluster.yaml
+++ b/cluster/examples/kubernetes/ceph/cluster.yaml
@@ -43,7 +43,7 @@ spec:
     # ssl: true
   network:
     # toggle to use hostNetwork
-    hostNetwork: false
+    hostNetwork: true
   rbdMirroring:
     # The number of daemons that will perform the rbd mirroring.
     # rbd mirroring must be configured with "rbd mirror" from the rook toolbox.
@@ -90,8 +90,8 @@ spec:
 #    osd:
   storage: # cluster level storage configuration and selection
     useAllNodes: true
-    useAllDevices: true
-    deviceFilter:
+    useAllDevices: false
+    deviceFilter: vdb
     location:
     config:
       # The default and recommended storeType is dynamically set to bluestore for devices and filestore for directories.
$
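
After the operator reconciles the cluster, the effective settings can be double-checked with something like the following (assuming the default CephCluster resource name rook-ceph from the example cluster.yaml):

$ kubectl -n rook-ceph get cephcluster rook-ceph -o yaml | grep -E 'hostNetwork|deviceFilter'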

All Kubernetes pods started successfully:

$ kubectl -n rook-ceph get pods
NAME                                       READY   STATUS      RESTARTS   AGE
rook-ceph-agent-5kzbp                      1/1     Running     0          68m
rook-ceph-agent-7vwzt                      1/1     Running     0          68m
rook-ceph-agent-khzmv                      1/1     Running     0          68m
rook-ceph-agent-wcps4                      1/1     Running     0          68m
rook-ceph-mgr-a-774b6d6d78-qscl8           1/1     Running     0          58m
rook-ceph-mon-a-6bb6bdf97c-ppdsp           1/1     Running     0          59m
rook-ceph-mon-b-d9d898846-dk6f7            1/1     Running     0          59m
rook-ceph-mon-c-68d59df9f5-w2c86           1/1     Running     0          58m
rook-ceph-operator-54889dcbb-8jzn7         1/1     Running     0          69m
rook-ceph-osd-0-76bcb77688-gq6ln           1/1     Running     0          56m
rook-ceph-osd-1-6dfb7fd8cf-pb9jq           1/1     Running     0          56m
rook-ceph-osd-2-5dbcc8b98c-wdwpg           1/1     Running     0          56m
rook-ceph-osd-3-86dd498b-hkm9k             1/1     Running     0          56m
rook-ceph-osd-prepare-k8s-worker-0-zxzbd   0/2     Completed   0          57m
rook-ceph-osd-prepare-k8s-worker-1-6x74m   0/2     Completed   0          57m
rook-ceph-osd-prepare-k8s-worker-2-kn9hj   0/2     Completed   0          57m
rook-ceph-osd-prepare-k8s-worker-3-bmbtv   0/2     Completed   0          57m
rook-discover-chv5s                        1/1     Running     0          68m
rook-discover-mx9hv                        1/1     Running     0          68m
rook-discover-tglh6                        1/1     Running     0          68m
rook-discover-vfvbf                        1/1     Running     0          68m
$

At this point, even before making use of the Ceph cluster (by setting up a Ceph object store, which is what I'm after), I could already see the pod rook-ceph-mgr-a-774b6d6d78-qscl8 emitting a constant stream of log messages like this:

debug 2019-06-11 14:30:45.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:44.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:43.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:42.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:41.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:40.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
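
The timestamps suggest roughly one reset per second; the rate over a longer window can be checked with something like:

$ kubectl -n rook-ceph logs --since=1h rook-ceph-mgr-a-774b6d6d78-qscl8 | grep -c ms_handle_reset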

The memory usage of the ceph-mgr pod has grown from about 180 MB at start time to about 450 MB in about 3.5 hours (about 70 MB per hour).
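
A rough way to track that growth without extra tooling is to sample the RSS of the ceph-mgr process (PID 1 in the pod, as the ps listing below shows) every few minutes, for example:

$ while sleep 300; do
    echo -n "$(date -u +%H:%M:%S) "
    kubectl -n rook-ceph exec rook-ceph-mgr-a-774b6d6d78-qscl8 -- grep VmRSS /proc/1/status
  done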

The Ceph process running in the Kubernetes pod is the following:

$ kubectl -n rook-ceph exec rook-ceph-mgr-a-774b6d6d78-qscl8 -- ps auxww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  1.3  1.3 1557508 454876 ?      Ssl  14:05   2:57 ceph-mgr --fsid=a28b2d95-538b-46de-8b9e-b24582edd3f8 --keyring=/etc/ceph/keyring-store/keyring --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug  --default-log-to-file=false --default-mon-cluster-log-to-file=false --mon-host=[v2:10.240.0.20:3300,v1:10.240.0.20:6789],[v2:10.240.0.21:3300,v1:10.240.0.21:6789],[v2:10.240.0.22:3300,v1:10.240.0.22:6789] --mon-initial-members=a,b,c --id=a --foreground
root       736  0.0  0.0  11832  2956 pts/0    Ss+  17:39   0:00 bash
root       754  0.0  0.0  51752  3460 ?        Rs   17:40   0:00 ps auxww
$

Environment

  • Kubernetes 1.14.3 set up with kubeadm on Ubuntu 18.04.2 VMs (running on OpenStack)
  • Weave CNI plugin (Kubernetes pod networking).
  • Docker 18.06.2-ce
  • Linux kernel: reported by uname -a as Linux k8s-worker-0 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Ceph version: Using image ceph/ceph:v14.2.1-20190430

The Ceph version reported by ceph -v inside the Kubernetes pod for ceph-mgr is the following:

$ kubectl -n rook-ceph exec rook-ceph-mgr-a-774b6d6d78-qscl8 -- ceph -v
ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
$
#1

Updated by Frank Ritchie almost 4 years ago

You may want to try using

hostPID: true

in the mgr deployment; see:

https://review.opendev.org/#/c/744236/
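
For reference, a minimal way to try that on a running cluster might be a merge patch like the one below (assuming the mgr deployment is named rook-ceph-mgr-a; the Rook operator may revert manual edits to the deployments it manages):

$ kubectl -n rook-ceph patch deployment rook-ceph-mgr-a --type=merge \
    -p '{"spec":{"template":{"spec":{"hostPID":true}}}}'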
