Bug #40260
Memory leak in ceph-mgr
Description
I've just set up a Ceph cluster using Rook version 1.0.2, and I noticed that the memory usage of the ceph-mgr
process was growing linearly with time, at a rate of about 70 MB per hour on an otherwise idle Ceph cluster.
My cluster is using the container image ceph/ceph:v14.2.1-20190430. A similar issue has been reported at https://github.com/ceph/ceph-container/issues/1320 for the Ceph container image tagged v13.2.4-20190109.
I set up a Kubernetes v1.14.3 cluster (controller node and 4 worker nodes) running stock Ubuntu 18.04.2 on OpenStack VMs using kubeadm. I'm using the Weave CNI plugin, and my Kubernetes kube-proxy is using IPVS (I mention these two networking details because the symptoms seem to be network-related).
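As a sanity check on the IPVS detail, the kube-proxy mode can be read back from the ConfigMap that kubeadm creates (the ConfigMap name kube-proxy is the kubeadm default; this is just a quick check, not part of the reproduction):

$ kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
# expected to print something like: mode: "ipvs"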
I then set up a Rook v1.0.2 Kubernetes operator following the steps described in the documentation:

$ kubectl apply -f cluster/examples/kubernetes/ceph/common.yaml
$ kubectl apply -f cluster/examples/kubernetes/ceph/operator.yaml

I took common.yaml and operator.yaml from Git tag v1.0.2 and applied them unmodified.
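For completeness, the operator pod can be checked before going further (the app=rook-ceph-operator label selector is an assumption about the label Rook sets; listing all pods in the namespace works just as well):

$ kubectl -n rook-ceph get pods -l app=rook-ceph-operator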
I then deployed a Rook Ceph cluster with the recommended step:
$ kubectl apply -f cluster/examples/kubernetes/ceph/cluster.yaml
I also took cluster.yaml from Git tag v1.0.2, but I made the following two changes:
- I set network.hostNetwork to true, because in a previous attempt the OSD pods would not start with the default setting, as described in https://github.com/rook/rook/issues/3140 (although, in my case, the OSD pods did start after specifying hostNetwork: true).
- I explicitly set the storage device I want to use.
$ git diff
diff --git a/cluster/examples/kubernetes/ceph/cluster.yaml b/cluster/examples/kubernetes/ceph/cluster.yaml
index cb130e88..e14a51a4 100644
--- a/cluster/examples/kubernetes/ceph/cluster.yaml
+++ b/cluster/examples/kubernetes/ceph/cluster.yaml
@@ -43,7 +43,7 @@ spec:
     # ssl: true
   network:
     # toggle to use hostNetwork
-    hostNetwork: false
+    hostNetwork: true
   rbdMirroring:
     # The number of daemons that will perform the rbd mirroring.
     # rbd mirroring must be configured with "rbd mirror" from the rook toolbox.
@@ -90,8 +90,8 @@ spec:
   #   osd:
   storage: # cluster level storage configuration and selection
     useAllNodes: true
-    useAllDevices: true
-    deviceFilter:
+    useAllDevices: false
+    deviceFilter: vdb
     location:
     config:
       # The default and recommended storeType is dynamically set to bluestore for devices and filestore for directories.
$
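For what it's worth, the applied settings can be read back from the CephCluster object to confirm they took effect (the object name rook-ceph is the default from cluster.yaml):

$ kubectl -n rook-ceph get cephcluster rook-ceph \
    -o jsonpath='{.spec.network.hostNetwork}{"\n"}{.spec.storage.deviceFilter}{"\n"}'
# expected output: true, then vdb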
All Kubernetes pods started successfully:
$ kubectl -n rook-ceph get pods
NAME                                       READY   STATUS      RESTARTS   AGE
rook-ceph-agent-5kzbp                      1/1     Running     0          68m
rook-ceph-agent-7vwzt                      1/1     Running     0          68m
rook-ceph-agent-khzmv                      1/1     Running     0          68m
rook-ceph-agent-wcps4                      1/1     Running     0          68m
rook-ceph-mgr-a-774b6d6d78-qscl8           1/1     Running     0          58m
rook-ceph-mon-a-6bb6bdf97c-ppdsp           1/1     Running     0          59m
rook-ceph-mon-b-d9d898846-dk6f7            1/1     Running     0          59m
rook-ceph-mon-c-68d59df9f5-w2c86           1/1     Running     0          58m
rook-ceph-operator-54889dcbb-8jzn7         1/1     Running     0          69m
rook-ceph-osd-0-76bcb77688-gq6ln           1/1     Running     0          56m
rook-ceph-osd-1-6dfb7fd8cf-pb9jq           1/1     Running     0          56m
rook-ceph-osd-2-5dbcc8b98c-wdwpg           1/1     Running     0          56m
rook-ceph-osd-3-86dd498b-hkm9k             1/1     Running     0          56m
rook-ceph-osd-prepare-k8s-worker-0-zxzbd   0/2     Completed   0          57m
rook-ceph-osd-prepare-k8s-worker-1-6x74m   0/2     Completed   0          57m
rook-ceph-osd-prepare-k8s-worker-2-kn9hj   0/2     Completed   0          57m
rook-ceph-osd-prepare-k8s-worker-3-bmbtv   0/2     Completed   0          57m
rook-discover-chv5s                        1/1     Running     0          68m
rook-discover-mx9hv                        1/1     Running     0          68m
rook-discover-tglh6                        1/1     Running     0          68m
rook-discover-vfvbf                        1/1     Running     0          68m
$
At this point, even before making use of the Ceph cluster (by setting up a Ceph object store, which is what I'm after), I could already see the pod rook-ceph-mgr-a-774b6d6d78-qscl8 emitting a constant stream of log messages like this:
debug 2019-06-11 14:30:45.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:44.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:43.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:42.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:41.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
debug 2019-06-11 14:30:40.824 7fb48cc6f700  0 client.0 ms_handle_reset on v2:10.240.0.20:6800/1
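The timestamps above are one second apart, so the stream amounts to roughly one reset per second. A rough way to confirm that rate over a longer window (using the pod name from above):

$ kubectl -n rook-ceph logs --since=10m rook-ceph-mgr-a-774b6d6d78-qscl8 | grep -c ms_handle_reset
# at one message per second, this should print about 600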
The memory usage of the ceph-mgr pod has grown from about 180 MB at start time to about 450 MB in about 3.5 hours (about 70 MB per hour).
The Ceph process running in the Kubernetes pod is the following:
$ kubectl -n rook-ceph exec rook-ceph-mgr-a-774b6d6d78-qscl8 ps auxww
USER   PID %CPU %MEM     VSZ    RSS TTY   STAT START  TIME COMMAND
root     1  1.3  1.3 1557508 454876 ?     Ssl  14:05  2:57 ceph-mgr --fsid=a28b2d95-538b-46de-8b9e-b24582edd3f8 --keyring=/etc/ceph/keyring-store/keyring --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --mon-host=[v2:10.240.0.20:3300,v1:10.240.0.20:6789],[v2:10.240.0.21:3300,v1:10.240.0.21:6789],[v2:10.240.0.22:3300,v1:10.240.0.22:6789] --mon-initial-members=a,b,c --id=a --foreground
root   736  0.0  0.0   11832   2956 pts/0 Ss+  17:39  0:00 bash
root   754  0.0  0.0   51752   3460 ?     Rs   17:40  0:00 ps auxww
$
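To put numbers behind the growth rate, one option is to periodically sample the resident set size of the ceph-mgr process; the loop below is just a sketch, assuming the process is PID 1 inside the pod (as the ps output above shows) and that awk is available in the container image:

$ while true; do
>   printf '%s ' "$(date -u '+%H:%M:%S')"
>   kubectl -n rook-ceph exec rook-ceph-mgr-a-774b6d6d78-qscl8 -- \
>     awk '/VmRSS/ {print $2, $3}' /proc/1/status
>   sleep 300   # one sample every 5 minutes
> done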
Environment
- Kubernetes 1.14.3 set up with kubeadm on Ubuntu 18.04.2 VMs (running on OpenStack)
- Weave CNI plugin (the Kubernetes networking helper)
- Docker 18.06.2-ce
- Linux kernel (as reported by uname -a): Linux k8s-worker-0 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Ceph version: container image ceph/ceph:v14.2.1-20190430
The Ceph version reported by ceph -v inside the Kubernetes pod for ceph-mgr is the following:
$ kubectl -n rook-ceph exec rook-ceph-mgr-a-774b6d6d78-qscl8 -- ceph -v
ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
$
Updated by Frank Ritchie almost 4 years ago
You may want to try using hostPID: true in the mgr deployment, see:
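For reference, one way to try that on a running cluster is a strategic merge patch against the mgr deployment (a sketch; note that the Rook operator manages this deployment and may revert manual edits on its next reconcile):

$ kubectl -n rook-ceph patch deployment rook-ceph-mgr-a \
    -p '{"spec":{"template":{"spec":{"hostPID":true}}}}'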