
Bug #12516

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable: is the default value of 32 MB enough for Ceph daemons?

Added by Vikhyat Umrao over 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

mpstat sees high %user in libtcmalloc.so.4.1.2 while a fio test is running against an RBD image mapped to a VM.

Running "perf top" on one of the OSD nodes shows:

34.37% libtcmalloc.so.4.1.2 [.] tcmalloc::CentralFreeList::FetchFromSpans
18.06% libtcmalloc.so.4.1.2 [.] tcmalloc::ThreadCache::ReleaseToCentralCache
13.76% libtcmalloc.so.4.1.2 [.] tcmalloc::CentralFreeList::ReleaseToSpans
1.45% libtcmalloc.so.4.1.2 [.] tcmalloc::CentralFreeList::RemoveRange

The "mpstat -P ALL 1" on the ceph OSD node shows the value between 80 and 90 in the %user column.

https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23575.html

As described in that thread, gperftools 2.1.90 has the fix. The gperftools version shipped with Ceph 0.80.9 is gperftools-libs-2.1-1.el7.

There is also a TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable, and we are trying to determine the right value to set it to.
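As a hedged illustration only (the right value is exactly what this ticket is asking for), one could try a larger ceiling on a single OSD started by hand. The 64 MB figure and the OSD id below are assumptions for the example, not recommendations:

# 64 MB in bytes: 64 * 1024 * 1024 = 67108864 (illustrative value only)
export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((64 * 1024 * 1024))
ceph-osd -i 0 --cluster ceph -f   # run one OSD in the foreground for the test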

Version-Release number of selected component (if applicable):

$ rpm -qa | grep ceph
ceph-common-0.80.9-0.el7.x86_64
ceph-0.80.9-0.el7.x86_64
libcephfs1-0.80.9-0.el7.x86_64
python-ceph-0.80.9-0.el7.x86_64
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)
$ rpm -qa | grep gperftools
gperftools-libs-2.1-1.el7
$ cat test.fio
[global]
ioengine=libaio
iodepth=32
rw=randwrite
runtime=60
bs=16k
direct=1
buffered=0
size=1024M
numjobs=4
group_reporting

[test]
directory=/mnt/test

To run fio in the VM:

$ fio test.fio

We have started working on this issue and found that without this patch, TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES has no effect:

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't affect tcmalloc behavior
https://code.google.com/p/gperftools/issues/detail?id=585

The patch can be found at the link above.
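A quick way to check whether the installed gperftools predates the fixed release (2.1.90) mentioned above:

$ rpm -q gperftools-libs
gperftools-libs-2.1-1.el7    # older than 2.1.90, so the variable is ignored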

So, as per my understanding, we need to backport this patch to the gperftools shipped with Ceph, and also modify our init script if 32 MB is not enough.

Edit /etc/init.d/ceph and add one line:

cmd="TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=<the-right-value> $cmd"

after the [ -n "$max_open_files" ] && files="ulimit -n $max_open_files;" line and before the if [ -n "$SYSTEMD_RUN" ]; check.

So the final script would be:

[ -n "$max_open_files" ] && files="ulimit -n $max_open_files;"
cmd="TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=<the-right-value> $cmd"
if [ -n "$SYSTEMD_RUN" ]; then
    cmd="$SYSTEMD_RUN -r bash -c '$files $cmd --cluster $cluster -f'"
else
    cmd="$files $wrap $cmd --cluster $cluster $runmode"
fi
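The daemons then need to be restarted so the new environment takes effect. Assuming the stock sysvinit setup shown above, something like:

$ service ceph restart osd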

History

#1 Updated by Kefu Chai over 8 years ago

  • Description updated (diff)

#2 Updated by Kefu Chai over 8 years ago

here is a related discussion on ceph-devel: http://www.spinics.net/lists/ceph-devel/msg23757.html

#3 Updated by Star Guo over 8 years ago

I compiled and installed gperftools 2.4 and Ceph 0.80.10 on CentOS 7.1, and added `TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES` to the script. However, when I run a fio randwrite test with bs=4k, the CPU usage on the OSD nodes is still very high.


[root@osd-1 ~]# rpm -qa | grep gper
gperftools-2.4-1.el7.centos.x86_64
gperftools-libs-2.4-1.el7.centos.x86_64
gperftools-devel-2.4-1.el7.centos.x86_64


[root@osd-1 ~]# rpm -qa | grep ceph
ceph-deploy-1.5.25-1.el7.noarch
ceph-test-0.80.10-0.2.el7.centos.x86_64
libcephfs1-0.80.10-0.2.el7.centos.x86_64
ceph-common-0.80.10-0.2.el7.centos.x86_64
python-cephfs-0.80.10-0.2.el7.centos.x86_64
ceph-0.80.10-0.2.el7.centos.x86_64

How can I fix it?

#4 Updated by huang jun over 8 years ago

Maybe you can try jemalloc. We hit a problem like yours when using tcmalloc; we switched to jemalloc, and it has worked better so far.
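For anyone who wants to try that without rebuilding Ceph, a hedged sketch: preload jemalloc when starting a daemon by hand. The library path is typical for the EPEL jemalloc package on EL7 and may differ on your system:

LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ceph-osd -i 0 --cluster ceph -f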

#5 Updated by Vikhyat Umrao almost 8 years ago

  • Status changed from New to Resolved
