Bug #37980
openluminous: osd memory use very high, and mismatch between RES and heap stats
Description
ceph 12.2.1
3 nodes, 30 osds per node
ec pool:4+2
After running for 2 months, we find that some OSDs' memory use is very high in top (4-5 GB), and the heap stats look like this:
top - 10:45:01 up 73 days, 1:20, 1 user, load average: 10.26, 9.74, 9.95
Tasks: 657 total, 3 running, 654 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.6 us, 6.8 sy, 0.0 ni, 82.7 id, 3.7 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem: 65325552 total, 64448092 used, 877460 free, 90120 buffers
KiB Swap: 0 total, 0 used, 0 free. 446164 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18385 ceph 20 0 7369196 5.114g 7232 S 6.1 8.2 2873:28 /usr/bin/ceph-osd -f --cluster ceph --id 61 --setuser ceph --setg+
osd.61 tcmalloc heap stats:
MALLOC: 2239198296 ( 2135.5 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 72369432 ( 69.0 MiB) Bytes in central cache freelist
MALLOC: + 13839792 ( 13.2 MiB) Bytes in transfer cache freelist
MALLOC: + 104315104 ( 99.5 MiB) Bytes in thread cache freelists
MALLOC: + 25096352 ( 23.9 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 2454818976 ( 2341.1 MiB) Actual memory used (physical + swap)
MALLOC: + 4095991808 ( 3906.2 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 6550810784 ( 6247.3 MiB) Virtual address space used
MALLOC:
MALLOC: 136858 Spans in use
MALLOC: 63 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
RES shows osd.61 using 5 GB+ of memory, but the heap stats report "Actual memory used" of only about 2 GB. We also find that the OSDs with high RES have correspondingly high "Bytes released to OS" values.
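The heap stats above are internally consistent: the lines from "Bytes in use by application" down to "malloc metadata" sum exactly to "Actual memory used", and adding "Bytes released to OS" gives the "Virtual address space used" line. The mismatch is only between RES and the heap stats. A quick check of the arithmetic, with the byte values copied from the osd.61 output above:

```python
# Values (in bytes) copied from the osd.61 tcmalloc heap stats above.
in_use_by_app      = 2_239_198_296
page_heap_freelist = 0
central_freelist   = 72_369_432
transfer_freelist  = 13_839_792
thread_freelists   = 104_315_104
malloc_metadata    = 25_096_352
released_to_os     = 4_095_991_808

# "Actual memory used (physical + swap)" is the sum of the lines above it.
actual_used = (in_use_by_app + page_heap_freelist + central_freelist
               + transfer_freelist + thread_freelists + malloc_metadata)
print(actual_used)   # 2454818976 bytes, i.e. the reported ~2341.1 MiB

# "Virtual address space used" adds the pages released back to the OS.
virtual_used = actual_used + released_to_os
print(virtual_used)  # 6550810784 bytes, i.e. the reported ~6247.3 MiB

# RES from top is ~5.114 GiB; the excess over "Actual memory used" can only
# come from "released" pages that the kernel still counts as resident.
res_bytes = int(5.114 * 2**30)
print((res_bytes - actual_used) / 2**20)  # ~2896 MiB not explained by heap stats
```

So roughly 2.8 GiB of pages that tcmalloc considers released to the OS are still resident, which points at the kernel not actually reclaiming the madvised ranges.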
After we restart the OSD, the memory is released.
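As a lighter-weight alternative to restarting, tcmalloc can be asked to hand its free pages back to the OS through Ceph's admin interface. A sketch using osd.61 from the report (whether RES actually drops afterwards still depends on the kernel reclaiming the madvised pages, so this may not help if THP is the cause):

```shell
# Dump the current tcmalloc heap stats for the OSD (same output as above).
ceph tell osd.61 heap stats

# Ask tcmalloc to release its free pages back to the OS without a restart.
ceph tell osd.61 heap release
```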
Has anyone encountered a similar problem?
Updated by zhou yang over 5 years ago
I am using BlueStore, and my client is RBD with an EC data pool.
The cluster is running on CentOS 7.0.1406; the tcmalloc version is 4.2.6.
Updated by Nathan Cutler over 5 years ago
ceph 12.2.1
Are you really running that version, 12.2.1 ?
Updated by Mark Nelson over 5 years ago
Hi,
Oftentimes this kind of thing is related to transparent huge pages. There definitely seem to be different kinds of behavior on different kernels from what I've seen. There's a higher-level tcmalloc issue for this here (not Ceph related):
https://github.com/gperftools/gperftools/issues/990
I'd try either disabling THP or setting max_ptes_none to 0 as reported in that issue and see if that helps. I'm pretty sure I've done that with Ceph in the past and seen improvements in the behavior when this has been a problem.
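Both mitigations are runtime-tunable via sysfs (paths as on stock CentOS 7 kernels; some distributions mount THP under a slightly different path). A sketch, run as root on each OSD node, and not persistent across reboots unless added to boot configuration:

```shell
# Option 1: disable transparent huge pages entirely.
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Option 2: keep THP, but stop khugepaged from collapsing ranges that
# contain unmapped (madvised-away) pages back into huge pages.
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

# Verify the current settings.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
```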
Updated by zhou yang over 5 years ago
Thanks a lot.
disabling THP or setting max_ptes_none to 0
I will try this later and see if it helps. Since the problem cannot be reproduced in a short time, I will keep tracking it and report the result.