Bug #12681

closed

tcmalloc doesn't release memory back to system

Added by Xiaogang Chang over 8 years ago. Updated over 8 years ago.

Status: Won't Fix
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observed high memory consumption by OSD daemons while evaluating Hammer (0.94.2) in the Yahoo! environment. Initially, memory consumption per OSD was only about 1G (or even less), but it grew to almost 3G after a 2-hour load test. This caused the system to freeze/OOM (10 OSDs run on one host with 32G of physical memory), and eventually some OSD processes were killed by the system.

Below is the heap dump; almost 1.9G of memory was held in the page heap freelist.

$ sudo ceph tell osd.32 heap stats
osd.32 tcmalloc heap stats:------------------------------------------------
MALLOC: 1217475232 ( 1161.1 MiB) Bytes in use by application
MALLOC: + 2031304704 ( 1937.2 MiB) Bytes in page heap freelist
MALLOC: + 49365752 ( 47.1 MiB) Bytes in central cache freelist
MALLOC: + 4159488 ( 4.0 MiB) Bytes in transfer cache freelist
MALLOC: + 56619624 ( 54.0 MiB) Bytes in thread cache freelists
MALLOC: + 14528664 ( 13.9 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 3373453464 ( 3217.2 MiB) Actual memory used (physical + swap)
MALLOC: + 4677632 ( 4.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 3378131096 ( 3221.6 MiB) Virtual address space used
MALLOC:
MALLOC: 135722 Spans in use
MALLOC: 1152 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------

After running "heap release", the 1.9G of memory was released back to the system.

$ sudo ceph tell osd.32 heap release
osd.32 releasing free RAM back to system.

$ sudo ceph tell osd.32 heap stats
osd.32 tcmalloc heap stats:------------------------------------------------
MALLOC: 1211800264 ( 1155.7 MiB) Bytes in use by application
MALLOC: + 286720 ( 0.3 MiB) Bytes in page heap freelist
MALLOC: + 54191184 ( 51.7 MiB) Bytes in central cache freelist
MALLOC: + 7419648 ( 7.1 MiB) Bytes in transfer cache freelist
MALLOC: + 54217192 ( 51.7 MiB) Bytes in thread cache freelists
MALLOC: + 14528664 ( 13.9 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1342443672 ( 1280.3 MiB) Actual memory used (physical + swap)
MALLOC: + 2035687424 ( 1941.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 3378131096 ( 3221.6 MiB) Virtual address space used
MALLOC:
MALLOC: 135729 Spans in use
MALLOC: 1148 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
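To compare dumps like the two above without reading them by eye, the page heap freelist figure can be pulled out of the stats text. The following is a minimal sketch, not part of the original report: the function name page_heap_freelist_bytes is hypothetical, and it assumes the line layout shown in the dumps above.

```shell
# Print the "Bytes in page heap freelist" value (in bytes) from
# tcmalloc heap stats text read on stdin.
# Assumes the layout shown in the dumps above:
#   MALLOC: + 2031304704 ( 1937.2 MiB) Bytes in page heap freelist
page_heap_freelist_bytes() {
    awk '/Bytes in page heap freelist/ {
        # The first all-digit field on the line is the byte count.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+$/) { print $i; exit }
    }'
}
```

It could be used as `sudo ceph tell osd.32 heap stats 2>&1 | page_heap_freelist_bytes`. The `2>&1` is there on the assumption that some Ceph versions write the stats to stderr; it is worth checking on your version.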

Actions #1

Updated by Sage Weil over 8 years ago

Was any recovery taking place? This could be this bug: https://github.com/ceph/ceph/pull/5451

Actions #2

Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info
Actions #3

Updated by Xiaogang Chang over 8 years ago

No recovery was ongoing. I just started the service and put load on it for 1~2 hours, then ran into this situation. There were no OSD map changes during the test.

Actions #4

Updated by Sage Weil over 8 years ago

  • Subject changed from High memory consumption by Hammer OSD daemons to tcmalloc doesn't release memory back to system

We're hoping to switch to jemalloc... tcmalloc is annoying in many other ways as well.

Actions #5

Updated by Greg Farnum over 8 years ago

We usually see this happening with specific releases of tcmalloc and host operating systems, although it's not something that's been well-quantified yet. What distro and tcmalloc version is in use?

Actions #6

Updated by Xiaogang Chang over 8 years ago

Hi, Greg

I'm using RHEL-6 with kernel 2.6.32; tcmalloc is 4.1.0 with gperftools-2.0. I used the same OS and tcmalloc with Giant, and there was no issue on Giant.

Actions #7

Updated by Greg Farnum over 8 years ago

Do you have any special config settings, or is everything on the default?

Actions #8

Updated by Greg Farnum over 8 years ago

And what's the source for that libtcmalloc release?

Actions #9

Updated by Sage Weil over 8 years ago

  • Status changed from Need More Info to Won't Fix

This is definitely a tcmalloc issue. I think the place to go is RHEL 6 support? Perhaps there is a backport of a newer perftools available?

As a workaround, you might do the heap release command via cron or something...
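One way the cron workaround might look is sketched below. This is an illustration under assumptions, not a tested procedure from this thread: the function name release_local_osds, the script path in the crontab comment, and the 10-minute interval are all hypothetical, and it assumes OSD data directories follow the default /var/lib/ceph/osd/ceph-<id> layout with the cluster named "ceph".

```shell
# Ask every OSD with a data directory under $1 (default /var/lib/ceph/osd)
# to return its tcmalloc freelist to the OS.
# Assumption: directories are named ceph-<id>, the default layout.
release_local_osds() (
    osd_root="${1:-/var/lib/ceph/osd}"
    for dir in "$osd_root"/ceph-*; do
        [ -d "$dir" ] || continue      # skip if the glob matched nothing
        id="${dir##*-}"                # .../ceph-32 -> 32
        ceph tell "osd.$id" heap release
    done
)

# Hypothetical root crontab entry, assuming the function above is saved in
# a script at /usr/local/bin/osd-heap-release.sh that ends by calling
# release_local_osds:
#   */10 * * * * /usr/local/bin/osd-heap-release.sh
```

The function only issues the same "heap release" command shown in the description, once per local OSD, so it does not address why tcmalloc holds on to the memory in the first place.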

Actions #10

Updated by c sights over 8 years ago

Possibly ceph OSDs could watch their free list (or used swap on the node) and tell themselves to free?
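Short of building this into the OSD itself, the idea can be approximated from outside the daemon. The sketch below is hypothetical: the function name release_if_bloated and the 1 GiB default threshold are illustrative choices, and it assumes the heap stats layout shown in the description (with `2>&1` in case the stats arrive on stderr).

```shell
# Release the freelist of OSD $1 only when its page heap freelist
# exceeds $2 bytes (default 1 GiB, an arbitrary illustrative threshold).
release_if_bloated() (
    id="$1"
    limit="${2:-1073741824}"
    # Extract the byte count from the "page heap freelist" stats line.
    free=$(ceph tell "osd.$id" heap stats 2>&1 |
        awk '/Bytes in page heap freelist/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) { print $i; exit }
        }')
    # No parsable value (for example, the OSD is down): do nothing.
    [ -n "$free" ] || exit 0
    if [ "$free" -gt "$limit" ]; then
        ceph tell "osd.$id" heap release
    fi
)
```

For example, `release_if_bloated 32` would issue a release for osd.32 in the state shown in the first heap dump (about 1.9G on the freelist) and do nothing in the state shown in the second.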

Actions #11

Updated by Star Guo over 8 years ago

https://github.com/kuszmaul/SuperMalloc seems to show good performance.
