Bug #17228

closed

Hammer OSD memory use very high (EL6)

Added by David Burns over 7 years ago. Updated almost 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a small 160 TB Ceph cluster used only as a test S3 storage repository for media content.

Problem
Since upgrading from Firefly to Hammer we have been experiencing very high OSD memory use of 2-3 GB per TB of OSD storage, i.e. typically 6-10 GB per OSD.
We have had to increase swap space just to bring the cluster to a basically functional state; clearly this significantly impacts system performance.

Hardware
4 x storage nodes with 16 OSDs per node. The OSD nodes are reasonably specified SMC storage servers with dual Xeon CPUs; storage is 16 x 3 TB SAS disks in each node.
Installed RAM is 72 GB (2 nodes) and 80 GB (2 nodes). (We note that the installed RAM is at least 50% higher than the Ceph-recommended 1 GB of RAM per TB of storage.)

Software
OSD node OS is CentOS 6.8 (with updates).

"ceph -v" -> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
(all Ceph packages downloaded from download.ceph.com)

History
Emperor 0.72.2 -> Firefly 0.80.10 -> Hammer 0.94.6 -> Hammer 0.94.7 -> Hammer 0.94.9

Health
"ceph -s"

    cluster 6280e871-1a73-4d05-ba1d-xxxxxxxxxxxx
     health HEALTH_OK
     monmap e15: 3 mons at {dr-eq-1=10.60.82.162:6789/0,dr-gs-1=10.80.82.161:6789/0,sv470-rgr1=10.60.72.88:6789/0}
            election epoch 2620, quorum 0,1,2 sv470-rgr1,dr-eq-1,dr-gs-1
     mdsmap e2: 0/0/1 up
     osdmap e64644: 64 osds: 62 up, 62 in
      pgmap v13740565: 4432 pgs, 14 pools, 53672 GB data, 45264 kobjects
            105 TB used, 64073 GB / 168 TB avail
                4432 active+clean


NB: 2 OSDs have been removed to ease memory pressure on the nodes with only 72 GB of RAM.

Example OS memory usage "free"

              total       used       free     shared    buffers     cached
 Mem:      74236480   73905092     331388          4       1712      21788
 -/+ buffers/cache:   73881592     354888
 Swap:     49938424   32412712   17525712

Example debug output for osd.30
2016-09-07 17:25:00.999381 7f5e84a387a0 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-osd, pid 20152
2016-09-07 17:25:01.199247 7f5e84a387a0 0 filestore(/var/lib/ceph/osd/ceph-30) backend xfs (magic 0x58465342)
2016-09-07 17:25:01.206209 7f5e84a387a0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-30) detect_features: FIEMAP ioctl is supported and appears to work
2016-09-07 17:25:01.206221 7f5e84a387a0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-30) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-09-07 17:25:01.318429 7f5e84a387a0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-30) detect_features: syscall(SYS_syncfs, fd) fully supported
2016-09-07 17:25:01.333549 7f5e84a387a0 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-30) detect_feature: extsize is disabled by conf
2016-09-07 17:25:01.624934 7f5e84a387a0 0 filestore(/var/lib/ceph/osd/ceph-30) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2016-09-07 17:25:01.675303 7f5e84a387a0 1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 21: 17179869184 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-09-07 17:25:01.752649 7f5e84a387a0 1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 21: 17179869184 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-09-07 17:25:01.820578 7f5e84a387a0 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2016-09-07 17:25:01.869221 7f5e84a387a0 0 osd.30 65620 crush map has features 2200130813952, adjusting msgr requires for clients
2016-09-07 17:25:01.869232 7f5e84a387a0 0 osd.30 65620 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2016-09-07 17:25:01.869239 7f5e84a387a0 0 osd.30 65620 crush map has features 2200130813952, adjusting msgr requires for osds
2016-09-07 17:25:01.869257 7f5e84a387a0 0 osd.30 65620 load_pgs
2016-09-07 17:26:49.844405 7f5e84a387a0 0 osd.30 65620 load_pgs opened 126 pgs
2016-09-07 17:26:49.926887 7f5e84a387a0 -1 osd.30 65620 log_to_monitors {default=true}
2016-09-07 17:26:49.942048 7f5e18fe8700 0 osd.30 65620 ignoring osdmap until we have initialized
2016-09-07 17:26:49.942154 7f5e18fe8700 0 osd.30 65620 ignoring osdmap until we have initialized

The base issue appears to be that a huge amount of memory is consumed by every OSD during the load_pgs phase of OSD startup.
Between 17:25:01 and 17:26:49 over 6 GB of RAM is consumed during the load_pgs phase (monitored with top). However, the memory is not released later.
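
A rough way to watch per-OSD resident memory during the load_pgs phase is to sample RSS with ps while the daemons start, e.g. (the one-minute interval and sort order are arbitrary):

# print RSS (KB) of every ceph-osd process once a minute, largest first
while true; do
    date
    ps -C ceph-osd -o pid=,rss=,args= | sort -k2 -rn
    sleep 60
done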

Example OSD heap profile "ceph tell osd.30 heap dump"

osd.30 dumping heap profile now.
------------------------------------------------
MALLOC:     6658526264 ( 6350.1 MiB) Bytes in use by application
MALLOC: +       131072 (    0.1 MiB) Bytes in page heap freelist
MALLOC: +     27755160 (   26.5 MiB) Bytes in central cache freelist
MALLOC: +      2484224 (    2.4 MiB) Bytes in transfer cache freelist
MALLOC: +      8358192 (    8.0 MiB) Bytes in thread cache freelists
MALLOC: +     14535832 (   13.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   6711790744 ( 6400.9 MiB) Actual memory used (physical + swap)
MALLOC: +            0 (    0.0 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   6711790744 ( 6400.9 MiB) Virtual address space used
MALLOC:
MALLOC:         199286              Spans in use
MALLOC:            400              Thread heaps in use
MALLOC:          32768              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
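
The freelist memory mentioned above can be handed back to the OS with the heap release command, although in this dump nearly all of the memory is in use by the application, so releasing the freelists would recover very little:

# ask osd.30's tcmalloc to return freelist pages to the OS (via madvise)
ceph tell osd.30 heap release

# re-check the allocator statistics afterwards
ceph tell osd.30 heap stats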

Attempts to mitigate memory usage via various ceph.conf tunables (examples below) were unsuccessful:

osd min pg log entries = 300
osd max pg log entries = 1000
leveldb cache size = 16777216
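
For quick experiments, options like the pg log limits can also be injected into running OSDs rather than edited in ceph.conf, although some settings (e.g. leveldb cache size) only take effect after an OSD restart:

# push the pg log limits into all running OSDs without restarting them
ceph tell osd.* injectargs '--osd_min_pg_log_entries 300 --osd_max_pg_log_entries 1000'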

What other OSD tunables are there that can reduce memory consumption? The documentation does not really cover this issue.

Any other suggestions?


Files

ceph-dr-crush.txt (4.29 KB) ceph-dr-crush.txt Crush map David Burns, 09/07/2016 08:28 AM
ceph-dr-config.txt (32.2 KB) ceph-dr-config.txt Ceph Config David Burns, 09/07/2016 08:31 AM
#1

Updated by huang jun over 7 years ago

You can try jemalloc instead of tcmalloc.
Options like osd_client_message_size_cap and osd_client_message_cap may also make sense.
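
For illustration, both caps can be set in the [osd] section of ceph.conf; the values below are arbitrary examples rather than recommendations:

[osd]
# total bytes of in-flight client messages an OSD will hold before throttling
osd client message size cap = 268435456
# number of in-flight client messages an OSD will hold before throttling
osd client message cap = 256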

#2

Updated by David Burns over 7 years ago

Thanks for the suggestions.

1. Replacing tcmalloc with jemalloc did not reduce memory consumption.

NB We also tested with gperftools 2.2-31 (containing libtcmalloc.so.4.2.1) which also had no impact.

2. Configuring specified options (osd_client_message_size_cap and osd_client_message_cap) had no impact.

To reiterate: all OSDs are showing very high memory consumption (6-10 GB per OSD) despite the cluster having HEALTH_OK status.

Lastly, we are currently using the following general OSD configuration:

[osd]
osd mkfs type = xfs
osd mkfs options xfs = "-f" 
osd mount options xfs = "rw,noatime,nodiratime,inode64,logbufs=8,logbsize=256k" 
# set lower disk priority for scrubbing
osd disk thread ioprio class = "idle" 
osd disk thread ioprio priority = 7
# reduce leveldb cache
leveldb cache size = 16777216
# reduce OSD map caching
osd map cache size = 50
osd map max advance = 25
osd map share max epochs = 25
osd pg epoch persisted max stale = 25
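
A quick way to confirm that a running OSD actually picked up these values is to query its admin socket (assuming the socket is at its default path):

# query a live OSD for one of the map cache settings
ceph daemon osd.30 config get osd_map_cache_size

# or grep the full running configuration
ceph daemon osd.30 config show | grep osd_map
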
#3

Updated by David Burns over 7 years ago

We have recently upgraded one of the OSD nodes from CentOS 6.8 to CentOS 7.2.

No change in memory usage was observed (OSDs still using 6-10GB of memory each) - so we don't believe this issue relates to EL6.

#4

Updated by Sage Weil almost 7 years ago

Just saw this. Is this still a problem? Can you attach "ceph osd dump" output? Are you using an erasure coding pool?

#5

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Need More Info
#6

Updated by David Burns almost 7 years ago

Thanks Sage,

This is no longer an issue for this cluster. We have managed to work around the issue by "regenerating" all OSDs
(ceph osd out, wait for redistribution, remove the OSD, re-initialise it, then bring the new OSD back into service).
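
A regeneration cycle for a single OSD might look roughly like this; osd.30 and /dev/sdX are placeholders, and the sysvinit syntax matches the EL6 packages in use here:

# drain the OSD and wait for the cluster to return to active+clean
ceph osd out 30

# stop the daemon and remove the OSD from CRUSH, auth and the OSD map
service ceph stop osd.30
ceph osd crush remove osd.30
ceph auth del osd.30
ceph osd rm 30

# re-initialise the disk and bring a fresh OSD back into service
ceph-disk prepare /dev/sdX
ceph-disk activate /dev/sdX1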

We discovered this fix by experimentation - although we were close to completely destroying and rebuilding the cluster.

Currently only replicated pools are in use - although we're about to start testing EC pools.

We still have a production Firefly cluster to upgrade to Hammer - we have doubled the installed RAM on all servers as a precaution (installed RAM is ~1.5-2GB per TB of disk).

#7

Updated by Josh Durgin almost 7 years ago

  • Status changed from Need More Info to Can't reproduce

If you do see this again, please re-open and we can debug further.
