Bug #50681
memstore: apparent memory leak when removing objects
Description
When I create and unlink big files, as in this little program, in my development environment, the OSD daemon keeps claiming more and more memory (using the memstore backend), eventually resulting in an OOM kill. If I limit the memory with "osd memory target" and disable the cache, it just blocks when the memory is used up. If I switch to the filestore backend, the memory leak is gone. Although memstore is not meant for production use, this is a problem when it is used for benchmarking other Ceph-related code.
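The reproducer itself is only linked externally; a minimal sketch of the workload described above (create a large file, unlink it, repeat) might look like the following. The mountpoint handling is an assumption: in the report this would run against the CephFS mount, while here MNT falls back to a temp dir so the sketch is runnable anywhere.

```shell
# Hypothetical sketch of the reported workload: repeatedly create
# and unlink large files. In the bug report this would target the
# CephFS mountpoint; MNT defaulting to a temp dir is an assumption
# made so the sketch runs on any filesystem.
MNT=${MNT:-$(mktemp -d)}
for i in $(seq 1 5); do
    dd if=/dev/zero of="$MNT/big_$i" bs=1M count=16 2>/dev/null
    rm "$MNT/big_$i"
done
```

After the loop the directory is empty again, so on an ordinary filesystem the space is returned; the report is that memstore's RSS keeps growing anyway.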
This is my ceph.conf:
[global]
fsid = $(uuidgen)
osd crush chooseleaf type = 0
run dir = ${DIR}/run
auth cluster required = none
auth service required = none
auth client required = none
osd pool default size = 1
mon host = ${HOSTNAME}

[mds.${MDS_NAME}]
host = ${HOSTNAME}

[mon.${MON_NAME}]
log file = ${LOG_DIR}/mon.log
chdir = ""
mon cluster log file = ${LOG_DIR}/mon-cluster.log
mon data = ${MON_DATA}
mon data avail crit = 0
mon addr = ${HOSTNAME}
mon allow pool delete = true

[osd.0]
log file = ${LOG_DIR}/osd.log
chdir = ""
osd data = ${OSD_DATA}
osd journal = ${OSD_DATA}.journal
osd journal size = 100
osd objectstore = memstore
osd class load list = *
osd class default list = *
osd_max_object_name_len = 256
Updated by Sven Anderson almost 3 years ago
The title should say "osd objectstore = memstore"
Updated by Greg Farnum almost 3 years ago
- Subject changed from Memory leak when creating and unlinking files with osd objectstore = filestore to Memory leak when creating and unlinking files with osd objectstore = memstore
I’m not totally clear on what you’re doing here and what you think the erroneous behavior is. Memstore only stores data in memory, so of course storing more uses up the memory.
File deletes are not processed instantaneously, nor are files snapshotted automatically. The MDS has to do background deletes of the relevant objects when a client performs an unlink, but it can't do that until the client drops all the capabilities for the file in question.
My guess is that you have a mount which is maintaining caps on the files because you’re not generating enough files to push them out of its LRU list, and not waiting for it to decide you’ve lost interest in the files in question.
Updated by Sven Anderson almost 3 years ago
Thanks, Greg, for your answer. My expectation was that, at least when there is memory pressure or I am unmounting the CephFS, the memory from the unlinked files would either be returned to the system or be reused for the next run of the benchmark. Did you notice the code snippet that I linked here: https://paste.ee/p/fUmYX ? That's all I am running. After each run, the RSS of the OSD daemon is 2.5GB larger. Since I'm unmounting, I assume all caps are dropped as well. Can I manually trigger the GC in the MDS to check whether that would solve the issue?
Updated by Patrick Donnelly almost 3 years ago
- Project changed from CephFS to RADOS
- Category changed from Performance/Resource Usage to Performance/Resource Usage
Updated by Greg Farnum almost 3 years ago
- Project changed from RADOS to CephFS
- Category changed from Performance/Resource Usage to Performance/Resource Usage
How long did you wait to see if memory usage dropped? Did you look at any logs or dump any pool object info?
I really think you're just seeing the impact of the background file deletion from the MDS. Not sure how to manually trigger it; I think it just runs at what it considers an appropriate rate.
Also, it's memstore: there may be tunings that don't work well on OSDs of this size which we aren't going to fuss over.
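If MDS purge throttling were the cause, one place to look would be the MDS purge settings. The option names below are an assumption (they existed in Ceph releases of roughly this era; verify against your release's documentation or `ceph daemon mds.<id> config show`), and the values are illustrative only, not recommendations:

```ini
[mds]
# Hypothetical tuning sketch: raise the MDS background-purge
# throttles so unlinked files are deleted from RADOS faster.
# Option names and defaults vary by Ceph release -- verify first.
mds max purge files = 256
mds max purge ops = 32768
mds max purge ops per pg = 1.0
```

As the thread later shows, RADOS itself only held a few MB, so in this case the purge rate was not the problem.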
Updated by Sven Anderson almost 3 years ago
- File ceph.tar.bz2 ceph.tar.bz2 added
Greg Farnum wrote:
How long did you wait to see if memory usage dropped? Did you look at any logs or dump any pool object info?
For hours. I did look at the logs, but I can't tell whether there is anything unusual in them. Please check the attached files. I also added some command dumps in the out/ subdirectory.
I really think you're just seeing the impact of the background file deletion from the MDS. Not sure how to manually trigger it; I think it just runs at what it considers an appropriate rate.
Also, it's memstore: there may be tunings that don't work well on OSDs of this size which we aren't going to fuss over.
I also tried 4MB files. Same effect.
Updated by Sven Anderson almost 3 years ago
The ceph-osd had a RES memory footprint of 2.6GB while I created the above files.
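A resident-set figure like the 2.6GB above can be sampled directly from /proc. A sketch (the ceph-osd process lookup is an assumption; PID defaults to the current shell so the sketch is runnable anywhere):

```shell
# Sketch: print the resident set size (VmRSS, in kB) of a process.
# For the bug one would set PID=$(pidof ceph-osd); defaulting to the
# current shell's PID is an assumption so the sketch runs standalone.
PID=${PID:-$$}
rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$PID/status")
echo "PID $PID RSS: ${rss_kb} kB"
```

Sampling this in a loop before and after each benchmark run would show whether the OSD's RSS ever shrinks after the unlinks complete.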
Updated by Greg Farnum almost 3 years ago
- Project changed from CephFS to RADOS
- Subject changed from Memory leak when creating and unlinking files with osd objectstore = memstore to memstore: apparent memory leak when removing objects
- Category changed from Performance/Resource Usage to Performance/Resource Usage
- Component(RADOS) OSD added
Sven Anderson wrote:
Greg Farnum wrote:
How long did you wait to see if memory usage dropped? Did you look at any logs or dump any pool object info?
For hours. I did look at the logs, but I can't tell whether there is anything unusual in them. Please check the attached files. I also added some command dumps in the out/ subdirectory.
Okay, well, the pg dump says there are only about 6 MB of data in RADOS, so that's pretty good evidence it's an issue in memstore.
Thanks for the report and the logs!