Bug #22616
bluestore_cache_data uses too much memory
Status: Closed
Description
I was running a read throughput test and found that some of my osds were killed by the OOM killer and restarted.
I found that the OOM-killed osds used much more memory for bluestore_cache_data than the normal ones.
The OOM-killed osd used 795MB of RAM in the mempool and 722MB in bluestore_cache_data.
A normal osd used about 120MB of RAM in the mempool and 17MB in bluestore_cache_data.
Graph of memory usage of the OOM-killed osd: https://pasteboard.co/H1GzihS.png
Graph of memory usage of the normal osd: https://pasteboard.co/H1GzaeF.png
My bluestore cache settings:
[osd]
osd max backfills = 4
bluestore_cache_size = 134217728
bluestore_cache_kv_max = 134217728
osd client message size cap = 67108864
As far as I know, if I use the default cache ratio settings, no portion of the cache should go into bluestore_cache_data, but the mempool dump data shows otherwise.
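For reference, the per-pool numbers above come from the osd admin socket; a minimal way to check them on a running osd (osd.0 here is just an example id) is:
ceph daemon osd.0 dump_mempools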
Updated by Patrick Donnelly over 6 years ago
- Project changed from Ceph to bluestore
Updated by frank lin over 6 years ago
The workload for the read throughput test is 6 fio servers with the following parameters:
[4m-seq]
description="4m-seq-read"
direct=1
ioengine=libaio
directory=/mnt/cephfs/fio_benchmark/4m/
numjobs=24
iodepth=4
group_reporting
rw=read
bs=4M
size=5G
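For reference, a run of this job would be launched on each fio client roughly as follows (assuming the job file above is saved as 4m-seq.fio):
fio 4m-seq.fio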
My osd node has only about 1GB of RAM per osd, so if bluestore_cache_data uses too much memory the osd gets killed by the OOM killer.
Updated by Sage Weil over 6 years ago
- Status changed from New to Need More Info
Writes that are in flight to disk show up under bluestore_cache_data, so even if it is not caching anything you'll still see that value be non-zero. This should only happen while under load, however.
What is your memory limit set to? If your memory limits are that tight you probably need to adjust osd_client_message_size_cap, the total amount of memory we are allowed to consume with incoming in-flight requests (default is 500 MB!).
As far as bluestore_cache_data is concerned, I think there is only a bug if the value stays non-zero when the osd goes idle...
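As a sketch, the cap can also be changed on a running osd through the admin socket (the osd id and the 64MB value here are just illustrative):
ceph daemon osd.0 config set osd_client_message_size_cap 67108864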
Updated by frank lin over 6 years ago
Sage Weil wrote:
Writes that are in flight to disk show up under bluestore_cache_data, so even if it is not caching anything you'll still see that value be non-zero. This should only happen while under load, however.
What is your memory limit set to? If your memory limits are that tight you probably need to adjust osd_client_message_size_cap, the total amount of memory we are allowed to consume with incoming in-flight requests (default is 500 MB!).
As far as bluestore_cache_data is concerned, I think there is only a bug if the value stays non-zero when the osd goes idle...
About "Writes that are in flight to disk show up under bluestore_cache_data", when bluestore_cache_data took more than 700MB ram, I was running 100% read throughput test with zero write.
The total read throughput was about 2.4GB/s for my 48 osds cluster, but some of the osd would get oom killed in minutes if I ran my cluster at that throughput.
When I ran write throughput test the bluestore_cache_data took only 0-20MB ram.
I also found bluestore_cache_data will take more ram as the reading throughput increased if I keep the read throughput capped the situation would be much better.
PS:
I did set osd_client_message_size_cap to 64MB (but I haven't found any detailed information about this setting. Is it a cache for incoming data? Is it accounted for in the mempool? And I haven't noticed that setting osd_client_message_size_cap to 64MB reduced memory usage.)
My memory limit is 1GB per osd. I know it is very small, but I am running ceph on an ARM server with only 2GB of memory.
Any advice on reducing the memory usage would be highly appreciated.
Thanks
Updated by frank lin over 6 years ago
One more fact of my test to add.
I have 48 osds in the test, and only a few of the osds' bluestore_cache_data went up to about 200-700MB. The rest of the osds are at about 10-30MB.
Updated by Sage Weil over 6 years ago
Two things to try:
- bluestore_default_buffered_read = false should make the problem go away, but is more of a workaround
- bluestore_cache_trim_interval = .05 should reduce the usage (default is .2 = trim every 200ms)... I think this is the actual problem!
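In ceph.conf these two options would look roughly like the following (values as suggested above):
[osd]
bluestore_default_buffered_read = false
bluestore_cache_trim_interval = 0.05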
Updated by frank lin over 6 years ago
Sage Weil wrote:
Two things to try:
- bluestore_default_buffered_read = false should make the problem go away, but is more of a workaround
- bluestore_cache_trim_interval = .05 should reduce the usage (default is .2 = trim every 200ms)... I think this is the actual problem!
Thanks for the advice.
I have tried setting bluestore_cache_trim_interval to 0.05 and it did lower the memory use, but if the read load is high enough some osds still use more than 500MB in bluestore_cache_data.
Is there an explanation for why only some osds use much more memory than others?
I will try bluestore_default_buffered_read = false.
BTW:
Is there any advice on further reducing memory usage so that one osd can work within 1GB without OOM problems?
Updated by frank lin over 6 years ago
I did some tests with bluestore_default_buffered_read = false.
bluestore_cache_data now only uses a few KB of memory, but buffer_anon now uses around 500MB (this happens only on some of the osds; the osds that use more buffer_anon are the same ones that used more bluestore_cache_data).
Updated by Sage Weil about 6 years ago
- Status changed from Need More Info to 12
Ok, I think the thing to do here is make the bluestore trimming a bit more frequent, and have this as a known caveat for very low-memory situations.
Or, perhaps, we could make the read buffered option change if the cache size is below some threshold...
Updated by Sage Weil about 6 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai about 6 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to luminous
Updated by Nathan Cutler about 6 years ago
- Copied to Backport #23226: luminous: bluestore_cache_data uses too much memory added
Updated by Nathan Cutler about 6 years ago
- Status changed from Pending Backport to Resolved