Bug #22616

bluestore_cache_data uses too much memory

Added by frank lin over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version:
Start date: 01/08/2018
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I was running a read throughput test and found that some of my OSDs had been killed by the OOM killer and restarted.
The OOM-killed OSDs used much more memory for bluestore_cache_data than the normal ones.
An OOM-killed OSD used 795MB of RAM in mempools and 722MB in bluestore_cache_data.
A normal OSD used about 120MB of RAM in mempools and 17MB in bluestore_cache_data.

Graph of memory usage of the OOM-killed OSD: https://pasteboard.co/H1GzihS.png

Graph of memory usage of the normal OSD: https://pasteboard.co/H1GzaeF.png

My bluestore cache settings:

[osd]
osd max backfills = 4
bluestore_cache_size = 134217728
bluestore_cache_kv_max = 134217728
osd client message size cap = 67108864
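
For readability, the byte values in the settings above can be converted to MiB. This is only a small arithmetic sketch; the dictionary simply restates the three size-related options from the report, and the helper name `to_mib` is made up for illustration.

```python
# Convert the size-related values from the [osd] section into MiB.
MIB = 1024 * 1024

settings = {
    "bluestore_cache_size": 134217728,
    "bluestore_cache_kv_max": 134217728,
    "osd client message size cap": 67108864,
}


def to_mib(num_bytes: int) -> float:
    """Return a byte count expressed in MiB."""
    return num_bytes / MIB


for name, value in settings.items():
    print(f"{name} = {to_mib(value):.0f} MiB")
# bluestore_cache_size = 128 MiB
# bluestore_cache_kv_max = 128 MiB
# osd client message size cap = 64 MiB
```

So the configured cache is 128 MiB and the client message cap is 64 MiB, both well below the 722MB reported for bluestore_cache_data.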

As far as I know, with the default cache ratio settings no portion of the cache should go to bluestore_cache_data, but the mempool dump shows otherwise.
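
The mempool figures quoted above can be read out of the JSON printed by `ceph daemon osd.<id> dump_mempools`. The following is a minimal sketch, assuming the dump has already been captured as text. It assumes two possible layouts (pools at the top level, or nested under `mempool` / `by_pool`), and the sample document is hypothetical: its byte counts only mirror the numbers in this report, and the item counts are invented.

```python
import json


def pool_bytes(dump: dict, pool: str) -> int:
    """Return the 'bytes' value of one mempool from a parsed dump.

    Assumption: pools are either top-level keys or nested under
    dump["mempool"]["by_pool"], depending on the Ceph release.
    """
    pools = dump.get("mempool", {}).get("by_pool", dump)
    return pools[pool]["bytes"]


# Hypothetical sample resembling the OOM-killed OSD in this report.
sample = json.loads("""
{
  "bluestore_cache_data": {"items": 180, "bytes": 757071872},
  "total": {"items": 500000, "bytes": 833617920}
}
""")

data = pool_bytes(sample, "bluestore_cache_data")
total = pool_bytes(sample, "total")
print(f"bluestore_cache_data: {data / 2**20:.0f} MiB of {total / 2**20:.0f} MiB")
```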


Related issues

Copied to bluestore - Backport #23226: luminous: bluestore_cache_data uses too much memory Resolved

History

#1 Updated by Patrick Donnelly over 1 year ago

  • Project changed from Ceph to bluestore

#2 Updated by frank lin over 1 year ago

The workload for the read throughput test is 6 fio servers with the following parameters:

[4m-seq]
description="4m-seq-read"
direct=1
ioengine=libaio
directory=/mnt/cephfs/fio_benchmark/4m/
numjobs=24
iodepth=4
group_reporting
rw=read
bs=4M
size=5G

My OSD node has only about 1GB of RAM per OSD, so if bluestore_cache_data uses too much memory the OSD gets killed by the OOM killer.
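
As a rough sense of scale, the fio parameters above bound how much read data the clients can have outstanding at once. This is a back-of-the-envelope sketch, assuming each of the 6 fio servers runs one instance of this job and keeps its queues full. It describes client-side outstanding requests only and is not a measurement of OSD memory.

```python
# Upper bound on read data the fio clients can have outstanding at once.
MIB = 1024 * 1024

servers = 6        # fio servers, from the comment above
numjobs = 24       # numjobs=24
iodepth = 4        # iodepth=4
block_size = 4 * MIB  # bs=4M

outstanding_requests = servers * numjobs * iodepth
outstanding_bytes = outstanding_requests * block_size

print(f"{outstanding_requests} requests, {outstanding_bytes // MIB} MiB outstanding")
# 576 requests, 2304 MiB outstanding
```

How those requests spread across OSDs depends on data placement, which this sketch does not model.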

#3 Updated by Sage Weil over 1 year ago

  • Status changed from New to Need More Info

Writes that are in flight to disk show up under bluestore_cache_data, so even if it is not caching anything you'll still see that value be non-zero. This should only happen while under load, however.

What is your memory limit set to? If your memory limits are that tight you probably need to adjust osd_client_message_size_cap, the total amount of memory we are allowed to consume with incoming in-flight requests (default is 500 MB!).

As far as bluestore_cache_data is concerned, I think there is only a bug if the value stays non-zero when the osd goes idle...

#4 Updated by frank lin over 1 year ago

Sage Weil wrote:

Writes that are in flight to disk show up under bluestore_cache_data, so even if it is not caching anything you'll still see that value be non-zero. This should only happen while under load, however.

What is your memory limit set to? If your memory limits are that tight you probably need to adjust osd_client_message_size_cap, the total amount of memory we are allowed to consume with incoming in-flight requests (default is 500 MB!).

As far as bluestore_cache_data is concerned, I think there is only a bug if the value stays non-zero when the osd goes idle...

Regarding "Writes that are in flight to disk show up under bluestore_cache_data": when bluestore_cache_data took more than 700MB of RAM, I was running a 100% read throughput test with zero writes.

The total read throughput was about 2.4GB/s for my 48-OSD cluster, but some of the OSDs would get OOM-killed within minutes when I ran the cluster at that throughput.

When I ran a write throughput test, bluestore_cache_data took only 0-20MB of RAM.

I also found that bluestore_cache_data takes more RAM as the read throughput increases. If I keep the read throughput capped, the situation is much better.

PS:

I did set osd_client_message_size_cap to 64MB, but I haven't found any detailed information about this setting. Is it a cache for incoming data? Is it accounted for in a mempool? I also haven't noticed that setting osd_client_message_size_cap to 64MB reduced memory usage.

My memory limit is 1GB per OSD. I know that is very small, but I am running Ceph on ARM servers with only 2GB of memory.

I would highly appreciate any advice on reducing the memory usage.

Thanks

#5 Updated by frank lin over 1 year ago

One more fact about my test:
I have 48 OSDs in the test, and bluestore_cache_data went up to about 200-700MB on only a few of them. The rest of the OSDs are at about 10-30MB.
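
One way to pick out the affected OSDs is to compare each OSD's bluestore_cache_data against the median across the cluster. The sketch below is illustrative only: the function name `outliers`, the threshold factor of 5, and the per-OSD numbers are all hypothetical, chosen to resemble the ranges described above.

```python
from statistics import median
from typing import Dict, List

MIB = 1024 * 1024


def outliers(usage: Dict[int, int], factor: float = 5.0) -> List[int]:
    """Return OSD ids whose usage exceeds `factor` times the median usage."""
    typical = median(usage.values())
    return sorted(osd for osd, used in usage.items() if used > factor * typical)


# Hypothetical bluestore_cache_data bytes per OSD id.
usage = {
    0: 20 * MIB,
    1: 15 * MIB,
    2: 700 * MIB,
    3: 25 * MIB,
    4: 250 * MIB,
    5: 10 * MIB,
}

print(outliers(usage))
# [2, 4]
```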

#6 Updated by Sage Weil over 1 year ago

Two things to try:

- bluestore_default_buffered_read = false should make the problem go away, but is more of a workaround

- bluestore_cache_trim_interval = .05 should reduce the usage (default is .2 = trim every 200ms). I think this is the actual problem!
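
To see why the trim interval matters, one can estimate how much buffered read data arrives between two trims. This is a deliberately simple sketch, assuming the cache grows only by incoming read data and is brought back to its target at every trim. The average per-OSD rate is derived from the 2.4GB/s over 48 OSDs mentioned in #4, and the function name is made up for illustration.

```python
def mb_between_trims(read_rate_mb_s: float, trim_interval_s: float) -> float:
    """Data (in MB) that arrives during one trim interval at a given read rate."""
    return read_rate_mb_s * trim_interval_s


# Average per-OSD read rate: about 2.4 GB/s spread over 48 OSDs.
avg_rate_mb_s = 2400 / 48

for interval in (0.2, 0.05):
    grown = mb_between_trims(avg_rate_mb_s, interval)
    print(f"interval {interval}s: {1 / interval:.0f} trims/s, ~{grown:.1f} MB between trims")
```

At the cluster-average rate this gives roughly 10 MB per interval at the default and 2.5 MB at .05, far below the 200-700MB seen on a few OSDs, so this average-based model alone does not account for the outliers.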

#7 Updated by frank lin over 1 year ago

Sage Weil wrote:

Two things to try:

- bluestore_default_buffered_read = false should make the problem go away, but is more of a workaround

- bluestore_cache_trim_interval = .05 should reduce the usage (default is .2 = trim every 200ms). I think this is the actual problem!

Thanks for the advice.
I set bluestore_cache_trim_interval to 0.05 and it did lower the memory use, but if the read load is high enough some OSDs still use more than 500MB in bluestore_cache_data.

Is there an explanation for why only some OSDs use much more memory than others?

I will try bluestore_default_buffered_read = false.

BTW:
Is there any advice for further reducing memory usage so that 1GB per OSD works without OOM problems?

#8 Updated by frank lin over 1 year ago

I did some tests with bluestore_default_buffered_read = false.

bluestore_cache_data now uses only a few KB of memory, but buffer_anon now uses around 500MB. This happens only on some of the OSDs; the OSDs that use more buffer_anon are the same ones that used more bluestore_cache_data.

#9 Updated by Sage Weil over 1 year ago

  • Status changed from Need More Info to Verified

Ok, I think the thing to do here is make the bluestore trimming a bit more frequent, and have this as a known caveat for very low-memory situations.

Or, perhaps, we could make the read buffered option change if the cache size is below some threshold...

#10 Updated by Sage Weil over 1 year ago

  • Status changed from Verified to Need Review

#11 Updated by Kefu Chai over 1 year ago

  • Status changed from Need Review to Pending Backport
  • Backport set to luminous

#12 Updated by Nathan Cutler over 1 year ago

  • Copied to Backport #23226: luminous: bluestore_cache_data uses too much memory added

#13 Updated by Nathan Cutler over 1 year ago

  • Status changed from Pending Backport to Resolved
