Bug #22616
bluestore_cache_data uses too much memory
Status: Closed
Description
I was running a read throughput test and found that some of my osds were killed by the OOM killer and restarted.
I found that the OOM-killed osds used much more memory for bluestore_cache_data than the normal ones.
The OOM-killed osd used 795MB of RAM in the mempool and 722MB in bluestore_cache_data.
A normal osd used about 120MB of RAM in the mempool and 17MB in bluestore_cache_data.
Graph of memory usage of the OOM-killed osd: https://pasteboard.co/H1GzihS.png
Graph of memory usage of the normal osd: https://pasteboard.co/H1GzaeF.png
My bluestore cache settings:
[osd]
osd max backfills = 4
bluestore_cache_size = 134217728
bluestore_cache_kv_max = 134217728
osd client message size cap = 67108864
As far as I know, if I use the default cache ratio settings, no portion of the cache should go into bluestore_cache_data, but the mempool dump data shows otherwise.
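For reference, the per-pool numbers above come from the osd admin socket; a minimal way to check them on a running osd (osd.0 here is just an example id) is:
ceph daemon osd.0 dump_mempools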
Updated by Patrick Donnelly over 6 years ago
- Project changed from Ceph to bluestore
Updated by frank lin over 6 years ago
The workload for the read throughput test is 6 fio servers with the following parameters:
[4m-seq]
description="4m-seq-read"
direct=1
ioengine=libaio
directory=/mnt/cephfs/fio_benchmark/4m/
numjobs=24
iodepth=4
group_reporting
rw=read
bs=4M
size=5G
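For reference, a run of this job would be launched on each fio client roughly as follows (assuming the job file above is saved as 4m-seq.fio):
fio 4m-seq.fio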
My osd node has only about 1GB of RAM per osd, so if bluestore_cache_data uses too much memory the osd gets killed by the OOM killer.
Updated by Sage Weil over 6 years ago
- Status changed from New to Need More Info
Writes that are in flight to disk show up under bluestore_cache_data, so even if it is not caching anything you'll still see that value be non-zero. This should only happen while under load, however.
What is your memory limit set to? If your memory limits are that tight you probably need to adjust osd_client_message_size_cap, the total amount of memory we are allowed to consume with incoming in-flight requests (default is 500 MB!).
As far as bluestore_cache_data is concerned, I think there is only a bug if the value stays non-zero when the osd goes idle...
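As a sketch, the cap can also be changed on a running osd through the admin socket (the osd id and the 64MB value here are just illustrative):
ceph daemon osd.0 config set osd_client_message_size_cap 67108864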
Updated by frank lin over 6 years ago
Sage Weil wrote:
Writes that are in flight to disk show up under bluestore_cache_data, so even if it is not caching anything you'll still see that value be non-zero. This should only happen while under load, however.
What is your memory limit set to? If your memory limits are that tight you probably need to adjust osd_client_message_size_cap, the total amount of memory we are allowed to consume with incoming in-flight requests (default is 500 MB!).
As far as bluestore_cache_data is concerned, I think there is only a bug if the value stays non-zero when the osd goes idle...
About "Writes that are in flight to disk show up under bluestore_cache_data", when bluestore_cache_data took more than 700MB ram, I was running 100% read throughput test with zero write.
The total read throughput was about 2.4GB/s for my 48 osds cluster, but some of the osd would get oom killed in minutes if I ran my cluster at that throughput.
When I ran write throughput test the bluestore_cache_data took only 0-20MB ram.
I also found bluestore_cache_data will take more ram as the reading throughput increased if I keep the read throughput capped the situation would be much better.
PS:
I did set osd_client_message_size_cap to 64MB (but I haven't found any detailed information about this setting. Is it a cache for incoming data? Is it accounted for in the mempool? And I haven't noticed that setting osd_client_message_size_cap to 64MB reduced memory usage.)
My memory limit is 1GB per osd. I know it is very small, but I am running ceph on an ARM server with only 2GB of memory.
Any advice on reducing the memory usage would be highly appreciated.
Thanks
Updated by frank lin over 6 years ago
One more fact of my test to add.
I have 48 osds in the test, and only a few of the osds' bluestore_cache_data went up to about 200-700MB. The rest of the osds are at about 10-30MB.
Updated by Sage Weil over 6 years ago
Two things to try:
- bluestore_default_buffered_read = false should make the problem go away, but is more of a workaround
- bluestore_cache_trim_interval = .05 should reduce the usage (default is .2 = trim every 200ms)... I think this is the actual problem!
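In ceph.conf these two options would look roughly like the following (values as suggested above):
[osd]
bluestore_default_buffered_read = false
bluestore_cache_trim_interval = 0.05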
Updated by frank lin over 6 years ago
Sage Weil wrote:
Two things to try:
- bluestore_default_buffered_read = false should make the problem go away, but is more of a workaround
- bluestore_cache_trim_interval = .05 should reduce the usage (default is .2 = trim every 200ms)... I think this is the actual problem!
Thanks for the advice.
I have tried setting bluestore_cache_trim_interval to 0.05 and it did lower the memory use, but if the read load is high enough some osds still use more than 500MB in bluestore_cache_data.
Is there an explanation for why only some osds use much more memory than others?
I will try bluestore_default_buffered_read = false.
BTW:
Is there any advice on further reducing memory usage so that one osd can work within 1GB without OOM problems?
Updated by frank lin over 6 years ago
I did some tests with bluestore_default_buffered_read = false.
bluestore_cache_data now only uses a few KB of memory, but buffer_anon now uses around 500MB (this happens only on some of the osds; the osds that use more buffer_anon are the same ones that used more bluestore_cache_data).
Updated by Sage Weil about 6 years ago
- Status changed from Need More Info to 12
Ok, I think the thing to do here is make the bluestore trimming a bit more frequent, and have this as a known caveat for very low-memory situations.
Or, perhaps, we could make the read buffered option change if the cache size is below some threshold...
Updated by Sage Weil about 6 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai about 6 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to luminous
Updated by Nathan Cutler about 6 years ago
- Copied to Backport #23226: luminous: bluestore_cache_data uses too much memory added
Updated by Nathan Cutler about 6 years ago
- Status changed from Pending Backport to Resolved