Bug #38489
closed
bluestore_prefer_deferred_size_hdd units are not clear
Added by Марк Коренберг about 5 years ago.
Updated about 5 years ago.
Description
I ran an experiment: I created a pool with a single PG and size 1, then ran this command:
rados bench -p qwe -b 4M -o 4M -t 1 30 write
and monitored what happened using iostat.
I tried different values with this command:
ceph tell osd.14 injectargs '--bluestore_prefer_deferred_size_hdd XXX'
And I realized that the threshold for writing (or not writing) through RocksDB (the WAL?) is 524288.
So, with the benchmark command above, all 4 MB writes go directly to the HDD when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288. This threshold is exactly 8 times smaller than 4 MB.
I think this is definitely a bug.
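Roughly, the probe looked like this (a sketch of the steps above; osd.14 is the OSD backing the test pool, and the threshold value was varied across runs):
# Vary the threshold across runs (value in bytes):
ceph tell osd.14 injectargs '--bluestore_prefer_deferred_size_hdd 524288'
# Generate 4 MB writes against the single-PG pool:
rados bench -p qwe -b 4M -o 4M -t 1 30 write
# Meanwhile, watch which device absorbs the writes:
iostat -x 1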
- Project changed from Ceph to bluestore
- Status changed from New to 4
- Priority changed from Normal to High
all 4 MB writes go directly to the HDD when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288
Yes. The reason is that the max blob size is 512k by default, and whether a write is deferred or not depends on the blob size. So anything > 512k (or a multiple of 512k) results in 512k blobs, and if prefer_deferred is at or above 512k, everything goes to the WAL.
For that reason I don't think there is ever a reason to set the deferred threshold above the max blob size. In general, we'd make this more like 128k or so.
Should we implicitly cap prefer_deferred at the max blob size?
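Both knobs can be inspected and adjusted at runtime; a minimal sketch, assuming osd.0 is an HDD-backed OSD with an accessible admin socket:
# Blobs are capped at bluestore_max_blob_size_hdd (512 KiB by default),
# so this is the size the deferred-write check actually sees:
ceph daemon osd.0 config get bluestore_max_blob_size_hdd
# Setting the threshold at or above the blob size (as observed above,
# >= 524288) routes every blob of a big write through the WAL:
ceph tell osd.0 injectargs '--bluestore_prefer_deferred_size_hdd 524288'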
Sage Weil wrote:
Should we implicitly cap prefer_deferred at the max blob size?
I don't know. At the least, the connection between this parameter and the blob size must be documented.
I just wanted BlueStore to defer all writes (yes, even 4 MB blocks). I want write behavior just like Filestore with its journal on an SSD: I have very fast SSDs and slow HDDs, so Filestore is currently better for me for writes, since my writes are very bursty.
I've just verified the deferred-write behavior for 4M writes using the objectstore FIO plugin.
Indeed, BlueStore splits writes according to max_blob_size, hence one gets 8 x 512K writes for a single 4M user write.
Each 512K write is then checked against prefer_deferred_size to decide whether the deferred-write procedure applies.
Having a cap on prefer_deferred_size might help a bit, but it's definitely not a complete solution. IMO one has to learn all this machinery in detail before attempting such tuning.
So this is a matter of detailed documentation on the write internals...
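The 8x split is plain arithmetic, and it also shows up in the perf counters quoted below (a quick sanity check, assuming the default 512 KiB HDD blob size):
# 4 MiB / 512 KiB = 8 blobs per big write:
echo $((4194304 / 524288))
# which matches the counter ratio below: 872 blobs / 109 big writes = 8.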
I've just tried to set
[osd]
bluestore_prefer_deferred_size_hdd = 4194304
on a test HDD plugged into my laptop. Then I ran:
ceph daemon osd.0 perf reset
fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=4 -rw=write -pool=bench -rbdname=testimg
ceph daemon osd.0 perf dump|grep bluestore_write_
And got the following:
"bluestore_write_big": 109,
"bluestore_write_big_bytes": 457179136,
"bluestore_write_big_blobs": 872,
"bluestore_write_small": 0,
"bluestore_write_small_bytes": 0,
"bluestore_write_small_unused": 0,
"bluestore_write_small_deferred": 0,
"bluestore_write_small_pre_read": 0,
"bluestore_write_small_new": 0,
The OSD is NOT deferring big writes. Why?
Forgot to mention: this was Ceph 14.1.0.
It's not deferring because, at the layer where deferring happens, we're talking about blobs (not writes), and blobs are capped at 512 KiB by bluestore_max_blob_size.
So that's why write_big operations may also be deferred, just like write_small ones. OK, thank you very much, it's clear now.
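For anyone landing here with the same goal, the takeaway as a config sketch (threshold semantics as observed above; verify against your own Ceph version):
[osd]
# Writes are split into blobs of at most bluestore_max_blob_size_hdd
# (512 KiB by default), and each blob is tested against this threshold,
# so 524288 is already enough to defer even 4 MB client writes:
bluestore_prefer_deferred_size_hdd = 524288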
- Status changed from 4 to Resolved
But wait! The documentation is still not fixed!