Bug #38489 (Closed): bluestore_prefer_deferred_size_hdd units are not clear
Description
I ran an experiment: I created a pool with a single PG of size 1, then ran this command:
rados bench -p qwe -b 4M -o 4M -t 1 30 write
and watched with iostat what happened.
I tried different values using command:
ceph tell osd.14 injectargs '--bluestore_prefer_deferred_size_hdd XXX'
I found that the threshold value for writing/not writing to RocksDB (the WAL?) is 524288.
So, with the benchmark command above, all 4 MB writes go to the HDD directly when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288. This threshold is exactly 8 times smaller than 4 MB.
I think this is definitely a bug.
Updated by Brad Hubbard about 5 years ago
- Project changed from Ceph to bluestore
Updated by Sage Weil about 5 years ago
- Status changed from New to 4
- Priority changed from Normal to High
all writes of size 4MB with bluestore_prefer_deferred_size_hdd < 524288 go HDD directly. >= 524288 through SSD (I mean deferred write)
Yes. The reason is that the max blob size is 512k by default, and whether or not to defer a write depends on the blob size. So anything > 512k (or a multiple of 512k) will result in 512k blobs, and if prefer_deferred is > 512k everything will go to the WAL.
For that reason I don't think there is ever a reason to set the deferred threshold above the max blob size. In general, we make this more like 128k or something.
Should we make it so that we implicitly cap prefer_deferred at the max blob size?
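The splitting and per-blob check described above can be sketched in a few lines (a toy model chosen to reproduce the reported 524288 boundary; the function names and the exact comparison operator are assumptions, not the actual BlueStore code):

```python
# A toy model of the behavior described: a big write is split into
# blobs of at most bluestore_max_blob_size, and each blob is checked
# against prefer_deferred_size. Names and the comparison operator are
# assumptions, not the real BlueStore implementation.

MAX_BLOB_SIZE = 512 * 1024  # bluestore_max_blob_size_hdd default

def split_into_blobs(write_len, max_blob=MAX_BLOB_SIZE):
    """Split a write into blob-sized chunks (a 4 MiB write -> 8 x 512 KiB)."""
    blobs = []
    while write_len > 0:
        chunk = min(write_len, max_blob)
        blobs.append(chunk)
        write_len -= chunk
    return blobs

def is_deferred(blob_len, prefer_deferred_size):
    # Chosen to match the reported boundary: with 512 KiB blobs,
    # deferral starts once the threshold reaches 524288.
    return blob_len <= prefer_deferred_size

blobs = split_into_blobs(4 * 1024 * 1024)
assert blobs == [524288] * 8
assert not is_deferred(blobs[0], 524287)  # below the boundary: direct to HDD
assert is_deferred(blobs[0], 524288)      # at/above the boundary: via the WAL
```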
Updated by Марк Коренберг about 5 years ago
Sage Weil wrote:
all writes of size 4MB with bluestore_prefer_deferred_size_hdd < 524288 go HDD directly. >= 524288 through SSD (I mean deferred write)
Yes. The reason is that the max blob size is 512k by default, and whether or not to defer a write depends on the blob size. So anything > 512k (or a multiple of 512k) will result in 512k blobs, and if prefer_deferred is > 512k everything will go to the WAL.
For that reason I don't think there is ever a reason to set the deferred threshold above the max blob size. In general, we make this more like 128k or something.
Should we make it so that we implicitly cap prefer_deferred at the max blob size?
I don't know. At least the connection between this parameter and the blob size must be documented.
I just wanted Bluestore to defer all writes (yes, even 4 MB blocks). I want write behavior just like Filestore with its journal on an SSD. I have very fast SSDs and slow HDDs, so Filestore is currently better for me for writes, since my writes are very bursty.
Updated by Igor Fedotov about 5 years ago
I've just verified the deferred-write behavior for 4M writes using the objectstore FIO plugin.
Indeed, bluestore splits writes according to max_blob_size, so a single 4M user write becomes 8x512K writes.
Each 512K write is then checked against prefer_deferred_size to decide whether the deferred-write procedure should be applied.
Having a cap for prefer_deferred_size might help a bit, but it's definitely not a complete solution. IMO one has to learn all this machinery in detail before attempting such tuning.
So that's a matter of detailed documentation on the writing internals...
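The cap discussed above would amount to clamping the configured threshold at the max blob size: since deferral is decided per blob and no blob exceeds max_blob_size, any threshold above the max blob size cannot change any per-blob decision. A sketch under that assumption (not actual BlueStore code):

```python
# Hypothetical illustration of the proposed implicit cap: a
# prefer_deferred_size above max_blob_size behaves identically to
# max_blob_size itself, because the check is applied per blob and
# blobs never exceed max_blob_size.
def effective_deferred_threshold(prefer_deferred_size, max_blob_size=512 * 1024):
    return min(prefer_deferred_size, max_blob_size)
```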
Updated by Vitaliy Filippov about 5 years ago
I've just tried to set
[osd]
bluestore_prefer_deferred_size_hdd = 4194304
on a test HDD plugged into my laptop. Then I ran:
ceph daemon osd.0 perf reset
fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=4 -rw=write -pool=bench -rbdname=testimg
ceph daemon osd.0 perf dump|grep bluestore_write_
And got the following:
"bluestore_write_big": 109,
"bluestore_write_big_bytes": 457179136,
"bluestore_write_big_blobs": 872,
"bluestore_write_small": 0,
"bluestore_write_small_bytes": 0,
"bluestore_write_small_unused": 0,
"bluestore_write_small_deferred": 0,
"bluestore_write_small_pre_read": 0,
"bluestore_write_small_new": 0,
OSD is NOT deferring big writes. Why?
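Incidentally, the counters quoted above are already consistent with the 512 KiB blob split described earlier in the thread: 457179136 bytes over 109 big writes is exactly 4 MiB each, and 872 blobs over 109 writes is 8 blobs per write:

```python
# Arithmetic check on the perf counters quoted above.
big_writes = 109
big_bytes = 457179136
big_blobs = 872

assert big_bytes == big_writes * 4 * 1024 * 1024  # every write is 4 MiB
assert big_blobs == big_writes * 8                # 8 blobs per 4 MiB write
assert big_bytes // big_blobs == 512 * 1024       # i.e. 512 KiB per blob
```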
Updated by Vitaliy Filippov about 5 years ago
Forgot to mention, this was Ceph 14.1.0
Updated by Sage Weil about 5 years ago
It's not deferring because, at the layer where deferring happens, we're talking about blobs (not writes), and the blobs are capped at 512 KiB due to bluestore_max_blob_size.
Updated by Vitaliy Filippov about 5 years ago
So that's why write_big operations may also be deferred, just like write_small ones. OK, thank you very much, it's clear now.
Updated by Марк Коренберг about 5 years ago
But wait, the documentation is still not fixed!