Bug #38489

closed

bluestore_prefer_deferred_size_hdd units are not clear

Added by Марк Коренберг about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I did an experiment. I created a pool with a single PG and size 1. Then I ran this command:

rados bench -p qwe -b 4M -o 4M -t 1 30 write

and monitored what happened using iostat.

I tried different values using this command:

ceph tell osd.14 injectargs '--bluestore_prefer_deferred_size_hdd XXX'

I found that the threshold for writing/not writing to RocksDB (the WAL?) is 524288.

So, with the benchmark command above, all 4MB writes go directly to the HDD when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288. This threshold is exactly 8 times smaller than 4 MB.

I think this is definitely a bug.
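
A minimal reproduction sketch of the above, assuming the pool name qwe and OSD id 14 from the report; the pool-creation and config-get lines are illustrative additions rather than part of the original test:

ceph osd pool create qwe 1 1                      # single PG, as in the report
ceph osd pool set qwe size 1                      # newer releases may require an extra confirmation flag
rados bench -p qwe -b 4M -o 4M -t 1 30 write      # 4 MB objects, one thread, 30 seconds
ceph tell osd.14 injectargs '--bluestore_prefer_deferred_size_hdd 524288'
ceph daemon osd.14 config get bluestore_prefer_deferred_size_hdd   # run on the OSD host to confirm the active value
iostat -x 1                                       # watch whether writes hit the HDD directly or go via the WAL device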

Actions #1

Updated by Brad Hubbard about 5 years ago

  • Project changed from Ceph to bluestore
Actions #2

Updated by Sage Weil about 5 years ago

  • Status changed from New to 4
  • Priority changed from Normal to High

all 4MB writes go directly to the HDD when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288

Yes. The reason is because the max blob size is 512k by default. And whether to defer a write or not depends on the blob size. So anything > 512k (or multiples of 512k) will result in 512k blobs, and if the prefer_deferred is > 512k everything will go to the wal.

For that reason I don't think there is ever a reason to set prefer_deferred_size above the max blob size. In general, we'd make this more like 128k or something.

Should we make it so that we implicitly cap prefer_deferred_size at the max blob size?
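
For reference, both settings can be checked on a running OSD; osd.14 is the OSD from the report, and these inspection commands are an illustrative addition:

ceph daemon osd.14 config get bluestore_max_blob_size_hdd          # 524288 (512 KiB) by default
ceph daemon osd.14 config get bluestore_prefer_deferred_size_hdd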

Actions #3

Updated by Марк Коренберг about 5 years ago

Sage Weil wrote:

all 4MB writes go directly to the HDD when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288

Yes. The reason is because the max blob size is 512k by default. And whether to defer a write or not depends on the blob size. So anything > 512k (or multiples of 512k) will result in 512k blobs, and if the prefer_deferred is > 512k everything will go to the wal.

For that reason I don't think there is ever a reason to set prefer_deferred_size above the max blob size. In general, we'd make this more like 128k or something.

Should we make it so that we implicitly cap prefer_deferred_size at the max blob size?

I don't know. At the very least, the connection between this parameter and the blob size must be documented.

I just wanted Bluestore to defer all writes (yes, even 4MB blocks). I want write behavior just like Filestore with its journal on SSD. I have very fast SSDs and slow HDDs, so Filestore is currently better for me for writes, since my writes are very bursty.

Actions #4

Updated by Igor Fedotov about 5 years ago

I've just verified the deferred write behavior for 4M writes using the objectstore FIO plugin.
Indeed, bluestore splits writes according to max_blob_size, so a single 4M user write turns into 8x512K writes.
Each 512K write is then checked against prefer_deferred_size to decide whether the deferred write procedure should be applied.

Having a cap for prefer_deferred_size might help a bit, but it's definitely not a complete solution. IMO one has to learn all this machinery in detail before attempting such tuning.
So this is really a matter of detailed documentation on the write path internals...
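
Concretely, with the 512 KiB default blob size: 4194304 / 524288 = 8 blobs per 4 MiB write, and each 512 KiB blob is tested against prefer_deferred_size on its own, so the whole write ends up deferred only when the threshold is at least 524288, which is exactly the boundary observed in the original report.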

Actions #5

Updated by Vitaliy Filippov about 5 years ago

I've just tried to set

[osd]
bluestore_prefer_deferred_size_hdd = 4194304

on a test HDD plugged into my laptop. Then I ran

ceph daemon osd.0 perf reset
fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=4 -rw=write -pool=bench -rbdname=testimg
ceph daemon osd.0 perf dump|grep bluestore_write_

and got the following:

"bluestore_write_big": 109,
"bluestore_write_big_bytes": 457179136,
"bluestore_write_big_blobs": 872,
"bluestore_write_small": 0,
"bluestore_write_small_bytes": 0,
"bluestore_write_small_unused": 0,
"bluestore_write_small_deferred": 0,
"bluestore_write_small_pre_read": 0,
"bluestore_write_small_new": 0,

The OSD is NOT deferring big writes. Why?

Actions #6

Updated by Vitaliy Filippov about 5 years ago

Forgot to mention, this was Ceph 14.1.0.

Actions #7

Updated by Sage Weil about 5 years ago

It's not deferring because, at the layer where deferring happens, we're dealing with blobs (not writes), and the blobs are capped at 512 KiB due to bluestore_max_blob_size.

Actions #8

Updated by Vitaliy Filippov about 5 years ago

So that's why write_big operations may also be deferred, just like write_small ones. OK, thank you very much, it's clear now.

Actions #9

Updated by Neha Ojha about 5 years ago

  • Status changed from 4 to Resolved
Actions #10

Updated by Марк Коренберг about 5 years ago

But wait! The documentation is still not fixed!
