Bug #38489

closed

bluestore_prefer_deferred_size_hdd units are not clear

Added by Марк Коренберг about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I did an experiment. I created a pool with a single PG and size 1. Then I ran this command:

rados bench -p qwe -b 4M -o 4M -t 1 30 write

and monitored what happened using iostat.

I tried different values using this command:

ceph tell osd.14 injectargs '--bluestore_prefer_deferred_size_hdd XXX'

I found that the threshold for writing/not writing to RocksDB (the WAL?) is 524288.

So, with the benchmark command above, all 4MB writes go directly to the HDD when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288. This threshold is exactly 8 times smaller than 4 MB.

I think this is definitely a bug.
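
A minimal reproduction sketch of the above, assuming the pool name qwe and OSD id 14 from the report; the pool-creation and config-get lines are illustrative additions rather than part of the original test:

ceph osd pool create qwe 1 1                      # single PG, as in the report
ceph osd pool set qwe size 1                      # newer releases may require an extra confirmation flag
rados bench -p qwe -b 4M -o 4M -t 1 30 write      # 4 MB objects, one thread, 30 seconds
ceph tell osd.14 injectargs '--bluestore_prefer_deferred_size_hdd 524288'
ceph daemon osd.14 config get bluestore_prefer_deferred_size_hdd   # run on the OSD host to confirm the active value
iostat -x 1                                       # watch whether writes hit the HDD directly or go via the WAL device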

Actions #1

Updated by Brad Hubbard about 5 years ago

  • Project changed from Ceph to bluestore
Actions #2

Updated by Sage Weil about 5 years ago

  • Status changed from New to 4
  • Priority changed from Normal to High

all 4MB writes go directly to the HDD when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288

Yes. The reason is because the max blob size is 512k by default. And whether to defer a write or not depends on the blob size. So anything > 512k (or multiples of 512k) will result in 512k blobs, and if the prefer_deferred is > 512k everything will go to the wal.

For that reason I don't think there is ever a reason to set prefer_deferred_size above the max blob size. In general, we'd make this more like 128k or something.

Should we make it so that we implicitly cap prefer_deferred_size at the max blob size?
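
For reference, both settings can be checked on a running OSD; osd.14 is the OSD from the report, and these inspection commands are an illustrative addition:

ceph daemon osd.14 config get bluestore_max_blob_size_hdd          # 524288 (512 KiB) by default
ceph daemon osd.14 config get bluestore_prefer_deferred_size_hdd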

Actions #3

Updated by Марк Коренберг about 5 years ago

Sage Weil wrote:

all 4MB writes go directly to the HDD when bluestore_prefer_deferred_size_hdd < 524288, and through the SSD (i.e. as deferred writes) when it is >= 524288

Yes. The reason is because the max blob size is 512k by default. And whether to defer a write or not depends on the blob size. So anything > 512k (or multiples of 512k) will result in 512k blobs, and if the prefer_deferred is > 512k everything will go to the wal.

For that reason I don't think there is ever a reason to set prefer_deferred_size above the max blob size. In general, we'd make this more like 128k or something.

Should we make it so that we implicitly cap prefer_deferred_size at the max blob size?

I don't know. At the very least, the connection between this parameter and the blob size must be documented.

I just wanted Bluestore to defer all writes (yes, even 4MB blocks). I want write behavior just like Filestore with its journal on SSD. I have very fast SSDs and slow HDDs, so Filestore is currently better for me for writes, since my writes are very bursty.

Actions #4

Updated by Igor Fedotov about 5 years ago

I've just verified the deferred write behavior for 4M writes using the objectstore FIO plugin.
Indeed, bluestore splits writes according to max_blob_size, so a single 4M user write turns into 8x512K writes.
Each 512K write is then checked against prefer_deferred_size to decide whether the deferred write procedure should be applied.

Having a cap for prefer_deferred_size might help a bit, but it's definitely not a complete solution. IMO one has to learn all this machinery in detail before attempting such tuning.
So this is really a matter of detailed documentation on the write path internals...
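
Concretely, with the 512 KiB default blob size: 4194304 / 524288 = 8 blobs per 4 MiB write, and each 512 KiB blob is tested against prefer_deferred_size on its own, so the whole write ends up deferred only when the threshold is at least 524288, which is exactly the boundary observed in the original report.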

Actions #5

Updated by Vitaliy Filippov about 5 years ago

I've just tried to set

[osd]
bluestore_prefer_deferred_size_hdd = 4194304

on a test HDD plugged into my laptop. Then I ran

ceph daemon osd.0 perf reset
fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=4 -rw=write -pool=bench -rbdname=testimg
ceph daemon osd.0 perf dump|grep bluestore_write_

and got the following:

"bluestore_write_big": 109,
"bluestore_write_big_bytes": 457179136,
"bluestore_write_big_blobs": 872,
"bluestore_write_small": 0,
"bluestore_write_small_bytes": 0,
"bluestore_write_small_unused": 0,
"bluestore_write_small_deferred": 0,
"bluestore_write_small_pre_read": 0,
"bluestore_write_small_new": 0,

The OSD is NOT deferring big writes. Why?

Actions #6

Updated by Vitaliy Filippov about 5 years ago

Forgot to mention, this was Ceph 14.1.0.

Actions #7

Updated by Sage Weil about 5 years ago

It's not deferring because, at the layer where deferring happens, we're dealing with blobs (not writes), and the blobs are capped at 512 KiB due to bluestore_max_blob_size.

Actions #8

Updated by Vitaliy Filippov about 5 years ago

So that's why write_big operations may also be deferred, just like write_small ones. OK, thank you very much, it's clear now.

Actions #9

Updated by Neha Ojha about 5 years ago

  • Status changed from 4 to Resolved
Actions #10

Updated by Марк Коренберг about 5 years ago

But wait! The documentation is still not fixed!
