Bug #56488

pacific doesn't defer small writes for pre-pacific hdd osds

Added by Dan van der Ster 5 months ago. Updated 4 months ago.

Status: Fix Under Review
Priority: High
Assignee: -
Target version: -
% Done: 0%
Backport: quincy, pacific
Regression: No
Severity: 2 - major

Description

We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something is very wrong with deferred writes in pacific.
I attached a plot from an example cluster, upgraded today.

The OSDs are 12TB HDDs, formatted in nautilus with the default bluestore_min_alloc_size_hdd = 64kB, and each has a large flash block.db.

I found that the performance regression occurs because, with the default pacific config, 4kB writes are no longer deferred from these pre-pacific HDDs to flash.
Here are example bench writes from both releases: https://pastebin.com/raw/m0yL1H9Z

I worked out that the issue is fixed if I set bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default; note the default was 32k in octopus).

I think this is related to the fixes in #52089 which landed in 16.2.6 -- _do_alloc_write is now comparing the prealloc size 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than" condition prevents deferred writes from ever happening.

So I think this would impact anyone upgrading clusters with hdd/ssd mixed osds.

Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB or is there in fact a bug here?
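The suspected condition can be sketched as follows. This is an illustrative model, not Ceph source: `would_defer` is a hypothetical function standing in for the "strictly less than" comparison described above, on the assumption that a small write is padded up to the min_alloc_size-sized preallocation before the check in _do_alloc_write.

```python
# Model (not Ceph code) of the deferral check described in this report.

MIN_ALLOC_SIZE_HDD_PRE_PACIFIC = 64 * 1024  # 0x10000, nautilus-era default

def would_defer(prealloc_size: int, prefer_deferred_size: int) -> bool:
    """A write takes the deferred path only if its preallocation is
    strictly smaller than the prefer_deferred threshold."""
    return prealloc_size < prefer_deferred_size

# Pacific default: 64k threshold vs 64k prealloc -> never deferred.
print(would_defer(MIN_ALLOC_SIZE_HDD_PRE_PACIFIC, 64 * 1024))   # -> False
# Workaround from this report: raise the threshold to 128k.
print(would_defer(MIN_ALLOC_SIZE_HDD_PRE_PACIFIC, 128 * 1024))  # -> True
```

Under this model, any cluster whose min_alloc_size equals bluestore_prefer_deferred_size_hdd would stop deferring small writes entirely.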

image (1).png - latency increase after upgrade to pacific (126 KB) Dan van der Ster, 07/07/2022 08:27 AM

History

#1 Updated by Igor Fedotov 5 months ago

  • Project changed from RADOS to bluestore

#2 Updated by Igor Fedotov 5 months ago

  • Backport set to quincy, pacific

#3 Updated by Adam Kupczyk 5 months ago

There are two configurables to consider in the deferred writes logic:
- bluestore_prefer_deferred_size ("deferred_size" below)
- bluestore_max_blob_size ("blob_size" below)

PR https://github.com/ceph/ceph/pull/42725 "make deferred writes less aggressive for large writes"
fixed two deficiencies in the code:
1) When the write size was exactly the same as deferred_size, a deferred write was triggered.
This was inconsistent with the config parameter description, which says "smaller than this size".
2) When blob_size <= deferred_size, EVERY write went through the deferred write mechanism,
because the deferred check was applied after the data had been split into blobs.
So with the default hdd settings blob_size=64K, deferred_size=64K, all data went through the deferred path.

It is possible that the reported effect is actually BlueStore working properly.

I guess it is perfectly legal to set deferred_size to 512K, or to any integer value, like 65537 (if one wants 64K writes to be executed as deferred writes).

I do not think this is a bug.
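The interaction of the two configurables can be sketched as below. This is a simplified model under stated assumptions, not Ceph code: `split_into_blobs` and `deferred_blobs` are hypothetical helpers modeling "split to blobs first, then apply the deferred check per blob", with `strict=True` standing for the post-PR#42725 "strictly less than" check and `strict=False` for the older behaviour.

```python
# Illustrative model of blob splitting followed by the per-blob deferred check.

def split_into_blobs(write_len: int, blob_size: int) -> list:
    """Split a write into blob-sized chunks (the last chunk may be smaller)."""
    return [min(blob_size, write_len - off) for off in range(0, write_len, blob_size)]

def deferred_blobs(write_len: int, blob_size: int, deferred_size: int,
                   strict: bool = True) -> int:
    """Count blobs that would take the deferred path."""
    blobs = split_into_blobs(write_len, blob_size)
    if strict:
        return sum(1 for b in blobs if b < deferred_size)   # post-fix: '<'
    return sum(1 for b in blobs if b <= deferred_size)      # pre-fix: '<='

BLOB = DEFER = 64 * 1024
# Old behaviour: every 64K blob of a 1 MB write was deferred.
print(deferred_blobs(1024 * 1024, BLOB, DEFER, strict=False))  # -> 16
# New behaviour: none are.
print(deferred_blobs(1024 * 1024, BLOB, DEFER, strict=True))   # -> 0
# Setting deferred_size to 65537 makes 64K blobs deferred again.
print(deferred_blobs(1024 * 1024, BLOB, 65537, strict=True))   # -> 16
```

This illustrates why the odd-looking value 65537 is a legitimate way to restore deferral for exactly-64K blobs under the strict comparison.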

#4 Updated by Igor Fedotov 4 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 47241

#5 Updated by Gilles Mocellin 4 months ago

Increasing bluestore_prefer_deferred_size_hdd to 128k and even 512k helped us with slow ops and recurring high latencies on disks.
See https://tracker.ceph.com/issues/56733

#6 Updated by Igor Fedotov 4 months ago

Gilles Mocellin wrote:

Increasing bluestore_prefer_deferred_size_hdd to 128k and even 512k helped us with slow ops and recurring high latencies on disks.
See https://tracker.ceph.com/issues/56733

Gilles, may I ask you to try setting this parameter to 65537 and confirm that it avoids the high latencies as well? If so, I can say your case is similar to Dan's.

#7 Updated by Gilles Mocellin 4 months ago

I tried it for a couple of hours, but it was worse than with 512k: latency plateaued and IOPS dropped, so I rolled back.
As a side effect, my 3 MGRs dropped out of the cluster. The services and processes were still up, but I had to restart them to bring the MGRs, and the dashboard, back.

#8 Updated by Gilles Mocellin 4 months ago

This morning I have:
PG_NOT_DEEP_SCRUBBED: 11 pgs not deep-scrubbed in time
I never had this before Pacific.

Could those scrubs/deep scrubs be what creates the latency? And are they slower than before?
