Bug #38559 (closed): 50-100% iops lost due to bluefs_preextend_wal_files = false

Added by Vitaliy Filippov about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Backport: mimic, nautilus, luminous
Regression: No
Severity: 1 - critical

Description

Hi.

I was investigating why RocksDB performance is so bad for random 4K iops. I was looking at strace and one thing caught my eye: the OSD was doing TWO transactions for each small (deferred) incoming write. In both mimic and nautilus the strace output looks like the following:

There are groups of 5 operations repeated by `bstore_kv_sync`:

  • pwritev(8-12 kb, offset=1440402997248), offsets always increase
  • sync_file_range(just written 8-12 kb)
  • fdatasync()
  • io_submit(op=pwritev, iov_len=4 kb, aio_offset=1455403167744) - offsets differ from first step, but also increase by 4 kb with each write
  • fdatasync()

After every 64 such groups there come some io_submit calls from `bstore_kv_final` - this is obviously the application of deferred writes, and the first pwritev is obviously the RocksDB WAL.

But what's the remaining io_submit?

It is BlueFS's own WAL! And all it seems to do is increase the size of the RocksDB WAL file and change its modification time. So we again have a "journaling of the journal"-like issue, as in the old days with filestore :)
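To make the pattern concrete, here is a minimal sketch (this is not Ceph code; the file names, sizes and offsets are invented for illustration) that issues the same five syscalls per commit, so the doubled fdatasync can be observed under strace:

```c
/*
 * Schematic reproduction of the per-commit syscall pattern described above.
 * NOT Ceph code: two ordinary files stand in for the RocksDB WAL extent and
 * the BlueFS log extent.  Build (assumption): gcc -O2 -o wal_pattern wal_pattern.c -laio
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    /* Stand-ins for the RocksDB WAL and the BlueFS log. */
    int rocksdb_wal = open("rocksdb_wal.bin", O_RDWR | O_CREAT, 0644);
    int bluefs_log  = open("bluefs_log.bin",  O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (rocksdb_wal < 0 || bluefs_log < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(16, &ctx) < 0) { perror("io_setup"); return 1; }

    /* 4 KiB aligned buffer for the O_DIRECT aio write. */
    void *aio_buf;
    if (posix_memalign(&aio_buf, 4096, 4096)) return 1;
    memset(aio_buf, 0xbf, 4096);

    char kv_buf[8192];                     /* ~8-12 KB RocksDB WAL record */
    memset(kv_buf, 0xdb, sizeof(kv_buf));
    struct iovec iov = { .iov_base = kv_buf, .iov_len = sizeof(kv_buf) };

    off_t kv_off = 0, log_off = 0;
    for (int i = 0; i < 64; i++) {         /* one bstore_kv_sync-style group per loop */
        /* 1. RocksDB WAL append (buffered pwritev, offsets always increase). */
        pwritev(rocksdb_wal, &iov, 1, kv_off);
        /* 2. + 3. flush the just-written range, then fdatasync. */
        sync_file_range(rocksdb_wal, kv_off, iov.iov_len,
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
        fdatasync(rocksdb_wal);
        kv_off += iov.iov_len;

        /* 4. The extra transaction: a 4 KiB aio pwritev to the BlueFS log,
         *    which in BlueStore only records the WAL file's new size/mtime. */
        struct iocb cb, *cbs[1] = { &cb };
        struct iovec aio_iov = { .iov_base = aio_buf, .iov_len = 4096 };
        io_prep_pwritev(&cb, bluefs_log, &aio_iov, 1, log_off);
        io_submit(ctx, 1, cbs);
        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);
        /* 5. Second fdatasync of the same commit. */
        fdatasync(bluefs_log);
        log_off += 4096;
    }

    io_destroy(ctx);
    close(rocksdb_wal);
    close(bluefs_log);
    return 0;
}
```

Run it under `strace -f` and you should see the same five-syscall groups listed above; every commit pays two fdatasyncs, and the second one is exactly what preextending the WAL files avoids, since BlueFS then no longer has to log the size change on every append.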

Then I found the "bluefs_preextend_wal_files" option, and yes, setting it to true disables this behaviour, and random IOPS increase by +50..+100% depending on the workload. But it corrupts RocksDB when the OSD is shut down uncleanly. That is https://tracker.ceph.com/issues/18338, which I easily reproduced by starting a single OSD locally and writing into it with "fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=16 -rw=write -pool=bench -rbdname=testimg".
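For anyone who wants to reproduce the measurement, a sketch of how the option would be set in ceph.conf (assuming the stock option name in your release) - given the corruption described above, only do this on a throwaway test cluster:

```ini
# Benchmark-only: skip the extra BlueFS log update per RocksDB WAL append.
# WARNING: with the unpatched code this corrupts RocksDB after an unclean
# OSD shutdown (see https://tracker.ceph.com/issues/18338).
[osd]
bluefs_preextend_wal_files = true
```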

I think this is REALLY ugly. Bluestore is just wasting 1/3 to 1/2 of its random iops performance. It must be fixed :)


Related issues 3 (0 open, 3 closed)

Copied to bluestore - Backport #40280: mimic: 50-100% iops lost due to bluefs_preextend_wal_files = false (Resolved, Nathan Cutler)
Copied to bluestore - Backport #40281: nautilus: 50-100% iops lost due to bluefs_preextend_wal_files = false (Resolved, Nathan Cutler)
Copied to bluestore - Backport #41510: luminous: 50-100% iops lost due to bluefs_preextend_wal_files = false (Resolved)