Bug #38559: 50-100% iops lost due to bluefs_preextend_wal_files = false - bluestore - Ceph

Actions

Copy link

Bug #38559

closed

50-100% iops lost due to bluefs_preextend_wal_files = false

Added by Vitaliy Filippov about 5 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Target version:

% Done:

Source:

Tags:

Backport:

mimic, nautilus, luminous

Regression:

Severity:

1 - critical

Reviewed:

Affected Versions:

Ceph - v13.2.4, Ceph - v13.2.5, Ceph - v14.0.0

ceph-qa-suite:

Pull request ID:

26909

Crash signature (v1):

Crash signature (v2):

Description

Hi.

I was investigating why RocksDB performance is so bad considering random 4K iops. I was looking at strace and one thing catched my eye - that was the OSD doing TWO transactions for each small (deferred) incoming write. In both mimic and nautilus strace looks like the following:

There are groups of 5 operations repeated by `bstore_kv_sync`:

pwritev(8-12 kb, offset=1440402997248), offsets always increase
sync_file_range(just written 8-12 kb)
fdatasync()
io_submit(op=pwritev, iov_len=4 kb, aio_offset=1455403167744) - offsets differ from first step, but also increase by 4 kb with each write
fdatasync()

After every 64 such groups there come some io_submit's from `bstore_kv_final` - this is obviously the application of deferred writes, and the first pwritev is obviously the RocksDB WAL.

But what's the remaining io_submit?

It is the BlueFS's WAL! And all that it seems to do is to increase the size of RocksDB WAL and changing its modification time. So again you have "journaling of the journal"-like issue, as in old days with filestore :)

Then I found the "bluefs_preextend_wal_files" option, and yes, it disables this behaviour when set to true, and random IOPS increase by +50..+100% depending on the workload. But it corrupts the RocksDB when the OSD is shut down uncleanly. It's https://tracker.ceph.com/issues/18338 which I easily reproduced by starting a single OSD locally and writing with "fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=16 -rw=write -pool=bench -rbdname=testimg" into it.

I think this is REALLY ugly. Bluestore is just wasting 1/3 to 1/2 random iops performance. It must be fixed :)

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Greg Farnum about 5 years ago

Project changed from Ceph to bluestore

Actions

Copy link

Updated by Sage Weil about 5 years ago

This goes away after you write more metadta into rocksdb and it starts overwriting previous wal files. The purpose of this option is only to make things fast out of the gate for the purposes of convenient benchmarking etc.

Actions

Copy link

Updated by Sage Weil about 5 years ago

Status changed from New to 12

Actions

Copy link

Updated by Vitaliy Filippov about 5 years ago

Yes, I've thought of that but I haven't tested it... However this is rather strange then. Who does the fsync if BlueFS isn't writing to WALs? sync_file_range is called with unsafe flags so it's something else...

In fact I've submitted a pull request here https://github.com/ceph/ceph/pull/26870

It adds more wait flags to sync_file_range, makes it safe and thus makes it possible to enable bluefs_preextend_wal_files. I experience +100% iops on HDD setups with this patch and bluefs_preextend_wal_files set to true.

Actions

Copy link

Updated by Neha Ojha about 5 years ago

Status changed from 12 to Fix Under Review

https://github.com/ceph/ceph/pull/26909

Actions

Copy link

Updated by Kefu Chai almost 5 years ago

Status changed from Fix Under Review to Pending Backport
Backport set to mimic, nautilus
Pull request ID set to 26909

Actions

Copy link

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #40280: mimic: 50-100% iops lost due to bluefs_preextend_wal_files = false added

Actions

Copy link

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #40281: nautilus: 50-100% iops lost due to bluefs_preextend_wal_files = false added

Actions

Copy link

Updated by Konstantin Shalygin over 4 years ago

luminous: https://github.com/ceph/ceph/pull/29564

Actions

Copy link

#10

Updated by Nathan Cutler over 4 years ago

Status changed from Pending Backport to Resolved

Having been run with --resolve-parent, the script "backport-create-issue" set the status of this issue to "Resolved" because it determined all backport issues are in status Resolved.

Actions

Copy link

#11

Updated by Nathan Cutler over 4 years ago

Backport changed from mimic, nautilus to mimic, nautilus, luminous

Actions

Copy link

#12

Updated by Nathan Cutler over 4 years ago

Status changed from Resolved to Pending Backport
Target version deleted (~~v14.0.0~~)

Actions

Copy link

#13

Updated by Nathan Cutler over 4 years ago

Copied to Backport #41510: luminous: 50-100% iops lost due to bluefs_preextend_wal_files = false added

Actions

Copy link

#14

Updated by Nathan Cutler over 4 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » bluestore

Custom queries

Bug #38559

50-100% iops lost due to bluefs_preextend_wal_files = false

Updated by Greg Farnum about 5 years ago

Updated by Sage Weil about 5 years ago

Updated by Sage Weil about 5 years ago

Updated by Vitaliy Filippov about 5 years ago

Updated by Neha Ojha about 5 years ago

Updated by Kefu Chai almost 5 years ago

Updated by Nathan Cutler almost 5 years ago

Updated by Nathan Cutler almost 5 years ago

Updated by Konstantin Shalygin over 4 years ago

Updated by Nathan Cutler over 4 years ago

Updated by Nathan Cutler over 4 years ago

Updated by Nathan Cutler over 4 years ago

Updated by Nathan Cutler over 4 years ago

Updated by Nathan Cutler over 4 years ago