Bug #23510 (closed): rocksdb spillover for hard drive configurations

Added by Ben England about 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): BlueStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

version: ceph-*-12.2.1-34.el7cp.x86_64

One of BlueStore's best use cases is accelerating writes for metadata-intensive workloads such as RGW or CephFS, particularly on HDD configurations with an SSD available for block.wal and block.db. However, in RGW testing done by the Red Hat Perf & Scale team (John Harrigan) at scale, with the cluster 50% full, RocksDB consistently overflowed onto the HDD, and not just a little. I'm still developing a comprehensive dataset that shows this, but it's easy to see by running ceph daemon osd.NNN perf dump and looking at the "bluefs" counters. Here's a representative example from an OSD with a 10-GB RocksDB partition, a 5-GB WAL, and a 2-TB HDD. The full OSD counters are [here](https://pastebin.com/pD02b42T), but the short version is that there is plenty of free space in RocksDB's SSD partition, yet RocksDB is still using space on the HDD.

bluefs:
"db_total_bytes": 10239336448,
"db_used_bytes": 415236096,
"slow_total_bytes": 79989571584,
"slow_used_bytes": 2054160384,
osd:
"stat_bytes_used": 50741678080,
"stat_bytes_avail": 1959235878912,

This means that RocksDB has 10 GB of space in its SSD partition but is currently using only 0.4 GB of it (4%), while it has allocated 80 GB of space on the HDD and is currently using 2 GB there, which is 20% of the size of the SSD partition. In other words, it is using five times as much space on the slow device as on the fast device. Contrast this with zero used space on the slow device when the OSD is first created. So if you age the OSD by repeatedly creating and deleting objects, you get this behavior, which should significantly increase the latency of metadata operations on the OSD. There is not enough data to prove this yet, but we have measured increasing RGW latency with COSBench coinciding with this spillover, more than once.
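For reference, the same arithmetic can be pulled straight from the admin socket on each OSD host. Below is a minimal Python sketch (the script name, OSD id, and output format are illustrative and not part of Ceph); it simply parses the "bluefs" section of ceph daemon osd.<id> perf dump and reports how much of RocksDB has spilled onto the slow device:

    # check_bluefs_spillover.py -- illustrative sketch, not part of Ceph
    # Reads the "bluefs" perf counters from a local OSD admin socket and
    # reports how much RocksDB data has spilled onto the slow (HDD) device.
    import json
    import subprocess
    import sys

    osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"   # e.g. "12" for osd.12
    out = subprocess.check_output(
        ["ceph", "daemon", "osd." + osd_id, "perf", "dump"])
    bluefs = json.loads(out)["bluefs"]

    db_total = bluefs["db_total_bytes"]
    db_used = bluefs["db_used_bytes"]
    slow_total = bluefs["slow_total_bytes"]
    slow_used = bluefs["slow_used_bytes"]

    gib = 1024.0 ** 3
    print("db   : %.1f / %.1f GiB used (%.0f%%)" %
          (db_used / gib, db_total / gib, 100.0 * db_used / db_total))
    print("slow : %.1f / %.1f GiB used (%.0f%%)" %
          (slow_used / gib, slow_total / gib,
           100.0 * slow_used / slow_total if slow_total else 0.0))
    if slow_used > 0:
        print("WARNING: osd.%s has %.1f GiB of RocksDB data on the slow device"
              % (osd_id, slow_used / gib))

Run against the counters quoted above, this would report roughly 0.4 of 9.5 GiB used on the db device and 1.9 of 74.5 GiB used on the slow device; any non-zero slow_used_bytes means BlueFS has placed RocksDB files on the HDD.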

Questions:
- What amount of SSD partition space is sufficient to prevent spillover?
- If spillover has occurred, can it be repaired? Can we manually initiate compaction of RocksDB to force it back onto the SSD partition?
- Why is 80 GB of space allocated on the HDD for RocksDB? Was that much space actually used in the past?
- Why is the spillover happening while there is plenty of free space on the SSD partition? Can Ceph prevent this?

Thx -ben

Actions #1

Updated by Igor Fedotov about 6 years ago

Ben,
this has been fixed by https://github.com/ceph/ceph/pull/19257
Not sure about the exact Luminous build it landed in, but it was definitely not v12.2.1. I suppose it was v12.2.3.

Actions #2

Updated by Nathan Cutler about 6 years ago

Igor Fedotov wrote:

Ben,
this has been fixed by https://github.com/ceph/ceph/pull/19257
Not sure about the exact Luminous build it landed in, but it was definitely not v12.2.1. I suppose it was v12.2.3.

Yes, that PR was released in v12.2.3

Actions #3

Updated by Greg Farnum about 6 years ago

  • Status changed from New to Resolved