Bug #55324

rocksdb omap iterators become extremely slow in the presence of large delete range tombstones

Added by Cory Snyder about 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy, pacific
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The high-level problem is a severe performance degradation of RGW bucket listings. The underlying issue is RocksDB range delete tombstones [1]. At the extremes, we were seeing listing operations that normally took a few milliseconds taking upwards of 30 seconds. The problem is compounded by clients that do a lot of bucket listing operations. With such extreme latencies, frequent bucket listing ops quickly clog up OSD op queues and basically bring cluster throughput to a halt. The lack of throughput (particularly writes) prevents effective compaction, and therefore prevents removal of the range delete tombstones that cause the issue. It's basically a positive feedback loop.

The problem seems to mostly manifest itself after the completion of a large RGW bucket reshard operation. When a reshard operation completes successfully, all of the old index objects are removed. When a reshard does not complete successfully, all of the partially-built, new index entries are removed. These bucket index objects are pure omap and they can have a large number of keys/values. Dynamic bucket resharding attempts to keep the size of individual index objects at or less than rgw_max_objs_per_shard, which defaults to 100,000 entries. Due to the bug with dynamic resharding on versioned buckets in Pacific v16.2.7 [2], though, some of these objects were allowed to grow to a few million key pairs in our original problematic cluster. Note that even index objects with 100,000 entries can cause significant problems. Also note that we seem to have experienced this issue in contexts other than bucket resharding, possibly related to remapped PGs causing OSDs to remove large omap objects that they were no longer responsible for - but we don't have enough evidence to make any definite claims in that regard. Ultimately, the problem occurs whenever a large delete range tombstone gets created in RocksDB to clear omap entries for a deleted object.
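
For illustration, here is a minimal sketch of how clearing an object's omap keyspace can boil down to a single RocksDB DeleteRange over the object's key prefix. The helper name and key layout below are hypothetical simplifications, not BlueStore's actual omap encoding; DeleteRange itself is the real RocksDB API that produces the range tombstones discussed above.

```cpp
#include <rocksdb/db.h>

#include <string>

// Hypothetical helper (not BlueStore's actual code): drop every omap key
// belonging to one object by issuing a single DeleteRange over the object's
// key prefix. The prefix layout is a simplification of the real omap encoding.
rocksdb::Status clear_object_omap(rocksdb::DB* db,
                                  rocksdb::ColumnFamilyHandle* omap_cf,
                                  const std::string& object_prefix) {
  // Assume all omap keys of the object sort inside [prefix, prefix + '\xff').
  std::string begin = object_prefix;
  std::string end = object_prefix;
  end.push_back('\xff');

  // One call, whether the object holds 100 or 1,000,000 keys. RocksDB records
  // this as a range tombstone; the covered keys are only physically removed
  // by later compactions.
  return db->DeleteRange(rocksdb::WriteOptions(), omap_cf, begin, end);
}
```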

So why do these delete range tombstones cause such a problem?

An RGW bucket list operation first requires a RocksDB Seek to position the iterator at the first key relevant to the listing. Note that the BlueStore upper_bound and lower_bound omap iterator methods are just thin wrappers around RocksDB Seek operations. In Pacific, omap data is sharded among three RocksDB column families, p-0, p-1, and p-2 (by default). These shards are logically combined during iteration via the ShardMergeIteratorImpl, which simply performs the Seek operation on each shard concurrently and then merges the keys from the shards in sorted order to provide a single, ordered, logical iteration over all of them. Note that the placement of entries within these shards is dictated by a hash constrained by the PG ID, so all objects within the same PG share a common RocksDB CF shard.
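
As a rough illustration of that merge, here is a simplified sketch (this is only an illustration of the idea, not the actual ShardMergeIteratorImpl): Seek every shard's iterator to the same bound, then repeatedly yield the shard whose iterator currently points at the smallest key.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/iterator.h>

#include <memory>
#include <vector>

// Simplified illustration of merging per-shard iterators (p-0, p-1, p-2)
// into one ordered stream; not the real ShardMergeIteratorImpl.
class SimpleShardMerge {
 public:
  explicit SimpleShardMerge(std::vector<std::unique_ptr<rocksdb::Iterator>> shards)
      : shards_(std::move(shards)) {}

  // Each shard performs its own RocksDB Seek; the merged Seek is only done
  // once every shard's Seek has settled.
  void Seek(const rocksdb::Slice& lower_bound) {
    for (auto& it : shards_) it->Seek(lower_bound);
    pick_current();
  }

  bool Valid() const { return current_ != nullptr; }

  // Only meaningful while Valid() is true.
  rocksdb::Slice key() const { return current_->key(); }

  void Next() {
    current_->Next();
    pick_current();
  }

 private:
  // Choose the shard whose iterator currently points at the smallest key.
  void pick_current() {
    current_ = nullptr;
    for (auto& it : shards_) {
      if (!it->Valid()) continue;
      if (current_ == nullptr || it->key().compare(current_->key()) < 0)
        current_ = it.get();
    }
  }

  std::vector<std::unique_ptr<rocksdb::Iterator>> shards_;
  rocksdb::Iterator* current_ = nullptr;
};
```

Because the merged iterator cannot return its first key until every shard's Seek has completed, a single shard crawling through a tombstoned range determines the latency of the whole listing.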

Now imagine the following situation (a minimal repro sketch follows this list):
1. A reshard operation has just completed and delete range tombstones have been created for all of the old bucket index objects. These tombstones cover 100k+ key pairs in RocksDB.
2. A client makes a bucket listing request for bucket A.
3. There exists one or more OSDs which is the primary for one of the bucket index shard objects of bucket A, and which was a member of the acting set for one of the old bucket index shards. These new/old index shard objects exist on different PGs and those PGs map to different RocksDB omap CF shards. Let's say that the new bucket index shard maps to CF p-1 and the old (deleted) one mapped to p-2.
4. When ShardMergeIteratorImpl performs its Seek in the relevant omap key range on p-1, it returns really quickly because RocksDB does binary search to find the existing key which satisfies the specified upper/lower bound.
5. When ShardMergeIteratorImpl performs its Seek in the relevant omap key range on p-2, it happens that the next key which satisfies the specified upper/lower bound is the first key from the old index shard object (that is now covered by the delete range tombstone). Note that RocksDB does the same binary search to find this key, but then must check at a higher level to see whether it is covered by a tombstone [3]. When RocksDB determines that the key is covered by the tombstone, it continues to iterate over subsequent keys until it finds one that is not deleted. That is, RocksDB loops through the entire deleted key range.
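
To observe steps 4 and 5 in isolation, the following is a self-contained repro sketch against a standalone RocksDB instance (the path, key names, and counts are arbitrary, and the actual slowdown depends on the RocksDB version and on whether compaction has already dropped the covered keys): fill a key range, cover it with a single range tombstone, then compare a Seek that lands in front of the tombstone with one that lands in untouched key space.

```cpp
#include <rocksdb/db.h>

#include <chrono>
#include <cstdio>
#include <string>

// Repro sketch: a Seek that lands just below a large range tombstone has to
// step over every covered key, while a Seek into untouched key space returns
// after an ordinary binary search.
int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/tombstone-repro", &db);
  if (!s.ok()) { std::fprintf(stderr, "open: %s\n", s.ToString().c_str()); return 1; }

  // 1. Populate an "old index shard" key range plus one live key after it.
  char key[32];
  for (int i = 0; i < 200000; i++) {
    std::snprintf(key, sizeof(key), "old-%08d", i);
    db->Put(rocksdb::WriteOptions(), key, "v");
  }
  db->Put(rocksdb::WriteOptions(), "zzz-live-key", "v");

  // 2. Cover the whole "old-" range with one range tombstone (no compaction yet).
  db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(), "old-", "old.");

  auto timed_seek = [&](const char* target) {
    rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
    auto t0 = std::chrono::steady_clock::now();
    it->Seek(target);  // may have to skip every tombstoned key before a live one
    auto t1 = std::chrono::steady_clock::now();
    std::printf("Seek(%s) -> %s in %lld us\n", target,
                it->Valid() ? it->key().ToString().c_str() : "(end)",
                (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    delete it;
  };

  timed_seek("old-");   // lands in front of the tombstone: slow path
  timed_seek("zzz-");   // untouched key space: fast path

  delete db;
  return 0;
}
```

Built against a local librocksdb (e.g. g++ -std=c++17 repro.cc -lrocksdb), the first Seek has to step over all covered keys before reaching the live one, while the second returns immediately.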

Looping through 100k+ keys is obviously far less efficient than the normal binary search that would happen in the absence of any tombstones. In a cluster with a very large bucket reshard operation, we've seen this cause up to a 3,000-fold decrease in bucket listing performance.

[1] http://rocksdb.org/blog/2018/11/21/delete-range.html
[2] https://tracker.ceph.com/issues/51429
[3] http://github.com/cockroachdb/pebble/blob/master/docs/rocksdb.md


Related issues 3 (1 open, 2 closed)

Related to bluestore - Bug #55444: test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test (Pending Backport, Adam Kupczyk)
Copied to bluestore - Backport #55441: quincy: rocksdb omap iterators become extremely slow in the presence of large delete range tombstones (Resolved, Cory Snyder)
Copied to bluestore - Backport #55442: pacific: rocksdb omap iterators become extremely slow in the presence of large delete range tombstones (Resolved, Adam Kupczyk)
Actions #1

Updated by Cory Snyder about 2 years ago

  • Pull request ID set to 45904
Actions #2

Updated by Neha Ojha almost 2 years ago

  • Status changed from New to Fix Under Review
Actions #3

Updated by Neha Ojha almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #4

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55441: quincy: rocksdb omap iterators become extremely slow in the presence of large delete range tombstones added
Actions #5

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55442: pacific: rocksdb omap iterators become extremely slow in the presence of large delete range tombstones added
Actions #7

Updated by Ilya Dryomov almost 2 years ago

  • Related to Bug #55444: test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test added
Actions #8

Updated by Wes Dillingham almost 2 years ago

Would this be applicable to CephFS as well? We are seeing significant impact (osd slow ops) after deletes on a 16.2.7 EC CephFS cluster.

Actions #9

Updated by Stefan Kooman almost 2 years ago

Like @Wes Dillingham, I would like to know whether CephFS performs operations that can trigger similar behavior (directory fragmentation?). Does this only affect Pacific, or does it also affect Octopus (and older) releases? We experience slow ops during remaps and (deep-)scrubs on CephFS metadata pools.

Actions #10

Updated by Igor Fedotov almost 2 years ago

Stefan Kooman wrote:

Like @Wes Dillingham, I would like to know whether CephFS performs operations that can trigger similar behavior (directory fragmentation?). Does this only affect Pacific, or does it also affect Octopus (and older) releases? We experience slow ops during remaps and (deep-)scrubs on CephFS metadata pools.

Yes, potentially this might affect any subsystem which [extensively] accesses RocksDB.
And the issue applies to Octopus and older releases as well.

Actions #11

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #12

Updated by Igor Fedotov over 1 year ago

  • Status changed from Pending Backport to Resolved
Actions #13

Updated by Anonymous over 1 year ago

I don't see the PR showing up in any release notes. I assume this was not backported to the last octopus release? In which pacific release is it?

Thanks!

Actions #14

Updated by Anonymous over 1 year ago

Sven Kieske wrote:

I don't see the PR showing up in any release notes. I assume this was not backported to the last octopus release? In which pacific release is it?

Thanks!

I see this was backported in: https://github.com/ceph/ceph/pull/45963 but was later reverted in https://github.com/ceph/ceph/pull/46092

so this is not fixed in pacific, or was there another backport attempted that I missed?

Actions #15

Updated by Benoît Knecht over 1 year ago

I see this was backported in: https://github.com/ceph/ceph/pull/45963 but was later reverted in https://github.com/ceph/ceph/pull/46092

so this is not fixed in pacific, or was there another backport attempted that I missed?

I think it was reintroduced in https://github.com/ceph/ceph/pull/46096, which was released in Pacific 16.2.8.

Actions #16

Updated by Konstantin Shalygin over 1 year ago

Sven Kieske wrote:

I assume this was not backported to the last octopus release?

Yes, Octopus is EOL.

Actions #17

Updated by Igor Fedotov over 1 year ago

Benoît Knecht wrote:

I see this was backported in: https://github.com/ceph/ceph/pull/45963 but was later reverted in https://github.com/ceph/ceph/pull/46092

so this is not fixed in pacific, or was there another backport attempted that I missed?

I think it was reintroduced in https://github.com/ceph/ceph/pull/46096, which was released in Pacific 16.2.8.

Right!
