Bug #45765

BlueStore::_collection_list causes huge latency growth during PG deletion

Added by Aleksei Zakharov almost 4 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific, octopus, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi!
We have a Ceph v14.2.7 cluster with about 2 billion objects. Each object is smaller than 4KB, and one PG holds about 1 million objects.

When we turned on the balancer, we faced huge latency growth: most of the time write latency is not higher than 700us, and read latency is much lower. After some PGs moved from one OSD to another, we saw latency start growing. Even after the balancer was turned off and the backfill process had finished, latency was up to hundreds of milliseconds.

We did some research and found that the OSD spends a lot of time inside PG::_delete_some() -> BlueStore::_collection_list() -> rocksdb::DBIter::Next() -> rocksdb::DBIter::FindNextUserEntryInternal(). This happens while searching for the first key.

We set osd_delete_sleep_ssd to 1 to reduce CPU usage during PG deletion and the latency of most requests, but it didn't help in general.
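
For reference, one way to apply such an option at runtime is through the mon config store (assuming the ceph config interface available since Nautilus; the osd target applies it to all OSDs):

~# ceph config set osd osd_delete_sleep_ssd 1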
The latency of some FindNextUserEntryInternal() calls is up to several seconds. I suppose this happens close to the end of deleting all of a PG's objects:

~# bpftrace -e 'u:/usr/bin/ceph-osd:_ZN7rocksdb6DBIter25FindNextUserEntryInternalEbb /pid==1728609/{@start[tid]=nsecs;} uretprobe:/usr/bin/ceph-osd:_ZN7rocksdb6DBIter25FindNextUserEntryInternalEbb /pid==1728609/{ @lat=(nsecs-@start[tid]); if (@lat>(1000*1000)) { printf("slow: %d msec\n", @lat/1000/1000) } delete(@start[tid]); }'
Attaching 2 probes...
slow: 7159 msec
slow: 7884 msec
slow: 6748 msec
slow: 7140 msec
slow: 7906 msec
slow: 6675 msec
slow: 7136 msec

rocksdb::DBIter::FindNextUserEntryInternal() spins inside its main loop while searching for the first key, doing more than 100K iterations. To prevent this, it's possible to use rocksdb::ReadOptions::iterate_upper_bound. It looks like the CockroachDB developers faced the same issue: https://www.cockroachlabs.com/blog/adventures-performance-debugging/
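
To illustrate the idea, here is a minimal sketch of iterate_upper_bound against plain RocksDB (the scan_prefix function and its key arguments are made up for illustration; this is not BlueStore code):

#include <rocksdb/db.h>
#include <rocksdb/iterator.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>
#include <iostream>
#include <memory>
#include <string>

// Scan keys in [start_key, end_key) with the iterator bounded at end_key,
// so Next() reports !Valid() at the bound instead of skipping tombstones
// until it finds the next live key outside the range.
void scan_prefix(rocksdb::DB* db,
                 const std::string& start_key,
                 const std::string& end_key) {
  rocksdb::ReadOptions ro;
  rocksdb::Slice upper(end_key);
  ro.iterate_upper_bound = &upper;  // must stay alive while the iterator is in use
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek(start_key); it->Valid(); it->Next()) {
    std::cout << it->key().ToString() << "\n";
  }
}

With such a bound in place, a seek near the end of a PG's key range stops at end_key instead of iterating over every deleted entry that follows it.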

Unfortunately, it's not very clear to me how to implement iterate_upper_bound usage for BlueStore.


Files

compact (23.5 KB) Aleksei Zakharov, 06/03/2020 10:58 AM

Related issues 6 (1 open, 5 closed)

Related to bluestore - Bug #40741: Mass OSD failure, unable to restart (Triaged, Igor Fedotov)
Related to bluestore - Bug #45994: OSD crash - in thread tp_osd_tp (Duplicate)
Related to RADOS - Bug #47044: PG::_delete_some isn't optimal iterating objects (Resolved, Igor Fedotov)
Copied to bluestore - Backport #49964: octopus: BlueStore::_collection_list causes huge latency growth during PG deletion (Resolved, Neha Ojha)
Copied to bluestore - Backport #49965: pacific: BlueStore::_collection_list causes huge latency growth during PG deletion (Resolved, Neha Ojha)
Copied to bluestore - Backport #49966: nautilus: BlueStore::_collection_list causes huge latency growth during PG deletion (Resolved, Neha Ojha)
