Bug #45765

BlueStore::_collection_list causes huge latency growth during PG deletion

Added by Aleksei Zakharov almost 4 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific, octopus, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi!
We have a Ceph v14.2.7 cluster with about 2 billion objects. Each object is smaller than 4KB, and one PG holds about 1 million objects.

When we turned on the balancer, we faced huge latency growth: most of the time write latency is not higher than 700us, and read latency is much lower. After some PGs moved from one OSD to another, we saw latency start growing. Even after the balancer was turned off and the backfill process had finished, latency was up to hundreds of milliseconds.

We did some research and found that the OSD spends a lot of time inside PG::_delete_some() -> BlueStore::_collection_list() -> rocksdb::DBIter::Next() -> rocksdb::DBIter::FindNextUserEntryInternal(). This happens while searching for the first key.

We set osd_delete_sleep_ssd to 1 to reduce CPU usage during PG deletion and the latency of most requests, but it didn't help in general.
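
For reference, one way to apply such an option at runtime is through the mon config store (assuming the ceph config interface available since Nautilus; the osd target applies it to all OSDs):

~# ceph config set osd osd_delete_sleep_ssd 1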
The latency of some FindNextUserEntryInternal() calls is up to several seconds. I suppose this happens close to the end of deleting all of a PG's objects:

~# bpftrace -e 'u:/usr/bin/ceph-osd:_ZN7rocksdb6DBIter25FindNextUserEntryInternalEbb /pid==1728609/{@start[tid]=nsecs;} uretprobe:/usr/bin/ceph-osd:_ZN7rocksdb6DBIter25FindNextUserEntryInternalEbb /pid==1728609/{ @lat=(nsecs-@start[tid]); if (@lat>(1000*1000)) { printf("slow: %d msec\n", @lat/1000/1000) } delete(@start[tid]); }'
Attaching 2 probes...
slow: 7159 msec
slow: 7884 msec
slow: 6748 msec
slow: 7140 msec
slow: 7906 msec
slow: 6675 msec
slow: 7136 msec

rocksdb::DBIter::FindNextUserEntryInternal() spins inside its main loop while searching for the first key, doing more than 100K iterations. To prevent this, it's possible to use rocksdb::ReadOptions::iterate_upper_bound. It looks like the CockroachDB developers faced the same issue: https://www.cockroachlabs.com/blog/adventures-performance-debugging/
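
To illustrate the idea, here is a minimal sketch of iterate_upper_bound against plain RocksDB (the scan_prefix function and its key arguments are made up for illustration; this is not BlueStore code):

#include <rocksdb/db.h>
#include <rocksdb/iterator.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>
#include <iostream>
#include <memory>
#include <string>

// Scan keys in [start_key, end_key) with the iterator bounded at end_key,
// so Next() reports !Valid() at the bound instead of skipping tombstones
// until it finds the next live key outside the range.
void scan_prefix(rocksdb::DB* db,
                 const std::string& start_key,
                 const std::string& end_key) {
  rocksdb::ReadOptions ro;
  rocksdb::Slice upper(end_key);
  ro.iterate_upper_bound = &upper;  // must stay alive while the iterator is in use
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek(start_key); it->Valid(); it->Next()) {
    std::cout << it->key().ToString() << "\n";
  }
}

With such a bound in place, a seek near the end of a PG's key range stops at end_key instead of iterating over every deleted entry that follows it.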

Unfortunately, it's not very clear to me how to implement iterate_upper_bound usage for BlueStore.


Files

compact (23.5 KB) Aleksei Zakharov, 06/03/2020 10:58 AM

Related issues 6 (1 open, 5 closed)

Related to bluestore - Bug #40741: Mass OSD failure, unable to restart (Triaged, Igor Fedotov)
Related to bluestore - Bug #45994: OSD crash - in thread tp_osd_tp (Duplicate)
Related to RADOS - Bug #47044: PG::_delete_some isn't optimal iterating objects (Resolved, Igor Fedotov)
Copied to bluestore - Backport #49964: octopus: BlueStore::_collection_list causes huge latency growth during PG deletion (Resolved, Neha Ojha)
Copied to bluestore - Backport #49965: pacific: BlueStore::_collection_list causes huge latency growth during PG deletion (Resolved, Neha Ojha)
Copied to bluestore - Backport #49966: nautilus: BlueStore::_collection_list causes huge latency growth during PG deletion (Resolved, Neha Ojha)
