Project

General

Profile

Bug #45765

BlueStore::_collection_list causes huge latency growth pg deletion

Added by Aleksei Zakharov over 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific, octopus, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi!
We have ceph v14.2.7 cluster with about 2 billions of objects. Each object is less than 4K size. One PG have about 1 million of objects.

When we turned on the balancer, we faced with huge latency growth: most of the time write latency is not higher than 700us, and read latency is much less. After some PG's moved from one OSD to another we saw that latency started growing. Even when balancer was turned off and backfill process was finished, latency was up to hundreds of milliseconds.

We did some research and found that OSD spends a lot of time inside PG::_delete_some()->BlueStore::_collection_list()->rocksdb::DBIter::Next()->rocksdb::DBIter::FindNextUserEntryInternal(). This happens during search of the first key.

We set osd_delete_sleep_ssd to 1 to reduce CPU usage during pg deletion and latency of the most requests, but it didn't help in general.
Latency of some of the FindNextUserEntryInternal() calls are up to seconds. I suppose, it happens close the end of all objects in PG deletion:

~# bpftrace -e 'u:/usr/bin/ceph-osd:_ZN7rocksdb6DBIter25FindNextUserEntryInternalEbb /pid==1728609/{@start[tid]=nsecs;} uretprobe:/usr/bin/ceph-osd:_ZN7rocksdb6DBIter25FindNextUserEntryInternalEbb /pid==1728609/{ @lat=(nsecs-@start[tid]); if (@lat>(1000*1000)) { printf("slow: %d ms
ec\n",@lat/1000/1000) } delete(@start[tid]); }'
                                                                                                                                                                                                                                  ~
Attaching 2 probes...
slow: 7159 msec
slow: 7884 msec
slow: 6748 msec
slow: 7140 msec
slow: 7906 msec
slow: 6675 msec
slow: 7136 msec

rocksdb::DBIter::FindNextUserEntryInternal() is spinning inside it's main loop, doing search of the first key. It does more than 100K iterations. To prevent this, it's possible to use rocksdb::ReadOptions::iterate_upper_bound. It looks like cocroachdb developers faced with the same issue: https://www.cockroachlabs.com/blog/adventures-performance-debugging/

Unfortunately it's not very clean for me how to implement iterate_upper_bound usage for bluestore.

compact (23.5 KB) Aleksei Zakharov, 06/03/2020 10:58 AM


Related issues

Related to bluestore - Bug #40741: Mass OSD failure, unable to restart Triaged
Related to bluestore - Bug #45994: OSD crash - in thread tp_osd_tp Duplicate
Related to RADOS - Bug #47044: PG::_delete_some isn't optimal iterating objects Resolved
Copied to bluestore - Backport #49964: octopus: BlueStore::_collection_list causes huge latency growth pg deletion Resolved
Copied to bluestore - Backport #49965: pacific: BlueStore::_collection_list causes huge latency growth pg deletion Resolved
Copied to bluestore - Backport #49966: nautilus: BlueStore::_collection_list causes huge latency growth pg deletion Resolved

History

#1 Updated by Igor Fedotov over 2 years ago

Yeah, that's a known issue with RocksDB/BlueStore.

Manual compaction using "ceph-kvstore-tool bluestore-kv <path-to-osd> compact" command might temporarily help but the root cause is rather a mismatch between high DB load and slow hw backing it. AFAIR you have spinner-only OSDs, haven't you?

#2 Updated by Aleksei Zakharov over 2 years ago

Nope, we use NVMe SSD's.

The issue starts with high page cache usage. Bluefs uses buffered IO, it reads a lot from page cache. At this moment latency of FindNextUserEntryInternal() is measured in milliseconds (which is slow, it shouldn't be so). Hours later we see more and more "active" pages in the page cache and RocksDB starts prefetching data from the NVMe drives. It does about 400MB/s reads during 3 hours. It looks fast enough, isn't it? But RocksDB just needs to read too much on every rocksdb::DBIter::Next() call.

#3 Updated by Igor Fedotov over 2 years ago

Anyway suggesting to compact DB manually.

Besides recently we switched backed to direct IO for bluefs, see https://github.com/ceph/ceph/pull/34297

Likely your nautilus release hasn't got this modification yet though.
So you might want to tune your cluster's settings by yourself. Please make sure bluefs_preextend_wal_files is set to false (which is a default for Nautilus) in this case - setting it to true might cause data corruption, see https://tracker.ceph.com/issues/45613 for details.

#4 Updated by Aleksei Zakharov over 2 years ago

I will try manual compaction, thanks!

I'm not sure that turning direct IO on for bluefs can help somehow in this particular case. Not all OSD's at a time are affected by this issue, so shared cache helps to mitigate disk IO for a significant amount of time. If we turn buffered io off, then it will read from disk.

#5 Updated by Aleksei Zakharov over 2 years ago

Unfortunately manual compaction doesn't work. Error log is attached.

#6 Updated by Igor Fedotov over 2 years ago

So this OSD doesn't start any more, does it?
Please set debug-bluestore to 20 and collect the log if so.

#7 Updated by Igor Fedotov over 2 years ago

  • Project changed from Ceph to bluestore

#8 Updated by Aleksei Zakharov over 2 years ago

It starts and works ok. The only issue with it is the latency issue described earlier.

#9 Updated by Igor Fedotov over 2 years ago

So it's failing during compaction only, right? It's even more interesting, could you please share more verbose ?log?

#10 Updated by Aleksei Zakharov over 2 years ago

It is failing during any ceph-kvstore-tool command. It doesn't matter compaction/stats or list. Looks like another issue, doesn't it?:)
Result of ceph-kvstore-tool start with --debug_bluestore=20/20 --debug_bluefs=20/20 --debug_rocksdb=20/20: https://166924.selcdn.ru/links/compact_verbose.tar.gz

#11 Updated by Igor Fedotov over 2 years ago

Looks like you're using custom bluefs_shared_alloc_size setting set to 4K. And kvstore_tool is unaware of that and tries to use default 64K value (note 10000 value at the end):
2020-06-05 12:14:27.719 7fba21347d80 10 fbmap_alloc 0x21fe600 BitmapAllocator 0xe3fe700000/10000

Which finally results in assertion when marking 4K extent as free:
-2> 2020-06-05 12:14:29.015 7fba21347d80 10 fbmap_alloc 0x21fe600 init_rm_free 0x8480763000~1000

You can run kvstore tool using custom settings using CEPH_ARGS env variable:
CEPH_ARGS="--bluefs-shared-alloc-size 65536" ceph-kvstore-tool ...

#12 Updated by Igor Fedotov over 2 years ago

Proper command line for my previous comment would be:
CEPH_ARGS="--bluefs-shared-alloc-size 4096" ceph-kvstore-too

#13 Updated by Aleksei Zakharov over 2 years ago

Thanks, it works now!
Do we need to run manual compaction, when we see high latency? It takes about 30 minutes to compact one OSD.

#14 Updated by Igor Fedotov over 2 years ago

Manual compaction is a workaround for slow KV access which tends to be cause by prior massive data removals. In regular circumstances RocksDB performs automatic compaction periodically and it's enough for normal everyday operations. But as I said massive removals might cause performance drop sometimes. Then manual compactions help.

#15 Updated by Igor Fedotov over 2 years ago

  • Related to Bug #40741: Mass OSD failure, unable to restart added

#16 Updated by Aleksei Zakharov over 2 years ago

Unfortunately, manual compaction of one OSD takes about 30 minutes. So, in this situation we're forced to do a lot of manual actions during pool scaling process: watch for latency and stop some OSD's. Also, it leads to cluster working without fault-tolerance during compaction. :(
Are there any chances for a patch in the nearest future? I'm inquiring because we need to know if we need to take any actions on this pool.

#17 Updated by Igor Fedotov over 2 years ago

@Aleksei,
first of all manual compaction isn't supposed to be the regular mean to operate with. Generally RocksDB performs compaction automatically on its own which is sufficient in most of cases. But under some circumstances (very KV-intensive (and prolongated?) load, insufficient Ceph resources, massive KV removals) DB might get into this "slow" state where manual compaction helps. Unfortunately for now I don't know how one can fix this issue totally at BlueStore/RocksDB level.
Hence absolutely no expectation on the patch.

IMO you should revise what's happening at upper levels and try to change use pattern in a way to reduce massive object removals(migrations over OSDs).
Are you using RGW? If so for instance you might want to check Mark's Nelson comment at https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/XGVON4NNONIZCRCHFY5FVLWNG4STPGLE/ about bucket resharding.

Unfortunately I'm not an RGW expert hence my abilities to help are very limited on this path.

Nevertheless for the sake of information recording some questions, please:
1) Does manual compaction help for a while in your case?
2) Are compaction times always that high?
3) A while ago I made a putch to speed up pool/pg removal. See https://github.com/ceph/ceph/pull/28699
Would you be able to make and try a custom build with this patch? I can prepare a diff for a specific Nautilus release if needed.

#18 Updated by Igor Fedotov over 2 years ago

  • Related to Bug #45994: OSD crash - in thread tp_osd_tp added

#19 Updated by Aleksei Zakharov over 2 years ago

We don't use RGW, we use self-written client which operates with small objects (~500-700-byte objects). Load is not that high - about 200 read ops/osd and 50-100 write ops/osd during normal operation. It uses plain rados objects, no omap. It works perfect most of the time. We faced with this issue when we tried to move some PG's.

1. Yes, manual compation helps in our case.
2. Yes, it always takes about 30 minutes.
3. Thanks, we'll try this patch, it will take some time. At first glance this one might help.

#20 Updated by Neha Ojha over 2 years ago

  • Related to Bug #47044: PG::_delete_some isn't optimal iterating objects added

#21 Updated by Andy Allan about 2 years ago

I'm experiencing a similar problem with our brand new octopus cluster (default settings, radosgw only, 18 NVMe OSDs, 352.21M objects, 1.5 TiB data). Whenever an OSD is deleting objects, it slowly consumes all io and makes increasingly little progress until it (eventually) completes.

I've inadvertently found 3 ways to trigger the problem:

  • Removing an S3 bucket (i.e. s3cmd rb). Took only a few hours to fill, but then triggered days of disk thrashing after deletion, accompanied by space usage slowly falling.
  • Adding more OSDs. The backfill onto the new OSDs completed quickly, but the original OSDs were then maxxed out for around a week as the disk usage again fell slowly.
  • Encountering an pg autoscale event. This scaled our default.rgw.buckets.data bucket from 64 to 128 pgs, and has been a complete nightmare. Each time an OSD finishes a backfill, it then starts thrashing with the deletions. Soon that OSD will be involved in another backfill, which makes almost no progress as the deletions are still running. The cluster quickly enters a state where every backfill involves a thrashing OSD (doing previous deletions), and recovery progress drops from 1-10 MB/s to around 100 B/s. Taking an OSD offline and compacting it (with ceph-kvstore-tool, takes 10-15mins) speeds up progress for a while. But when the next backfill on that OSD finishes, it starts thrashing all over again. Repeatedly manually compacting each osd (only 18 OSDs, but 64 pgs to backfill, happening one or two at a time) becomes a long-running game of whack-a-mole.

In several cases, our df charts show a smooth decay curve (instead of a linear drop in disk usage over time). This leads me to think that #47044 could well be the underlying cause, since it fits with deletions starting out quickly but slowing down over time. It could also explain why only clusters like ours, with high numbers of objects per pg, are really noticing this problem.

#22 Updated by Igor Fedotov about 2 years ago

Andy Allan wrote:

I'm experiencing a similar problem with our brand new octopus cluster (default settings, radosgw only, 18 NVMe OSDs, 352.21M objects, 1.5 TiB data). Whenever an OSD is deleting objects, it slowly consumes all io and makes increasingly little progress until it (eventually) completes.

I've inadvertently found 3 ways to trigger the problem:

  • Removing an S3 bucket (i.e. s3cmd rb). Took only a few hours to fill, but then triggered days of disk thrashing after deletion, accompanied by space usage slowly falling.
  • Adding more OSDs. The backfill onto the new OSDs completed quickly, but the original OSDs were then maxxed out for around a week as the disk usage again fell slowly.
  • Encountering an pg autoscale event. This scaled our default.rgw.buckets.data bucket from 64 to 128 pgs, and has been a complete nightmare. Each time an OSD finishes a backfill, it then starts thrashing with the deletions. Soon that OSD will be involved in another backfill, which makes almost no progress as the deletions are still running. The cluster quickly enters a state where every backfill involves a thrashing OSD (doing previous deletions), and recovery progress drops from 1-10 MB/s to around 100 B/s. Taking an OSD offline and compacting it (with ceph-kvstore-tool, takes 10-15mins) speeds up progress for a while. But when the next backfill on that OSD finishes, it starts thrashing all over again. Repeatedly manually compacting each osd (only 18 OSDs, but 64 pgs to backfill, happening one or two at a time) becomes a long-running game of whack-a-mole.

In several cases, our df charts show a smooth decay curve (instead of a linear drop in disk usage over time). This leads me to think that #47044 could well be the underlying cause, since it fits with deletions starting out quickly but slowing down over time. It could also explain why only clusters like ours, with high numbers of objects per pg, are really noticing this problem.

Hi Andy,
looks like there is a bunch of issues with PG removal but in your case you might be still suffering from RocksDB's rm_delete_range utilization which creates multiple tombstones in DB. The latter is known to cause slow DB access, see https://github.com/ceph/ceph/pull/33439 and https://github.com/ceph/ceph/pull/31442

I presume that under some circumstances one might overcome 1M entry threshold and rm_delete_ranges are applied.
Could you please raise rocksdb_delete_range_threshold to e.g. 1 billion. This wouldn't make removal process faster but might decrease/eliminate I/O thrashing.
Additionally you might want to tune osd_delete_sleep_ssd setting and for instance raise it 0.05 (or higher). This should make removal process less intrusive.

Better to apply the above recommendations one-by-one and learn which option is helpful if any.

I'm in the process of fixing this issue and the above will help me in understanding how good I realize the problem exposure in the field. Thanks in advance!

#23 Updated by Andy Allan about 2 years ago

Thanks for your help Igor.

Could you please raise rocksdb_delete_range_threshold to e.g. 1 billion. This wouldn't make removal process faster but might decrease/eliminate I/O thrashing.

I set this on all the OSDs (via ceph config set) and also restarted every OSD, just in case that was required. Unfortunately there has been no change in behaviour after 36 hours. After 24 hours I chose some OSDs that were thrashing, and ran the offline compaction, but they have also now returned to the thrashing state again.

Additionally you might want to tune osd_delete_sleep_ssd setting and for instance raise it 0.05 (or higher). This should make removal process less intrusive.

I'll set that tomorrow and see what happens.

#24 Updated by Andy Allan about 2 years ago

Andy Allan wrote:

I'll set that tomorrow and see what happens.

Setting osd_delete_sleep_ssd to 0.05 didn't make much difference - perhaps slightly faster to backfill during thrashing, but not significantly. And of course the thrashing continued too.

The backfilling finished on the 26th, so about 32 days(!) after the pg autoscale event. The thrashing continues on some disks, along with slowly falling disk space usage on those disks. I take the affected ones offline, one by one, each day for compaction. But sometimes they start thrashing again within a couple of hours, and now that backfilling is finished I don't know what could be triggering this - perhaps there is some other compaction going on, and ceph-kvstore-tool bluestore-kv isn't helping with that. However, I expect that all the thrashing will cease over the next few days, given past behaviour.

I'm happy to provide more data or try different things, if that's helpful. In any case, I thought it would be illustrative to share my experience here.

#25 Updated by Igor Fedotov about 2 years ago

Andy, thanks for the update.
Can you give a brief try to osd_delete_sleep_ssd = 1? Most probably this wouldn't speed up the removal process but hopefully will eliminate the thrashing.

#26 Updated by Andy Allan about 2 years ago

Can you give a brief try to osd_delete_sleep_ssd = 1? Most probably this wouldn't speed up the removal process but hopefully will eliminate the thrashing.

I set that yesterday, but after 16 hours changing it hasn't made any noticeable difference in our disk utilization graphs.

#27 Updated by Eric Petit about 2 years ago

Besides recently we switched backed to direct IO for bluefs, see https://github.com/ceph/ceph/pull/34297
Likely your nautilus release hasn't got this modification yet though.

Not sure if directly related to this issue, but FWIW I upgraded clusters from 14.2.8 to 14.2.11 recently and immediately started to see a lot of disk trashing and slow requests from PG removals, which stopped once I forced bluefs_buffered_io back to true. The page cache appears to help significantly in my case - having large spinners and a rocksdb significantly larger than what may fit in osd_memory_target.

#28 Updated by Dan van der Ster about 2 years ago

Eric Petit wrote:

Besides recently we switched backed to direct IO for bluefs, see https://github.com/ceph/ceph/pull/34297
Likely your nautilus release hasn't got this modification yet though.

Not sure if directly related to this issue, but FWIW I upgraded clusters from 14.2.8 to 14.2.11 recently and immediately started to see a lot of disk trashing and slow requests from PG removals, which stopped once I forced bluefs_buffered_io back to true. The page cache appears to help significantly in my case - having large spinners and a rocksdb significantly larger than what may fit in osd_memory_target.

Eric How is that @bluefs_buffered_io = true working for you? We are considering to re-enable it to help workaround the pg deletion slowness.

#29 Updated by Eric Petit about 2 years ago

Eric How is that @bluefs_buffered_io = true working for you? We are considering to re-enable it to help workaround the pg deletion slowness.

Working fine for me - been about a month now and several rounds of rebalancing from replacing failing disks, I have not noticed slowness from PG deletion again nor any other side effects

#30 Updated by Dan van der Ster about 2 years ago

Eric Petit wrote:

Eric How is that @bluefs_buffered_io = true working for you? We are considering to re-enable it to help workaround the pg deletion slowness.

Working fine for me - been about a month now and several rounds of rebalancing from replacing failing disks, I have not noticed slowness from PG deletion again nor any other side effects

Thanks for the reply.
I've submitted a pr to make this the default again: https://github.com/ceph/ceph/pull/38044

#31 Updated by Dan van der Ster about 2 years ago

@Igor another dimension to this that I haven't seen discussed yet -- AFAIU, PG deletion happens concurrently, for example during pool deletion or if an OSD is so slow at deleting PGs that they pile up and eventually several are deleting in parallel.

In this case, AFAICT, the osd_delete_sleep is applied per PG, so if for example we have 10 PG deletions ongoing, then the effective osd_delete_sleep would be 1/10th the configured value.

I don't fully understand what limits the number of concurrent PG deletions -- perhaps this is capped by the osd_op_num_shards and osd_op_num_threads_per_shard ?

Anyway, do you think this should be taken into account in the work to optimize PG removal?

#32 Updated by Joshua Baergen almost 2 years ago

I doubt it helps, but I just wanted to add a "me too" here on 14.2.11. We're augmenting a cluster and had moved a few hundred PGs, after which the cluster was essentially crippled until we enabled bluefs_buffered_io. Compacting did indeed help, but compacting our OSDs takes about 1h at a time and the OSD will be marked down / suicide unless buffered I/O is enabled, so offline is the safest option (which, at 24 hosts, takes a loooooong time).

What's potentially of interest is that we did several rounds of moving thousands of PGs on this system on Luminous (with bluefs_buffered_io off of course) prior to this without issue.

#33 Updated by Joshua Baergen almost 2 years ago

Actually, I should be careful - we have definitely seen the symptom of high read rate on Luminous (https://tracker.ceph.com/issues/36482), but somehow it didn't lead to crippling system performance like it did on Nautilus.

#34 Updated by Joshua Baergen almost 2 years ago

Things keep occurring to me after I press <enter>. :)

When this issue occurs on our spinners, the read rate is very high - I've seen 800+ read IOPS. Since there's (almost?) no way that this could be occurring from the platter, this seems to tell me that whatever is happening is highly cacheable and thus at least part of the workload can largely be served by the HDD's cache.

#35 Updated by Joshua Baergen almost 2 years ago

Hey Dan/Eric, did either of you see a big increase in the number of writes hitting your disks when buffered mode was enabled?

#36 Updated by Dan van der Ster almost 2 years ago

Joshua Baergen wrote:

Hey Dan/Eric, did either of you see a big increase in the number of writes hitting your disks when buffered mode was enabled?

Not really, during large PG removals switching on buffered io had the opposite effect -- from hundreds of MBps on the dedicated block.dbs down to nearly idle.

That said, buffered io is not a silver bullet solution for this particular issue. We've now paused some large data migrations until 14.2.17 is released with the PG removal optimization.

#37 Updated by Joshua Baergen almost 2 years ago

Interesting, thanks. Is that 14.2.17 change this one: https://tracker.ceph.com/issues/47044 ?

FWIW, what I'm seeing is that the buffer commit caused by sync_file_range() is issuing a bunch of 512B I/Os, which are leaking through dm-crypt slow enough that they're not being merged by the I/O scheduler.

#38 Updated by Dan van der Ster almost 2 years ago

Joshua Baergen wrote:

Interesting, thanks. Is that 14.2.17 change this one: https://tracker.ceph.com/issues/47044 ?

Waiting for this: https://github.com/ceph/ceph/pull/38478

FWIW, what I'm seeing is that the buffer commit caused by sync_file_range() is issuing a bunch of 512B I/Os, which are leaking through dm-crypt slow enough that they're not being merged by the I/O scheduler.

No info on that, sorry.

#39 Updated by Joshua Baergen almost 2 years ago

No problem, and thanks for confirming!

#40 Updated by Andy Allan over 1 year ago

Andy Allan wrote:

In several cases, our df charts show a smooth decay curve (instead of a linear drop in disk usage over time). This leads me to think that #47044 could well be the underlying cause, since it fits with deletions starting out quickly but slowing down over time. It could also explain why only clusters like ours, with high numbers of objects per pg, are really noticing this problem.

I just wanted to jump back in here, to say that upgrading to 15.2.9 has resolved this problem for our cluster - pg deletions now happen linearly and complete in a reasonable amount of time, without any thrashing. Thanks Igor for the fix!

#41 Updated by Neha Ojha over 1 year ago

  • Backport set to pacific, octopus, nautilus
  • Pull request ID set to 38044

#42 Updated by Neha Ojha over 1 year ago

  • Status changed from New to Pending Backport

#43 Updated by Backport Bot over 1 year ago

  • Copied to Backport #49964: octopus: BlueStore::_collection_list causes huge latency growth pg deletion added

#44 Updated by Backport Bot over 1 year ago

  • Copied to Backport #49965: pacific: BlueStore::_collection_list causes huge latency growth pg deletion added

#45 Updated by Backport Bot over 1 year ago

  • Copied to Backport #49966: nautilus: BlueStore::_collection_list causes huge latency growth pg deletion added

#46 Updated by Loïc Dachary over 1 year ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF