Bug #52804

closed

pacific: Hybrid Allocator exhibits high tail latency for writes in Octopus

Added by Kellen Renshaw over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a cluster upgrade from Mimic (v13.2.X) to Octopus (v15.2.12), write operations showed increased tail latency across the cluster. All OSDs are backed by NVMe SSDs.

Aside from the version upgrade, the significant settings changes were:
The {bluefs|bluestore}_allocator setting changed from bitmap to hybrid.
The bluestore_min_alloc_size_ssd setting was decreased from 16384 to 4096.

After switching the allocator back to bitmap (leaving bluestore_min_alloc_size_ssd at 4096), the observed tail latency returned to prior levels.

Debug logs were collected from a sample of the OSDs with:
ceph daemon osd.{id} config set debug_bluestore 20/5
ceph daemon osd.{id} config set debug_bluefs 20/5
ceph daemon osd.{id} config set debug_bdev 20/3

In the debug logs collected while the hybrid allocator was in use, latencies of up to 200 ms were observed in the sampled OSDs.
Example:
2021-09-22T20:25:17.452+0000 7f693fe1a700 10 HybridAllocator allocate want 0x8000 unit 0x4000 max_alloc_size 0x8000 hint 0x0
...snip... (fbmap_alloc messages omitted)
2021-09-22T20:25:17.580+0000 7f693fe1a700 20 bluestore(/var/lib/ceph/osd/ceph-259) _do_alloc_write prealloc [0xd4988000~4000,0x6b7920000~4000]
Time elapsed (s.mmm): 0.128
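
The elapsed time above is the difference between the timestamp of the HybridAllocator allocate line and the following _do_alloc_write prealloc line on the same thread. A minimal sketch of how that could be extracted from a debug log is shown below. It assumes the line layout seen in the two sample lines (timestamp, thread id, debug level, message); the helper names parse_line and allocation_latencies are hypothetical and not part of Ceph.

```python
import re
from datetime import datetime

# Assumed layout, taken from the sample lines in this report:
# "<timestamp> <thread id> <debug level> <message>"
LINE_RE = re.compile(r"^(\S+)\s+([0-9a-f]+)\s+\d+\s+(.*)$")


def parse_line(line):
    """Return (timestamp, thread id, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
    return ts, m.group(2), m.group(3)


def allocation_latencies(lines):
    """Yield (thread id, seconds) for each allocate -> prealloc pair on one thread."""
    pending = {}  # thread id -> timestamp of the most recent allocate request
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip "...snip..." markers and other non-log lines
        ts, thread, msg = parsed
        if msg.startswith("HybridAllocator allocate want"):
            pending[thread] = ts
        elif "_do_alloc_write prealloc" in msg and thread in pending:
            yield thread, (ts - pending.pop(thread)).total_seconds()
```

Feeding the two sample lines above through allocation_latencies yields one pair for thread 7f693fe1a700 with the same 0.128 s gap. Sorting the yielded values from a full log gives a rough view of the tail.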

Investigation of running OSDs showed significant time spent in AvlAllocator::_block_picker. Precise timing information is difficult to obtain because the AvlAllocator::_allocate functions have no debug statements in v15.2.12.

The current theory is that this backport to Octopus (https://github.com/ceph/ceph/pull/38474) causes significant time to be spent in the AvlAllocator, due to the loop that attempts to collect smaller extents.

It may be that https://github.com/ceph/ceph/pull/41615 in master would address this by limiting the iterations over the tree.
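
To illustrate the reasoning, here is a toy model only. It is not the AvlAllocator code and does not claim to reproduce what either PR implements. It shows why an unbounded in-order scan over many small free extents can take many steps per allocation, and how a cap on iterations bounds that cost. The names first_fit and max_steps and the extent layout are hypothetical.

```python
def first_fit(free_extents, want, max_steps=None):
    """Toy first-fit scan over (offset, length) extents in iteration order.

    Returns (extent or None, candidates examined). If max_steps is set,
    the scan gives up once that many candidates have been examined.
    """
    steps = 0
    for offset, length in free_extents:
        if max_steps is not None and steps >= max_steps:
            return None, steps  # give up; a caller would fall back to another strategy
        steps += 1
        if length >= want:
            return (offset, length), steps
    return None, steps


# Hypothetical fragmented free space: many 0x4000 extents, then one 0x8000 extent.
fragmented = [(i * 0x8000, 0x4000) for i in range(1000)] + [(1000 * 0x8000, 0x8000)]

unbounded = first_fit(fragmented, 0x8000)               # walks every small extent first
bounded = first_fit(fragmented, 0x8000, max_steps=100)  # stops after 100 candidates
```

In this model the bounded call trades a possibly missed fit for a fixed worst-case cost per request, which is the kind of trade-off a limit on tree iterations would make.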

Work is in progress to recreate the observed tail latencies in a test cluster, to provide more conclusive data in support of the AvlAllocator::_block_picker theory and the possible improvements from PR #41615.


Files

ceph-osd.184.log.alloc-wumh.xz (31.5 KB), Mauricio Oliveira, 10/15/2021 10:38 PM
ceph52804-results-pacific.png (40.9 KB), Mauricio Oliveira, 10/29/2021 09:18 PM
ceph52804-results-octopus.png (40.5 KB), Mauricio Oliveira, 10/29/2021 09:59 PM