Bug #52804

Updated by Neha Ojha over 2 years ago

After a cluster upgrade from Mimic (v13.2.X) to Octopus (v15.2.12), write operations showed increased tail latency across the cluster. All OSDs are on NVMe SSDs.

 Aside from the version upgrade, significant settings changes were: 
 The {bluefs|bluestore}_allocator setting changed from bitmap to hybrid. 
 The bluestore_min_alloc_size_ssd was decreased to 4096 from 16384. 

 After switching the allocator back to bitmap (leaving bluestore_min_alloc_size_ssd at 4096), the observed tail latency returned to prior levels. 

 Debug logs were collected from a sample of the OSDs with 
 ceph daemon osd.{id} config set debug_bluestore 20/5 
 ceph daemon osd.{id} config set debug_bluefs 20/5 
 ceph daemon osd.{id} config set debug_bdev 20/3 
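 To apply the same three settings to several OSDs, the commands above can be generated in a loop. The following is a minimal sketch, not part of the original report: the helper name `debug_commands` and the OSD id used in the usage note are illustrative assumptions. Because `ceph daemon` talks to the local admin socket, each command has to run on the host where that OSD lives.

```python
# Debug levels exactly as listed in the report.
DEBUG_LEVELS = {
    "debug_bluestore": "20/5",
    "debug_bluefs": "20/5",
    "debug_bdev": "20/3",
}


def debug_commands(osd_ids):
    """Return one argv list per (OSD, setting) pair.

    `debug_commands` is a hypothetical helper; it only builds the
    command lines and does not execute anything.
    """
    return [
        ["ceph", "daemon", f"osd.{osd_id}", "config", "set", name, level]
        for osd_id in osd_ids
        for name, level in DEBUG_LEVELS.items()
    ]
```

 For example, `debug_commands([259])` yields the three command lines for osd.259; each list can be passed to `subprocess.run` on the OSD's host.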

 Debug logs collected while the hybrid allocator was in use showed latencies of up to 200 ms in the sampled OSDs. 
 Example: 
 2021-09-22T20:25:17.452+0000 7f693fe1a700 10 HybridAllocator allocate want 0x8000 unit 0x4000 max_alloc_size 0x8000 hint 0x0 
 ...snip... (fbmap_alloc messages omitted) 
 2021-09-22T20:25:17.580+0000 7f693fe1a700 20 bluestore(/var/lib/ceph/osd/ceph-259) _do_alloc_write prealloc [0xd4988000~4000,0x6b7920000~4000] 
 Time elapsed (s.mmm): 0.128 
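 The elapsed time in the example can be reproduced from the two log timestamps. This is a minimal sketch that assumes every log line starts with a timestamp in the format shown above; the function names are illustrative, not part of any Ceph tooling.

```python
from datetime import datetime

# Timestamp layout seen in the log excerpt, e.g. 2021-09-22T20:25:17.452+0000
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def log_timestamp(line):
    """Parse the leading timestamp (first whitespace-separated field)."""
    return datetime.strptime(line.split()[0], TS_FORMAT)


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two log lines."""
    delta = log_timestamp(end_line) - log_timestamp(start_line)
    return delta.total_seconds()
```

 Applied to the `HybridAllocator allocate` line and the `_do_alloc_write prealloc` line above, `elapsed_seconds` gives the 0.128 s gap quoted in the example.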

 Investigation of running OSDs showed significant time spent in AvlAllocator::_block_picker. Precise timing information is difficult to find as the AvlAllocator::_allocate functions do not have debug statements in v15.2.12. 

 The current hypothesis is that the backport to Octopus in https://github.com/ceph/ceph/pull/38474 may cause significant time to be spent in the AvlAllocator, because of the loop that attempts to collect smaller extents. 

 It may be that PR #41615 (https://github.com/ceph/ceph/pull/41615) in master would address this by limiting the iterations over the tree. 
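 The effect of capping a search can be illustrated with a toy model. This sketch is not Ceph's AvlAllocator code and does not reproduce either PR: it uses a flat list of (offset, length) extents instead of a tree, and the names `first_fit`, `max_search` and `fragmented_free_list` are hypothetical. It only shows why an unbounded scan over many small free extents is slow and how an iteration limit bounds the cost.

```python
def first_fit(free_extents, want, max_search=None):
    """Scan extents in order and return (extent, examined).

    Returns (None, examined) when nothing fits or when the optional
    cap `max_search` is reached; a caller would then fall back to
    some other strategy.
    """
    examined = 0
    for extent in free_extents:
        if max_search is not None and examined >= max_search:
            break
        examined += 1
        if extent[1] >= want:  # extent is (offset, length)
            return extent, examined
    return None, examined


def fragmented_free_list(small_count, small_len, big_len):
    """Build many small extents followed by a single large one."""
    extents = [(i * 2 * small_len, small_len) for i in range(small_count)]
    extents.append((small_count * 2 * small_len, big_len))
    return extents
```

 With `fragmented_free_list(10000, 0x1000, 0x8000)` and a request of 0x8000, the unbounded call examines all 10001 extents before it succeeds, while `max_search=100` gives up after 100, which trades a possible miss for a bounded search time.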

 Work is in progress to recreate the observed tail latencies in a test cluster, to provide more conclusive data supporting the AvlAllocator::_block_picker theory and the possible improvements from PR #41615.
