Bug #52804

closed

pacific: Hybrid Allocator exhibits high tail latency for writes in Octopus

Added by Kellen Renshaw over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a cluster upgrade from Mimic (v13.2.X) to Octopus (v15.2.12), the cluster exhibited increased tail latency for write operations. All OSDs are NVMe SSDs.

Aside from the version upgrade, the significant settings changes were:
- The {bluefs|bluestore}_allocator setting changed from bitmap to hybrid.
- bluestore_min_alloc_size_ssd was decreased from 16384 to 4096.

After switching the allocator back to bitmap (leaving bluestore_min_alloc_size_ssd at 4096), the observed tail latency returned to prior levels.
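
Roughly, the switch back can be done like this (a sketch; the exact commands used are not recorded in this ticket, and the allocator options are only read at OSD startup, so each OSD has to be restarted for the change to take effect):

$ ceph config set osd bluestore_allocator bitmap
$ ceph config set osd bluefs_allocator bitmap
$ # then restart the OSDs, e.g. one by one:
$ # systemctl restart ceph-osd@<id>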

Debug logs were collected from a sample of the OSDs with
ceph daemon osd.{id} config set debug_bluestore 20/5
ceph daemon osd.{id} config set debug_bluefs 20/5
ceph daemon osd.{id} config set debug_bdev 20/3

In debug logs collected while the hybrid allocator was in use, allocation latencies of up to 200 ms were observed in the sampled OSDs.
Example:
2021-09-22T20:25:17.452+0000 7f693fe1a700 10 HybridAllocator allocate want 0x8000 unit 0x4000 max_alloc_size 0x8000 hint 0x0
...snip... (fbmap_alloc messages omitted)
2021-09-22T20:25:17.580+0000 7f693fe1a700 20 bluestore(/var/lib/ceph/osd/ceph-259) _do_alloc_write prealloc [0xd4988000~4000,0x6b7920000~4000]
Time elapsed (s.mmm): 0.128
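
A rough way to extract such elapsed times from the debug log (a sketch, assuming GNU awk; it pairs each "HybridAllocator allocate want" line with the next "_do_alloc_write prealloc" line from the same thread, ignores day rollover, and the log file name is illustrative):

awk '
  function tsec(ts,   t, h) {  # "2021-09-22T20:25:17.452+0000" -> seconds of day
    split(ts, t, "T"); sub(/\+.*/, "", t[2]); split(t[2], h, ":");
    return h[1] * 3600 + h[2] * 60 + h[3];
  }
  /HybridAllocator allocate want/ { start[$2] = tsec($1) }
  /_do_alloc_write prealloc/ && ($2 in start) {
    printf "thread %s elapsed %.3f s\n", $2, tsec($1) - start[$2];
    delete start[$2];
  }
' ceph-osd.259.log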

Investigation of running OSDs showed significant time being spent in AvlAllocator::_block_picker. Precise timing information is difficult to obtain because the AvlAllocator::_allocate functions have no debug statements in v15.2.12.

The current theory is that this backport to Octopus (https://github.com/ceph/ceph/pull/38474) may be causing the significant time spent in the AvlAllocator, due to the loop that attempts to collect smaller extents.

It may be that https://github.com/ceph/ceph/pull/41615 in master would address this by limiting the iterations over the tree.

We are working on recreating the observed tail latencies in a test cluster to provide more conclusive data supporting the AvlAllocator::_block_picker theory and the possible improvements from PR #41615.


Files

ceph-osd.184.log.alloc-wumh.xz (31.5 KB) ceph-osd.184.log.alloc-wumh.xz Mauricio Oliveira, 10/15/2021 10:38 PM
ceph52804-results-pacific.png (40.9 KB) ceph52804-results-pacific.png Mauricio Oliveira, 10/29/2021 09:18 PM
ceph52804-results-octopus.png (40.5 KB) ceph52804-results-octopus.png Mauricio Oliveira, 10/29/2021 09:59 PM
Actions #1

Updated by Neha Ojha over 2 years ago

  • Project changed from RADOS to bluestore
  • Category deleted (Performance/Resource Usage)
Actions #2

Updated by Igor Fedotov over 2 years ago

Hi Kellen,
could you please collect and share an allocator free-chunks dump via the "ceph daemon osd.N bluestore allocator dump block" command.

We'll be able to recreate allocator's state from that dump and do some experiments/troubleshooting.
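
If helpful, the dump can be redirected straight to a file, for example (osd id and output path are illustrative):

$ ceph daemon osd.184 bluestore allocator dump block > ceph-osd.184_alloc_dump.json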

Thanks,
Igor

Actions #3

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Neha Ojha over 2 years ago

  • Description updated (diff)
Actions #6

Updated by Mauricio Oliveira over 2 years ago

Good progress this week!

We were able to reproduce the issue using the allocator's state dump and the allocation requests from the OSD debug log.

We added a fake unit test to `Allocator_test.cc` that reads the allocator parameters, free chunks, and allocation requests from stdin (to initialize, configure, and exercise the allocator), and prints to stdout the elapsed time (latency) of each allocation request in milliseconds.

We grouped the allocations by elapsed time (i.e., there are N allocations that took T milliseconds, each) at 10ms granularity.
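
For example, the per-bucket counts can be produced from the test output with something like the following (a sketch; the file name matches the usage in Part 4, and averaging across the 3 runs is not shown):

$ grep '^delta-ms:' ~/test.alloc2.run1.log \
  | awk '{ b = int($2 / 10) * 10; count[b]++ }
         END { for (b in count) printf "%d~%d: %d\n", b, b + 9, count[b] }' \
  | sort -n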

Findings:
- Part 1: The AVL allocator has a much longer tail than the bitmap allocator.
- Part 2: This doesn't seem to be a regression from PR #38474 (the ENOSPC fix/loop).
- Part 3: Improvements from PR #41615, on top of other changes for AVL.
- Part 4: The fake unit test and usage/steps.

Please let us know of any concerns/suggestions you may have with/for this approach.

It seems we should try to backport the commits from Part 3.

Let's see what the comments bring up and we can review it next week.

cheers,
Mauricio

Actions #7

Updated by Mauricio Oliveira over 2 years ago

Part 1)

The numbers for the bitmap and AVL allocators show a long tail for AVL only:
- average of 3 runs
- bitmap's tail ends at 100~109ms
- avl's tail extends to 610~619ms
- avl seems to move most allocations taking 10~19ms down to 0~9ms,
  but spreads those taking 20~29ms over 30~59ms and beyond, up to 619ms (a long tail).

- Bitmap:

[ms]:     [#alloc]
0~9:     3015558.67
10~19:     8759.00
20~29:     1489.67
30~39:     370.67
40~49:     84.00
50~59:     23.67
60~69:     7.00
70~79:     4.33
80~89:     1.00
90~99:     1.00
100~109: 1.33

- AVL:

[ms]:     [#alloc]    [delta to bitmap]
0~9:     3023100.67    7542.00
10~19:     1157.33    -7601.67
20~29:     560.00        -929.67
30~39:     510.67        140.00
40~49:     767.00        683.00
50~59:     128.00        104.33
60~69:     19.00        12.00
70~79:     21.00        16.67
80~89:     15.00        14.00
90~99:     3.33        2.33
100~109: 4.33        3.00
...    
110~119: 0.67
120~129: 1.67
130~139: 1.33
140~149:
150~159: 1.00
160~169:    
170~179:    
180~189: 1.00
190~199:    
…    
200~209: 0.67
210~219: 0.33
220~229:    
230~239: 1.00
240~249: 0.67
250~259: 1.00
260~269: 1.00
270~279: 0.33
280~289:    
290~299:    
…    
310~319: 0.33
320~329: 0.33
330~339: 0.33
…    
550~559: 0.67
560~569: 0.33
570~579: 0.33
590~599: 0.33
…    
610~619: 0.33
Actions #8

Updated by Mauricio Oliveira over 2 years ago

Part 2)

This does not seem to be a regression from this commit introduced in v15.2.9:
c25def8 octopus: os/bluestore: fix inappropriate ENOSPC from avl/hybrid allocator

- AVL numbers with it reverted are very similar:

[ms]:     [#alloc]    [delta to bitmap]
0~9:     3023137.67    7579.00
10~19:     1172.33    -7586.67
20~29:     564.67        -925.00
30~39:     540.33        169.67
40~49:     709.67        625.67
50~59:     102.67        79.00
60~69:     17.00        10.00
70~79:     20.67        16.33
80~89:     11.67        10.67
90~99:     8.00        7.00
100~109: 1.33        0.00
…        
110~119: 0.67    
120~129: 2.00    
130~139: 1.33    
140~149: 0.67    
150~159: 0.33    
160~169:        
170~179: 0.33    
180~189: 0.33    
190~199: 0.67    
…        
200~209: 0.67    
210~219:        
220~229: 1.00    
230~239: 0.33    
240~249: 1.33    
250~259: 0.33    
260~269: 0.33    
270~279: 0.33    
280~289: 0.67    
290~299:        
…        
300~309: 0.67    
310~319: 0.67    
320~329: 0.33    
330~339:        
…        
540~549: 0.33    
550~559:        
560~569:        
570~579: 0.33    
580~589: 0.33    
590~599:
Actions #9

Updated by Mauricio Oliveira over 2 years ago

Part 3)

With these commits for the AVL allocator backported to v15.2.12 (at least one is already in the latest 15.2.x), ...

fd5ca26 os/bluestore: do not call _block_picker() again if already searched from start()
c732060 os/bluestore: Improve _block_picker function
0eed13a os/bluestore: fix unexpected ENOSPC in Avl/Hybrid allocators.
4837166 os/bluestore/AvlAllocator: specialize _block_picker()
40f05b9 os/bluestore/AvlAllocator: introduce bluestore_avl_alloc_ff_max_search_count
5a26875 os/bluestore/AvlAllocator: introduce bluestore_avl_alloc_ff_max_search_bytes

there's a reasonable improvement:
- average of 3 runs
- avl's tail now ends at 220~229ms, with 3 allocations in 110~199ms
  (down from 610~619ms, with 5.67 allocations in 110~199ms)
- avl seems to move most allocations taking 10~29ms (not just 10~19ms) down to 0~9ms,
  but spreads the remaining ones (far fewer) over 30~49ms, and has fewer in 50~79ms too.

[ms]:     [#alloc]    [delta to bitmap]
0~9:     3024305.00    8746.33
10~19:     926.67        -7832.33
20~29:     479.33        -1010.33
30~39:     401.33        30.67
40~49:     168.67        84.67
50~59:     6.00        -17.67
60~69:     2.33        -4.67
70~79:     3.00        -1.33
80~89:     1.00        0.00
90~99:     1.00        0.00
100~109: 1.50        0.17
…        
110~119: 1.00    
120~129:        
130~139:        
140~149: 1.00    
150~159: 0.33    
160~169: 0.33    
170~179: 0.33    
180~189:        
190~199:        
…        
200~209: 0.67    
210~219: 0.33    
220~229: 1.00    

Note: all 6 commits are needed for the improvement.
With only the first 4 or the first 2 commits, the results for the
0~109ms range are, on average, still mostly like having none of the commits.
There are only small improvements in the tail length:
- 4 commits: up to 279ms.
- 2 commits: up to 319ms.

Actions #10

Updated by Mauricio Oliveira over 2 years ago

Part 4)

Fake unit test:

@ ceph.git/src/test/objectstore/Allocator_test.cc


TEST_P(AllocTest, test_alloc_repro52804)
{
  std::string line1, line2, line3, line4;
  uint64_t size, min_alloc_size;
  uint64_t offset, length;
  uint64_t want, unit, max, hint;

  std::cout << "Reading size and min_alloc_size ..." << std::endl;
  std::getline(std::cin, line1);
  size = std::stoul(line1, nullptr, 10);
  std::getline(std::cin, line1);
  min_alloc_size = std::stoul(line1, nullptr, 10);
  std::cout << "size: " << size << ", min_alloc_size: " << min_alloc_size << std::endl;

  init_alloc(size, min_alloc_size);

  /* free extents from the allocator dump: one offset/length pair per two lines */
  std::cout << "Reading offset and length (until 2 empty lines) ..." << std::endl;
  while (std::getline(std::cin, line1) &&
         std::getline(std::cin, line2)) {
    if (line1.empty() && line2.empty())
      break;

    offset = std::stoul(line1, nullptr, 0);
    length = std::stoul(line2, nullptr, 0);

    alloc->init_add_free(offset, length);
  }

  /* allocation requests from the OSD debug log: want/unit/max/hint, one value per line */
  std::cout << "Reading want, unit, max, hint (until EOF) ..." << std::endl;
  while (std::getline(std::cin, line1) &&
         std::getline(std::cin, line2) &&
         std::getline(std::cin, line3) &&
         std::getline(std::cin, line4)) {
    want = std::stoul(line1, nullptr, 0);
    unit = std::stoul(line2, nullptr, 0);
    max  = std::stoul(line3, nullptr, 0);
    hint = std::stoul(line4, nullptr, 0);

    /* timestamp in milliseconds (before); int64_t so the epoch count is not truncated */
    auto ts1 = std::chrono::system_clock::now();
    auto ts1_ms = std::chrono::time_point_cast<std::chrono::milliseconds>(ts1);
    int64_t ms1 = ts1_ms.time_since_epoch().count();

    /* allocate */
    PExtentVector extents;
    alloc->allocate(want, unit, max, hint, &extents);

    /* timestamp in milliseconds (after) */
    auto ts2 = std::chrono::system_clock::now();
    auto ts2_ms = std::chrono::time_point_cast<std::chrono::milliseconds>(ts2);
    int64_t ms2 = ts2_ms.time_since_epoch().count();

    /* std::dec resets the base left in the stream by std::hex on the previous iteration */
    std::cout << std::dec << "delta-ms: " << (ms2 - ms1)
              << " want/unit/max/hint (hex): " << std::hex
              << want << "/" << unit << "/" << max << "/" << hint
              << std::endl;

    /* Do not release. */
    //alloc->release(extents);
    extents.clear();
  }
}

Usage:

$ ./bin/unittest_alloc --gtest_list_tests 2>/dev/null | grep 52804
  test_alloc_repro52804/0  # GetParam() = "stupid" 
  test_alloc_repro52804/1  # GetParam() = "bitmap" 
  test_alloc_repro52804/2  # GetParam() = "avl" 
  test_alloc_repro52804/3  # GetParam() = "hybrid" 

$ for a in 1 2; do
    for i in 1 2 3; do 
      echo "ALLOC $a - RUN $i" 

      (echo 3840753532928 
       echo 4096 

       cat ~/ceph-osd.184_alloc_dump.json | grep -e offset -e length | grep -o '0x[0-9a-f]\+'
       echo
       echo

       # repeat the captured allocation requests 100 times
       for n in $(seq 1 100); do
         cat ~/ceph-osd.184.log.alloc-wumh | tr ' ' '\n'
       done) \
       | ./bin/unittest_alloc --gtest_filter=Allocator/AllocTest.test_alloc_repro52804/$a \
           > ~/test.alloc$a.run$i.log; 
    done
  done

Input:

- parameters from the OSD
- ceph-osd.184_alloc_dump.json (comment #5)
- ceph-osd.184.log.alloc-wumh (attached)

The last one is just the want/unit/max_alloc_size/hint parameters from allocations in ceph-osd.184.log:

2021-09-22T20:26:13.588+0000 7f4c7e8b9700 10 HybridAllocator allocate want 0x4000 unit 0x4000 max_alloc_size 0x4000 hint 0x0
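
The attached file can be regenerated from the OSD log with something along these lines (a sketch; the exact command used is not recorded here):

$ grep 'HybridAllocator allocate want' ceph-osd.184.log \
  | sed -n 's/.*want \(0x[0-9a-f]*\) unit \(0x[0-9a-f]*\) max_alloc_size \(0x[0-9a-f]*\) hint \(0x[0-9a-f]*\).*/\1 \2 \3 \4/p' \
  > ceph-osd.184.log.alloc-wumh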

Actions #12

Updated by Igor Fedotov over 2 years ago

  • Status changed from Need More Info to New

@Mauricio - great findings, thanks a lot!
So the major point is that we need to backport all the mentioned patches to both Pacific and Octopus to improve tail latency in the AVL allocator, right?

It would be nice if you could run the same benchmark against the recently introduced "btree" allocator and share the numbers; see https://github.com/ceph/ceph/pull/41828

As a side note, you might want to use/adapt the https://github.com/ceph/ceph/blob/master/src/test/objectstore/allocator_replay_test.cc
tool to "productize" your benchmarking: it is already able to restore the allocator's state from a dump, and it would be great to extend it with that "allocation request replay" feature. Not to mention this wouldn't "spoil" the regular unit tests...

Actions #13

Updated by Igor Fedotov over 2 years ago

Additionally, it would be interesting to learn which allocations produce that "latency tail". Could you add some printing when an op's duration is long enough?
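
One way to do that without changing the test is to filter its existing output for slow operations, e.g. anything at or above 100 ms (file name as in the usage above):

$ grep '^delta-ms:' ~/test.alloc2.run1.log | awk '$2 >= 100'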

Actions #14

Updated by Igor Fedotov over 2 years ago

  • Status changed from New to Triaged
Actions #15

Updated by Mauricio Oliveira over 2 years ago

@Igor Gajowiak

Right, the key point is to backport the patches to Pacific/Octopus. I'm working on it, if that is OK w/ you.

Regarding the numbers with the btree allocator: unfortunately it hits an assertion failure that is reproducible on master (I first thought it might have been a bad backport), which suggests that the tree logic (or the assert) might not be quite right.

Kellen has been working on your suggestions for the allocator replay test and checking for a pattern in the allocations that produce the long tail latency.

Thanks!
Mauricio

...

Assertion failure w/ the btree allocator:

- commit bdad93759b ("Merge PR #43627 into master")
- repro steps in #10 and #11 (plus add "btree" at
src/test/objectstore/Allocator_test.cc's bottom.)
- happens on _remove_from_tree() during allocation,
a few seconds in during the test (not right away.)

../src/os/bluestore/BtreeAllocator.cc: In function 'void BtreeAllocator::_remove_from_tree(uint64_t, uint64_t)' thread 7f20d5b2ac80 time 2021-10-27T19:47:19.683195+0000
../src/os/bluestore/BtreeAllocator.cc: 171: FAILED ceph_assert(rs != range_tree.end())
 ceph version Development (no_version) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x127) [0x7f20d6a6a1ed]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f20d6a6a418]
 3: (BtreeAllocator::_remove_from_tree(unsigned long, unsigned long)+0xfa) [0x559b1692d8d8]
 4: (BtreeAllocator::_allocate(unsigned long, unsigned long, unsigned long*, unsigned long*)+0x18e) [0x559b1692ee12]
 5: (BtreeAllocator::_allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)5, bluestore_pextent
_t> >*)+0x55) [0x559b1692f325]
 6: (BtreeAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)5, bluestore_pextent_
t> >*)+0xbe) [0x559b1692f47a]
 7: (AllocTest_test_alloc_sf319356_Test::TestBody()+0x627) [0x559b168c5ccd]
 8: (void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x1f) [0x559b16908892]
 9: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x84) [0x559b16911757]
 10: (testing::Test::Run()+0xb8) [0x559b168feae8]
 11: (testing::TestInfo::Run()+0x108) [0x559b168fee02]
 12: (testing::TestSuite::Run()+0xb6) [0x559b168ff094]
 13: (testing::internal::UnitTestImpl::RunAllTests()+0x415) [0x559b16902e83]
 14: (bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*
)(), char const*)+0x1f) [0x559b16909055]
 15: (bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)()
, char const*)+0x84) [0x559b16911bf1]
 16: (testing::UnitTest::Run()+0x9e) [0x559b168feb88]
 17: main()
 18: /lib/x86_64-linux-gnu/libc.so.6(+0x2dfd0) [0x7f20d614efd0]
 19: __libc_start_main()
 20: _start()
*** Caught signal (Aborted) **
 in thread 7f20d5b2ac80 thread_name:unittest_alloc
2021-10-27T19:47:19.683+0000 7f20d5b2ac80 -1 ../src/os/bluestore/BtreeAllocator.cc: In function 'void BtreeAllocator::_remove_from_tree(uint64_t, uint64_t)' thread 7f20d5b2ac80 time
2021-10-27T19:47:19.683195+0000
../src/os/bluestore/BtreeAllocator.cc: 171: FAILED ceph_assert(rs != range_tree.end())

...

--- begin dump of recent events ---
   -28> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command assert hook 0x559b18998a80
   -27> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command abort hook 0x559b18998a80
   -26> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command leak_some_memory hook 0x559b18998a80
   -25> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command perfcounters_dump hook 0x559b18998a80
   -24> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command 1 hook 0x559b18998a80
   -23> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command perf dump hook 0x559b18998a80
   -22> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command perfcounters_schema hook 0x559b18998a80
   -21> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command perf histogram dump hook 0x559b18998a80
   -20> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command 2 hook 0x559b18998a80
   -19> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command perf schema hook 0x559b18998a80
   -18> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command perf histogram schema hook 0x559b18998a80
   -17> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command perf reset hook 0x559b18998a80
   -16> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command config show hook 0x559b18998a80
   -15> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command config help hook 0x559b18998a80
   -14> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command config set hook 0x559b18998a80
   -13> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command config unset hook 0x559b18998a80
   -12> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command config get hook 0x559b18998a80
   -11> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command config diff hook 0x559b18998a80
   -10> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command config diff get hook 0x559b18998a80
    -9> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command injectargs hook 0x559b18998a80
    -8> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command log flush hook 0x559b18998a80
    -7> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command log dump hook 0x559b18998a80
    -6> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command log reopen hook 0x559b18998a80
    -5> 2021-10-27T19:46:46.727+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command dump_mempools hook 0x559b196e8068
    -4> 2021-10-27T19:46:46.735+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command bluestore allocator dump 94124621378480 hook 0x559b189f33b0
    -3> 2021-10-27T19:46:46.735+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command bluestore allocator score 94124621378480 hook 0x559b189f33b0
    -2> 2021-10-27T19:46:46.735+0000 7f20d5b2ac80  5 asok(0x559b18a06000) register_command bluestore allocator fragmentation 94124621378480 hook 0x559b189f33b0
    -1> 2021-10-27T19:47:19.683+0000 7f20d5b2ac80 -1 ../src/os/bluestore/BtreeAllocator.cc: In function 'void BtreeAllocator::_remove_from_tree(uint64_t, uint64_t)' thread 7f20d5b2ac
80 time 2021-10-27T19:47:19.683195+0000
../src/os/bluestore/BtreeAllocator.cc: 171: FAILED ceph_assert(rs != range_tree.end())

162 void BtreeAllocator::_remove_from_tree(uint64_t start, uint64_t size)
163 {
164   uint64_t end = start + size;
165
166   ceph_assert(size != 0);
167   ceph_assert(size <= num_free);
168
169   auto rs = range_tree.find(start);
170   /* Make sure we completely overlap with someone */
171   ceph_assert(rs != range_tree.end());
172   ceph_assert(rs->first <= start);
173   ceph_assert(rs->second >= end);
174
175   _process_range_removal(start, end, rs);
176 }
Actions #16

Updated by Igor Fedotov over 2 years ago

Mauricio Oliveira wrote:

> @Igor Gajowiak
>
> Right, the key point is to backport the patches to Pacific/Octopus. I'm working on it, if that is OK w/ you.

Yeah, that's great!
If possible it would be great to have the Pacific backports ASAP, as a new minor release is coming in a week or two. Feel free to contact me directly via e-mail if you need some help with that.

> Regarding the numbers with the btree allocator: unfortunately it hits an assertion failure that is reproducible on master (I first thought it might have been a bad backport), which suggests that the tree logic (or the assert) might not be quite right.

OK, we'll need to investigate that further sooner or later, I believe...

> Kellen has been working on your suggestions for the allocator replay test and checking for a pattern in the allocations that produce the long tail latency.

Super!

> Thanks!
> Mauricio
>
> ...
>
> Assertion failure w/ the btree allocator:
>
> - commit bdad93759b ("Merge PR #43627 into master")
> - repro steps in comments #10 and #11 (plus add "btree" at the bottom of src/test/objectstore/Allocator_test.cc)
> - happens in _remove_from_tree() during allocation, a few seconds into the test (not right away)
>
> [...]

Actions #17

Updated by Mauricio Oliveira over 2 years ago

Igor Fedotov wrote:

> If possible it would be great to have the Pacific backports ASAP, as a new minor release is coming in a week or two. Feel free to contact me directly via e-mail if you need some help with that.

Ack; thanks. I'm actually working on testing the Pacific backport today, so that should be doable if all goes well.

cheers,
Mauricio

Actions #18

Updated by Mauricio Oliveira over 2 years ago

Pacific backport PR: https://github.com/ceph/ceph/pull/43745

Attaching chart with tail latency improvements.

Actions #20

Updated by Mauricio Oliveira over 2 years ago

Attaching chart with tail latency improvements (Octopus)

Actions #22

Updated by Neha Ojha over 2 years ago

  • Subject changed from Hybrid Allocator exhibits high tail latency for writes in Octopus to pacific: Hybrid Allocator exhibits high tail latency for writes in Octopus
  • Status changed from Triaged to Fix Under Review
Actions #23

Updated by Igor Fedotov over 2 years ago

  • Status changed from Fix Under Review to Resolved