Bug #52333

beast frontend performance regressions

Added by Casey Bodley over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
beast performance
Backport:
octopus pacific
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Mark Nelson did a bisect on rgw performance vs. cpu usage:

https://docs.google.com/spreadsheets/d/18WqVGdO_JxVWyboktK6N6raqSJZWBEyN2F4QOg29cqA

and identified two PRs related to the beast frontend that showed significant regressions:

https://github.com/ceph/ceph/pull/31580 which changed the coroutine stack allocator to use mmap/mprotect (resolved stack overflows)
https://github.com/ceph/ceph/pull/35355 which added timeouts to close idle connections
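
For context on the first of those PRs, a minimal standalone sketch of guard-page stack allocation in the mmap/mprotect style (illustrative only, not the actual Ceph/Boost.Context code): reserve the stack with mmap and make the lowest page PROT_NONE so an overflow faults immediately, at the cost of extra syscalls per coroutine stack.

// Illustrative guard-page stack allocator (not the actual Ceph/Boost code).
// Reserves a stack with mmap and makes the lowest page inaccessible so an
// overflow faults immediately instead of silently corrupting the heap.
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>
#include <stdexcept>

struct GuardedStack {
  void*  base = nullptr;   // start of the mapping (guard page)
  size_t size = 0;         // total mapping size, including the guard page
};

GuardedStack allocate_stack(size_t usable_bytes)
{
  const size_t page  = sysconf(_SC_PAGESIZE);
  const size_t total = ((usable_bytes + page - 1) / page + 1) * page; // + guard page

  void* p = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    throw std::runtime_error("mmap failed");

  // Stacks grow down, so the guard page sits at the lowest address.
  if (mprotect(p, page, PROT_NONE) != 0) {
    munmap(p, total);
    throw std::runtime_error("mprotect failed");
  }
  return GuardedStack{p, total};
}

void free_stack(const GuardedStack& s) { munmap(s.base, s.size); }

int main()
{
  GuardedStack s = allocate_stack(128 * 1024);
  std::printf("stack mapping at %p, %zu bytes (lowest page is a guard)\n",
              s.base, s.size);
  free_stack(s);
}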


Files

2021-09-12_19-44.png (475 KB) 2021-09-12_19-44.png Mark Kogan, 09/13/2021 07:58 AM
2021-09-12_17-21.png (431 KB) 2021-09-12_17-21.png Mark Kogan, 09/13/2021 08:19 AM
2021-09-12_17-31.png (401 KB) 2021-09-12_17-31.png Mark Kogan, 09/13/2021 08:19 AM
2021-09-14_18-28.png (598 KB) 2021-09-14_18-28.png Mark Kogan, 09/14/2021 03:47 PM
Screenshot_20210914_183339.png (208 KB) Screenshot_20210914_183339.png Mark Kogan, 09/14/2021 03:51 PM
Screenshot_20210914_183148.png (369 KB) Screenshot_20210914_183148.png Mark Kogan, 09/14/2021 03:51 PM
2021-09-14_18-55.png (387 KB) 2021-09-14_18-55.png Mark Kogan, 09/14/2021 04:00 PM

Related issues 2 (0 open, 2 closed)

Copied to rgw - Backport #53271: pacific: beast frontend performance regressions (Resolved, Cory Snyder)
Copied to rgw - Backport #53272: octopus: beast frontend performance regressions (Resolved, Cory Snyder)
Actions #1

Updated by Mark Kogan over 2 years ago

updating regarding testing of https://github.com/ceph/ceph/pull/35355 -- the beast request timeout

to reproduce, the following test methodology was used in a vstart environment:
- GETs of large objects (4MB in this case)
- tune the load generator so that the performance ceiling is dictated by compute rather than storage (iostat % io util)
  ('hsbench -t 8 -b 8 ...' in this case, on officinalis machine o07)

performance test results:

git reset --hard 2e9db1c9

  1. one commit before 5d5f9a0d "rgw: Add request timeout beast": 2e9db1c9 "Merge pull request #37658 from neha-ojha/wip-crimson-suite"

numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01 |& ccze
2021/08/31 15:13:18 Loop: 0, Int: TOTAL, Dur(s): 10.2, Mode: GET, Ops: 5000, MB/s: 1955.94, IO/s: 489

  • 489 IO/s is the nominal reference performance before 5d5f9a0d "rgw: Add request timeout beast"

git reset --hard 5d5f9a0d

  2. with the commit 5d5f9a0d "rgw: Add request timeout beast"

numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01 |& ccze
2021/08/31 15:07:55 Loop: 0, Int: TOTAL, Dur(s): 13.4, Mode: GET, Ops: 5000, MB/s: 1492.20, IO/s: 373

  • 373 IO/s is the performance after 5d5f9a0d rgw: Add request timeout beast

373/489 ≈ 0.76, i.e. a ~24% performance drop

  3. with the commit 5d5f9a0d "rgw: Add request timeout beast" AND radosgw --rgw_frontends="beast port=8000 request_timeout_ms=0" ...

numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01 |& ccze
2021/08/31 15:38:10 Loop: 0, Int: TOTAL, Dur(s): 10.1, Mode: GET, Ops: 5000, MB/s: 1982.43, IO/s: 496

  4. current master & --rgw_frontends="beast port=8000 request_timeout_ms=0"
    git branch -vv
  • master e675ea216a1 [origin/master] Merge pull request #42985 from cyx1231st/wip-seastore-lba-metrics

numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01 |& ccze
2021/08/31 16:33:01 Loop: 0, Int: TOTAL, Dur(s): 9.8, Mode: GET, Ops: 5000, MB/s: 2031.20, IO/s: 508

with request_timeout_ms=0 the performance is restored to the level before 5d5f9a0d "rgw: Add request timeout beast"

next steps: investigate the exact cause of the performance drop introduced by the 5d5f9a0d commit, and decide whether to change the default to request_timeout_ms=0
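
For reference, the behavior toggled by request_timeout_ms=0 amounts to whether a deadline timer is armed around request I/O at all; a simplified, hypothetical sketch of that pattern (the helper name and structure are assumptions, not the actual rgw_asio_frontend.cc code):

// Simplified sketch of "timeout disabled means no timer is armed at all"
// (hypothetical helper, not the actual rgw_asio_frontend.cc logic).
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

namespace net = boost::asio;
using namespace std::chrono_literals;

void arm_request_timeout(net::steady_timer& timer,
                         std::chrono::milliseconds request_timeout,
                         net::ip::tcp::socket& socket)
{
  if (request_timeout.count() == 0) {
    return; // request_timeout_ms=0: no timer, no per-request schedule/cancel churn
  }
  auto* s = &socket;
  timer.expires_after(request_timeout);
  timer.async_wait([s](boost::system::error_code ec) {
    if (!ec) {                          // the timer actually expired (not canceled)
      boost::system::error_code ignored;
      s->close(ignored);                // drop the idle/stuck connection
    }
  });
}

int main()
{
  net::io_context ioc;
  net::ip::tcp::socket sock(ioc);
  net::steady_timer timer(ioc);

  arm_request_timeout(timer, 0ms, sock);      // disabled: nothing is scheduled
  arm_request_timeout(timer, 65000ms, sock);  // enabled: one pending async_wait
  timer.cancel();                             // request finished in time

  // runs exactly one handler: the canceled wait (error_code == operation_aborted)
  std::cout << "handlers run: " << ioc.run() << "\n";
}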

Actions #2

Updated by Casey Bodley over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Mark Kogan
Actions #3

Updated by Mark Kogan over 2 years ago

  • Assignee changed from Mark Kogan to Greg Farnum
  • Source deleted (Development)
  • Severity deleted (3 - minor)

adding...
found that the performance drop is incurred specifically by the

timeout.expires_after(request_timeout);

at: https://github.com/ceph/ceph/blob/27faeaf129b3fa1c87dfdc957d8692efd2010d5e/src/rgw/rgw_asio_frontend.cc#L78
and: https://github.com/ceph/ceph/blob/27faeaf129b3fa1c87dfdc957d8692efd2010d5e/src/rgw/rgw_asio_frontend.cc#L196
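
For illustration, the per-operation cost behind that call boils down to scheduling and canceling a steady_timer around every timed async read/write; a simplified standalone approximation of that churn (not the actual Beast basic_stream internals):

// Rough approximation of the per-operation work behind tcp_stream timeouts
// (simplified; the real logic lives in boost::beast::basic_stream's
// timeout_handler / async ops, this only shows the schedule/cancel churn).
#include <boost/asio.hpp>
#include <chrono>
#include <cstdio>

namespace net = boost::asio;
using namespace std::chrono_literals;

int main()
{
  net::io_context ioc;
  net::steady_timer timer(ioc);

  // Every timed async read/write on the stream effectively does:
  //   1. expires_after(): cancel any pending wait and re-arm the deadline
  //   2. async_wait(): allocate and enqueue a new timeout handler
  //   3. on i/o completion: cancel() the wait again
  // With many small reads/writes per request this happens constantly,
  // which is where the extra CPU time goes.
  int timeouts_fired = 0;
  for (int op = 0; op < 5; ++op) {
    timer.expires_after(65s);                       // (1) re-arm the deadline
    timer.async_wait([&](boost::system::error_code ec) {
      if (!ec) ++timeouts_fired;                    // only if it really expired
    });
    // ... the actual async_read_some/async_write_some would run here ...
    timer.cancel();                                 // (3) i/o finished in time
  }

  ioc.run();
  std::printf("operations: 5, timeouts fired: %d\n", timeouts_fired);
}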

BUT it occurs only when built with `-DALLOCATOR=libc` and does NOT occur when built with `-DALLOCATOR=tcmalloc`

summary table of the performance:
libc + request_timeout_ms=65000 -> IO/s: 375
libc + request_timeout_ms=0 -> IO/s: 477

tcmalloc + request_timeout_ms=65000 -> IO/s: 622
tcmalloc + request_timeout_ms=0 -> IO/s: 637

benchmark run details:

env CC="/usr/bin/gcc" CXX="/usr/bin/g++" LDFLAGS="-Wl,--no-as-needed" \
./do_cmake.sh -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DDIAGNOSTICS_COLOR=always -DBOOST_J=$(nproc) -DALLOCATOR=libc \
-DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SYSTEM_BOOST=OFF
time ionice nice cmake --build ./build -- radosgw -j $(nproc --ignore=1)

bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=65000" --debug_ms=0 --debug_rgw=1 -f

numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
2021/09/02 16:25:52 Loop: 0, Int: TOTAL, Dur(s): 13.3, Mode: GET, Ops: 5000, MB/s: 1501.59, IO/s: 375

bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=0" --debug_ms=0 --debug_rgw=1 -f

numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
2021/09/02 16:28:02 Loop: 0, Int: TOTAL, Dur(s): 10.5, Mode: GET, Ops: 5000, MB/s: 1906.18, IO/s: 477

# 375/477 ≈ 0.79, i.e. a ~21% performance drop

env CC="/usr/bin/gcc" CXX="/usr/bin/g++" LDFLAGS="-Wl,--no-as-needed" \
./do_cmake.sh -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DDIAGNOSTICS_COLOR=always -DBOOST_J=$(nproc) -DALLOCATOR=tcmalloc \
-DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SYSTEM_BOOST=OFF
time ionice nice cmake --build ./build -- radosgw -j $(nproc --ignore=1)

bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=65000" --debug_ms=0 --debug_rgw=1 -f

numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
2021/09/02 16:16:48 Loop: 0, Int: TOTAL, Dur(s): 8.0, Mode: GET, Ops: 5000, MB/s: 2489.98, IO/s: 622

bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=0" --debug_ms=0 --debug_rgw=1 -f

numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
2021/09/02 16:20:29 Loop: 0, Int: TOTAL, Dur(s): 7.8, Mode: GET, Ops: 5000, MB/s: 2548.63, IO/s: 637

Actions #4

Updated by Greg Farnum over 2 years ago

  • Assignee changed from Greg Farnum to Mark Kogan

I think some stray keystrokes erroneously assigned this to me?

Actions #5

Updated by Mark Kogan over 2 years ago

  • Assignee changed from Mark Kogan to Greg Farnum

adding more info as the performance investigation continues -

the CPU load overhead and performance penalty incurred by enabling the `request_timeout_ms=65000`
varies depending on the number of threads (`--rgw_thread_pool_size`)

whilst the load is held constant, 8 threads over 8 buckets (-t 8 -b 8):

hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01

and radosgw started with `request_timeout_ms=65000` and varying `--rgw_thread_pool_size=X`

./bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=65000" --debug_ms=0 --debug_rgw=1
--rgw_thread_pool_size=X -f

the results are as follows (note the `IO/s:`):

radosgw: --rgw_thread_pool_size=8
hsbench: 2021/09/08 13:13:29 Loop: 0, Int: TOTAL, Dur(s): 10.3, Mode: GET, Ops: 5000, MB/s: 1932.80, IO/s: 483

radosgw:--rgw_thread_pool_size=512
hsbench: 2021/09/08 13:14:12 Loop: 0, Int: TOTAL, Dur(s): 15.7, Mode: GET, Ops: 5000, MB/s: 1271.79, IO/s: 318

radosgw:--rgw_thread_pool_size=1024
hsbench: 2021/09/08 13:14:53 Loop: 0, Int: TOTAL, Dur(s): 16.9, Mode: GET, Ops: 5000, MB/s: 1182.22, IO/s: 296

radosgw:--rgw_thread_pool_size=2048
hsbench: 2021/09/08 13:16:56 Loop: 0, Int: TOTAL, Dur(s): 17.0, Mode: GET, Ops: 5000, MB/s: 1178.23, IO/s: 295
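
One suspected mechanism for this thread-count sensitivity is contention on the io_context's shared timer queue: every timer schedule/cancel takes an internal reactor lock, so more worker threads can mean more contention per timer operation. A small standalone micro-benchmark sketch of that pattern (an assumption about the mechanism, not a measurement of radosgw itself):

// Micro-benchmark sketch: many threads arming and canceling timers on one
// shared io_context, to illustrate contention on its internal timer queue.
// Standalone illustration of a suspected mechanism, not radosgw code.
#include <boost/asio.hpp>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

namespace net = boost::asio;

int main()
{
  constexpr int kThreads = 8;          // compare e.g. 8 vs 512 vs 1024
  constexpr int kOpsPerThread = 20000;

  net::io_context ioc;

  const auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) {
    threads.emplace_back([&ioc] {
      net::steady_timer timer(ioc);
      for (int i = 0; i < kOpsPerThread; ++i) {
        // Same pattern as one timed stream operation: re-arm, enqueue, cancel.
        // Each of these goes through the reactor's shared timer-queue lock.
        timer.expires_after(std::chrono::seconds(65));
        timer.async_wait([](boost::system::error_code) {});
        timer.cancel();
      }
    });
  }
  for (auto& t : threads) t.join();
  const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start);

  ioc.run();  // drain the canceled wait handlers
  std::printf("%d threads x %d timer ops took %lld ms\n", kThreads,
              kOpsPerThread, static_cast<long long>(elapsed.count()));
}

Rerunning the sketch with kThreads raised toward 512 or 1024 would indicate whether lock contention alone reproduces the slope seen in the results above.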

Actions #6

Updated by Mark Kogan over 2 years ago

  • Assignee changed from Greg Farnum to Mark Kogan
Actions #7

Updated by Mark Kogan over 2 years ago

Adding perf and gdbpmp analysis showing the CPU overhead of the boost net::steady_timer when the `request_timeout_ms` is not set to 0.

in 2021-09-12_19-44.png (attached high res)
the left side is `request_timeout_ms=0` and the right side is `request_timeout_ms=65000`
the `AsioFrontend` thread consumes ~9.5% CPU on the left vs. ~23.5% on the right

filtering `/steady` in the perf data files
perf-timeout_0.data.xz : https://drive.google.com/file/d/1rmLNfL_KlDnUbSx9pOc8hN1rdChytKwU/view?usp=sharing
perf-timeout_65.data.xz : https://drive.google.com/file/d/1w8f3zr6ysTfhex_2xXJyYSl5Qw1eTpjP/view?usp=sharing

shows:

sudo perf report -i ./perf-timeout_0.data
Samples: 149K of event 'cycles', Event count (approx.): 107377669142, DSO: libradosgw.so.2.0.0
  Children      Self  Command  Symbol
     0.08%     0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::max                                  ◆
     0.05%     0.01%  radosgw  [.] std::chrono::operator!=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<▒
     0.05%     0.01%  radosgw  [.] std::chrono::operator==<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<▒
     0.04%     0.01%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::expiry▒
     0.03%     0.02%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo▒
     0.02%     0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point                           ▒
     0.02%     0.02%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch                     ▒
     0.01%     0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::max@plt                              ▒
     0.01%     0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point@plt                       ▒
     0.01%     0.01%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::▒
     0.01%     0.00%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo▒
     0.00%     0.00%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo▒
     0.00%     0.00%  radosgw  [.] std::chrono::operator==<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<▒
     0.00%     0.00%  radosgw  [.] std::chrono::operator!=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<▒
     0.00%     0.00%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch@plt                 ▒
     0.00%     0.00%  radosgw  [.] std::chrono::_V2::steady_clock::now@plt                                                                                                                   ▒
     0.00%     0.00%  radosgw  [.] std::chrono::operator<=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<▒
     0.00%     0.00%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::expiry▒
     0.00%     0.00%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo▒
     0.00%     0.00%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::▒
     0.00%     0.00%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::basic_▒
     0.00%     0.00%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo▒
     0.00%     0.00%  radosgw  [.] boost::asio::use_service<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::▒
     0.00%     0.00%  radosgw  [.] boost::asio::detail::service_registry::use_service<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::s▒
     0.00%     0.00%  radosgw  [.] boost::asio::detail::service_registry::create<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady▒
     0.00%     0.00%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::▒

sudo perf report -i ./perf-timeout_65.data
Samples: 521K of event 'cycles', Event count (approx.): 161945408854, DSO: libradosgw.so.2.0.0                                                                                               
  Children      Self  Command  Symbol                                                                                                                                                        
+    2.24%     0.01%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::async_
+    2.22%     0.01%  radosgw  [.] boost::asio::async_initiate<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::timeout_handler<
+    2.22%     0.00%  radosgw  [.] boost::asio::async_result<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::timeout_handler<bo
+    2.18%     0.01%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::initia
+    2.15%     0.02%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::
+    1.29%     0.01%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::cancel
+    1.28%     0.01%  radosgw  [.] boost::asio::detail::epoll_reactor::schedule_timer<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::c
+    1.26%     0.04%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::
+    1.15%     0.01%  radosgw  [.] boost::asio::detail::epoll_reactor::cancel_timer<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chr
     0.40%     0.02%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady
     0.27%     0.01%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady
     0.22%     0.03%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady
     0.20%     0.02%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady
     0.11%     0.02%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:
     0.10%     0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::max                                  
     0.10%     0.01%  radosgw  [.] boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> >::subtract              
     0.08%     0.01%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady
     0.08%     0.06%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo
     0.08%     0.02%  radosgw  [.] std::chrono::operator< <std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<
     0.08%     0.02%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::expiry
     0.07%     0.01%  radosgw  [.] std::chrono::operator!=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<
     0.07%     0.07%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch                     
     0.06%     0.02%  radosgw  [.] std::chrono::operator==<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<
     0.06%     0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:
     0.06%     0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:
     0.05%     0.01%  radosgw  [.] boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> >::less_than             
     0.05%     0.02%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point                           
     0.05%     0.00%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::expire
     0.04%     0.04%  radosgw  [.] __gnu_cxx::__normal_iterator<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_tr
     0.04%     0.01%  radosgw  [.] std::chrono::operator>=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<
     0.04%     0.03%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady
     0.04%     0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:
     0.04%     0.00%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::
     0.04%     0.03%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:
     0.04%     0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point                           
     0.03%     0.01%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady
     0.03%     0.02%  radosgw  [.] __gnu_cxx::new_allocator<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits
     0.03%     0.01%  radosgw  [.] std::allocator_traits<std::allocator<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio:
     0.03%     0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:
     0.03%     0.02%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:
     0.03%     0.02%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point@plt                       
     0.02%     0.02%  radosgw  [.] __gnu_cxx::operator==<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<st
     0.02%     0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:
     0.02%     0.01%  radosgw  [.] std::chrono::operator-<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1
     0.02%     0.00%  radosgw  [.] boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> >::add                   
     0.02%     0.02%  radosgw  [.] std::forward<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono
     0.02%     0.02%  radosgw  [.] std::chrono::_V2::steady_clock::now@plt                                                                                                                   
     0.01%     0.01%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo
     0.01%     0.01%  radosgw  [.] boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> >::to_posix_duration  
...  

Updated by Mark Kogan over 2 years ago

gdbpmp stack of the `AsioFrontend` thread

the left side is with `request_timeout_ms=0` and the right side is `request_timeout_ms=65000`
(call stack did not fit into a single screen page)

2021-09-12_17-21.png

(after <PgDn>) 2021-09-12_17-31.png

(* in this test boost::beast::tcp_stream was replaced with boost::beast::flat_stream; it did not affect the timer CPU usage issue)

Updated by Mark Kogan over 2 years ago

comparing flamegraphs, zoomed in on the frontend,
where request_timeout_ms=65000 is on the left and request_timeout_ms=0 is on the right


(attached 2021-09-14_18-28.png)

the red square highlights the overhead of enabling the timeout

zooming into the left tower in the red square: Screenshot_20210914_183148.png

zooming into the right tower in the red square: Screenshot_20210914_183339.png

also, at the top flamegraph level, note that pthread_cond_wait() consumes significantly more CPU when request_timeout_ms=65000

2021-09-14_18-55.png

Actions #11

Updated by Mark Kogan over 2 years ago

as discussed in bug scrub, a possibly related Boost GitHub issue:

https://github.com/boostorg/beast/issues/2200 -- Boost Beast http occupy 100% cpu while sending big file #2200

Actions #12

Updated by Mark Kogan over 2 years ago

adding details about the above flame graph generation methodology:


hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m cxip -bp bp01

sudo pkill -9 radosgw

sudo perf record -e "cycles:u" -g -- ./bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=65000" --debug_ms=0 --debug_rgw=1 --rgw_thread_pool_size=1024 -f

hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01

sudo perf script report flamegraph
mv -v ./flamegraph.html ./flamegraph-timeout_65.html

sudo perf record -e "cycles:u" -g -- ./bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=0" --debug_ms=0 --debug_rgw=1 --rgw_thread_pool_size=1024 -f

hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01

sudo perf script report flamegraph
mv -v ./flamegraph.html ./flamegraph-timeout_0.html

firefox ./flamegraph-timeout_65.html ./flamegraph-timeout_0.html
Actions #13

Updated by Casey Bodley over 2 years ago

some suggestions:

  • use ceph::coarse_mono_clock instead of std::chrono::steady_clock for cheaper sampling
    - boost::beast::tcp_stream uses std::chrono::steady_clock directly, so we'd need to replace boost::beast::tcp_stream with manual timeouts (boost::asio::basic_waitable_timer)
  • use explicit executor types instead of the type-erased boost::asio::executor
    - applies to spawn::yield_context, boost::asio::ip::tcp::socket, boost::beast::tcp_stream (or boost::asio::basic_waitable_timer)
    - there's a heap allocation for this type erasure, but i think this only happens once per constructor (so once per connection per spawn/socket/stream/timer)
    - every operation on these type-erased executors incurs some extra overhead from virtual function calls. not sure whether this overhead is measurable
  • investigate io patterns. if requests are generating several small reads/writes, each call to StreamIO::write_data() or recv_body() will reset the timer
    - resetting the timer means a) canceling the previous timeout handler, and b) allocating a new handler in the next call to timer.async_wait()
    - a custom handler allocator may help here, but it needs to be thread-safe (even if the timer runs on a strand executor)
    - maybe we can manage timeouts at a higher level, instead of resetting the timer for each stream i/o operation?
i'll prepare a PR with the following changes to measure separately:
  • use explicit executor types
  • replace boost::beast::tcp_stream with boost::asio::basic_waitable_timer<ceph::coarse_mono_clock>
  • add custom allocator for timer handlers, using atomics for thread safety
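
To make the last bullet concrete, a rough sketch of a thread-safe, atomics-based recycling allocator attached to the timer's wait handlers via asio's associated-allocator mechanism (illustrative only; the names and structure here are hypothetical and the actual changes are in PR #43761):

// Sketch of a thread-safe recycling allocator for timer wait handlers, hooked
// in through asio's associated-allocator mechanism (a nested allocator_type
// plus get_allocator() on the handler). Illustrative only, may differ from
// the real PR.
#include <boost/asio.hpp>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <new>

namespace net = boost::asio;

// One fixed slot, reused for the small, short-lived timeout-handler state.
// An atomic flag (instead of a mutex) keeps concurrent claims safe.
class handler_memory {
 public:
  void* allocate(std::size_t n) {
    bool expected = false;
    if (n <= sizeof(storage_) && in_use_.compare_exchange_strong(expected, true))
      return storage_;
    return ::operator new(n);  // anything that doesn't fit falls back to the heap
  }
  void deallocate(void* p) {
    if (p == storage_)
      in_use_.store(false, std::memory_order_release);
    else
      ::operator delete(p);
  }
 private:
  alignas(std::max_align_t) unsigned char storage_[256];
  std::atomic<bool> in_use_{false};
};

template <typename T>
struct handler_allocator {
  using value_type = T;
  explicit handler_allocator(handler_memory& mem) : mem_(&mem) {}
  template <typename U>
  handler_allocator(const handler_allocator<U>& other) : mem_(other.mem_) {}
  T* allocate(std::size_t n) { return static_cast<T*>(mem_->allocate(n * sizeof(T))); }
  void deallocate(T* p, std::size_t) { mem_->deallocate(p); }
  friend bool operator==(const handler_allocator& a, const handler_allocator& b) {
    return a.mem_ == b.mem_;
  }
  friend bool operator!=(const handler_allocator& a, const handler_allocator& b) {
    return a.mem_ != b.mem_;
  }
  handler_memory* mem_;
};

// Wrapping the completion handler exposes the allocator to asio, so each
// async_wait allocates its internal operation state from handler_memory
// instead of the global heap.
template <typename Handler>
struct timeout_handler {
  using allocator_type = handler_allocator<Handler>;
  allocator_type get_allocator() const noexcept { return allocator_type(*mem); }
  void operator()(boost::system::error_code ec) { h(ec); }
  handler_memory* mem;
  Handler h;
};

int main()
{
  net::io_context ioc;
  net::steady_timer timer(ioc);
  handler_memory mem;

  timer.expires_after(std::chrono::milliseconds(1));
  timer.async_wait(timeout_handler<void (*)(boost::system::error_code)>{
      &mem, [](boost::system::error_code) { std::puts("timeout handler ran"); }});
  ioc.run();
}

The single fixed slot keeps the common path lock-free and allocation-free, while remaining correct if two timers race for the slot: the loser simply takes the heap path.
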
Actions #14

Updated by Casey Bodley over 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 43761
Actions #15

Updated by Mark Kogan over 2 years ago

Casey Bodley wrote:

some suggestions:

  • use ceph::coarse_mono_clock instead of std::chrono::steady_clock for cheaper sampling
    - boost::beast::tcp_stream uses std::chrono::steady_clock directly, so we'd need to replace boost::beast::tcp_stream with manual timeouts (boost::asio::basic_waitable_timer)
  • use explicit executor types instead of the type-erased boost::asio::executor
    - applies to spawn::yield_context, boost::asio::ip::tcp::socket, boost::beast::tcp_stream (or boost::asio::basic_waitable_timer)
    - there's a heap allocation for this type erasure, but i think this only happens once per constructor (so once per connection per spawn/socket/stream/timer)
    - every operation on these type-erased executors incurs some extra overhead from virtual function calls. not sure whether this overhead is measurable
  • investigate io patterns. if requests are generating several small reads/writes, each call to StreamIO::write_data() or recv_body() will reset the timer
    - resetting the timer means a) canceling the previous timeout handler, and b) allocating a new handler in the next call to timer.async_wait()
    - a custom handler allocator may help here, but it needs to be thread-safe (even if the timer runs on a strand executor)
    - maybe we can manage timeouts at a higher level, instead of resetting the timer for each stream i/o operation?
i'll prepare a PR with the following changes to measure separately:
  • use explicit executor types
  • replace boost::beast::tcp_stream with boost::asio::basic_waitable_timer<ceph::coarse_mono_clock>
  • add custom allocator for timer handlers, using atomics for thread safety

benchmarked performance of the `wip-beast-timeout-alloc` branch (without the 'rgw/beast: timer pool logs with DoutPrefixProvider' commit)
the results are very good; table of IO/s per branch and request_timeout_ms:

+--------------------------+-------------------------+--------+
|            IO/s          | wip-beast-timeout-alloc | master |
+--------------------------+-------------------------+--------+
| request_timeout_ms=65000 | 415                     | 375    |
+--------------------------+-------------------------+--------+
| request_timeout_ms=0     | 421                     | 412    |
+--------------------------+-------------------------+--------+

hsbench results in detail:

# benchmarking on o07 - rgw_thread_pool_size=2048 request_timeout_ms=65000 + all commits except "rgw/beast: timer pool logs with DoutPrefixProvider":

numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp b01b |& tee hsbench.log | stdbuf -oL -eL colrm 128 | ccze -Aonolookups
2021/11/03 11:37:30 Loop: 0, Int: TOTAL, Dur(s): 8.4, Mode: GET, Ops: 5000, MB/s: 2383.37, IO/s: 415 ~

# benchmarking on o07 - rgw_thread_pool_size=2048 request_timeout_ms=0 + all commits except "rgw/beast: timer pool logs with DoutPrefixProvider":

numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp b01b |& tee hsbench.log | stdbuf -oL -eL colrm 128 | ccze -Aonolookups
2021/11/03 11:37:30 Loop: 0, Int: TOTAL, Dur(s): 8.4, Mode: GET, Ops: 5000, MB/s: 2383.37, IO/s: 421 ~

for reference master baseline:

# benchmarking on o07 - master : rgw_thread_pool_size=2048 request_timeout_ms=65000

numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp b01b |& tee hsbench.log | stdbuf -oL -eL colrm 128 | ccze -Aonolookups
2021/11/03 11:37:30 Loop: 0, Int: TOTAL, Dur(s): 8.4, Mode: GET, Ops: 5000, MB/s: 2383.37, IO/s: 375 ~

# benchmarking on o07 - master : rgw_thread_pool_size=2048 request_timeout_ms=0

numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp b01b |& tee hsbench.log | stdbuf -oL -eL colrm 128 | ccze -Aonolookups
2021/11/03 11:37:30 Loop: 0, Int: TOTAL, Dur(s): 8.4, Mode: GET, Ops: 5000, MB/s: 2383.37, IO/s: 412 ~

Actions #16

Updated by Casey Bodley over 2 years ago

thank you for testing Mark! i'm so glad this helps

if I could ask one last favor, could you try another run without the "rgw/beast: add custom allocator for timeout handlers" commit? it adds some extra complexity, and i want to make sure that complexity is really worth it

Actions #17

Updated by Casey Bodley over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #18

Updated by Backport Bot over 2 years ago

  • Copied to Backport #53271: pacific: beast frontend performance regressions added
Actions #19

Updated by Backport Bot over 2 years ago

  • Copied to Backport #53272: octopus: beast frontend performance regressions added
Actions #20

Updated by Loïc Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
