Bug #52333
beast frontend performance regressions
Status: Closed
Description
Mark Nelson did a bisect on rgw performance vs. cpu usage:
https://docs.google.com/spreadsheets/d/18WqVGdO_JxVWyboktK6N6raqSJZWBEyN2F4QOg29cqA
and identified two PRs related to the beast frontend that showed significant regressions:
https://github.com/ceph/ceph/pull/31580 which changed the coroutine stack allocator to use mmap/mprotect (resolved stack overflows)
https://github.com/ceph/ceph/pull/35355 which added timeouts to close idle connections
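For context, the guard-page stack technique from the first PR can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from PR #31580; the names (`GuardedStack`, `allocate_stack`, `free_stack`) are invented. The idea is that mmap-ing the stack and marking its lowest page `PROT_NONE` turns a stack overflow into an immediate fault instead of silent corruption:

```cpp
// Sketch (assumed, not Ceph code): a coroutine stack backed by mmap,
// with the bottom page protected so overflows fault deterministically.
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

struct GuardedStack {
  void* base = nullptr;   // mmap'd region; guard page is the lowest page
  std::size_t size = 0;   // total mapped size, including the guard page
};

GuardedStack allocate_stack(std::size_t usable) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  GuardedStack s;
  s.size = usable + page;  // one extra page reserved as the guard
  s.base = mmap(nullptr, s.size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (s.base == MAP_FAILED) { s.base = nullptr; return s; }
  // Stacks grow downward, so protecting the lowest page catches overflow.
  mprotect(s.base, page, PROT_NONE);
  return s;
}

void free_stack(const GuardedStack& s) {
  if (s.base) munmap(s.base, s.size);
}
```

The trade-off observed in this ticket is that mmap/mprotect per stack is more expensive than plain heap allocation, which is one of the two regressions Mark Nelson's bisect surfaced.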
Files
Updated by Mark Kogan over 2 years ago
Update regarding testing of https://github.com/ceph/ceph/pull/35355 (beast timeout).
To reproduce, the following test methodology was used in a vstart environment:
- GETs of large objects (4MB in this case)
- tune the load generator so that the performance ceiling is dictated by compute rather than storage (check iostat %util)
('hsbench -t 8 -b 8 ...' in this case on officinalis machine o07)
performance test results:
git reset --hard 2e9db1c9
- one commit before 5d5f9a0d ("rgw: Add request timeout beast"): 2e9db1c9 Merge pull request #37658 from neha-ojha/wip-crimson-suite
numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01 |& ccze
2021/08/31 15:13:18 Loop: 0, Int: TOTAL, Dur(s): 10.2, Mode: GET, Ops: 5000, MB/s: 1955.94, IO/s: 489
- 489 IO/s is the nominal reference performance before 5d5f9a0d ("rgw: Add request timeout beast")
git reset --hard 5d5f9a0d
- with commit 5d5f9a0d ("rgw: Add request timeout beast")
numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01 |& ccze
2021/08/31 15:07:55 Loop: 0, Int: TOTAL, Dur(s): 13.4, Mode: GET, Ops: 5000, MB/s: 1492.20, IO/s: 373
- 373 IO/s is the performance after 5d5f9a0d ("rgw: Add request timeout beast")
- 373/489 = 0.76, i.e. a ~24% performance drop
- with commit 5d5f9a0d ("rgw: Add request timeout beast") AND radosgw --rgw_frontends="beast port=8000 request_timeout_ms=0" ...
numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01 |& ccze
2021/08/31 15:38:10 Loop: 0, Int: TOTAL, Dur(s): 10.1, Mode: GET, Ops: 5000, MB/s: 1982.43, IO/s: 496
- current master & --rgw_frontends="beast port=8000 request_timeout_ms=0"
git branch -vv
- master e675ea216a1 [origin/master] Merge pull request #42985 from cyx1231st/wip-seastore-lba-metrics
numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01 |& ccze
2021/08/31 16:33:01 Loop: 0, Int: TOTAL, Dur(s): 9.8, Mode: GET, Ops: 5000, MB/s: 2031.20, IO/s: 508
with request_timeout_ms=0 the performance is restored to the level before 5d5f9a0d ("rgw: Add request timeout beast")
next steps: investigate the exact cause of the performance drop in the 5d5f9a0d commit code, and whether to change the default to request_timeout_ms=0
Updated by Casey Bodley over 2 years ago
- Status changed from New to In Progress
- Assignee set to Mark Kogan
Updated by Mark Kogan over 2 years ago
- Assignee changed from Mark Kogan to Greg Farnum
- Source deleted (Development)
- Severity deleted (3 - minor)
adding...
found that the performance drop is incurred specifically by the
timeout.expires_after(request_timeout);
at: https://github.com/ceph/ceph/blob/27faeaf129b3fa1c87dfdc957d8692efd2010d5e/src/rgw/rgw_asio_frontend.cc#L78
and: https://github.com/ceph/ceph/blob/27faeaf129b3fa1c87dfdc957d8692efd2010d5e/src/rgw/rgw_asio_frontend.cc#L196
BUT it occurs only when built with `-DALLOCATOR=libc` and does NOT occur when built with `-DALLOCATOR=tcmalloc`
summary table of the performance results:
libc + request_timeout_ms=65000 -> IO/s: 375
libc + request_timeout_ms=0 -> IO/s: 477
tcmalloc + request_timeout_ms=65000 -> IO/s: 622
tcmalloc + request_timeout_ms=0 -> IO/s: 637
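A plausible reading of the libc/tcmalloc split above (an assumption, not verified against the asio internals in this comment): each `expires_after()`/`async_wait()` cycle allocates and then frees a small completion-handler object, and glibc malloc makes that per-operation allocation noticeably more expensive than tcmalloc's thread-cached fast path. The toy below, with invented names, contrasts per-operation heap allocation against constructing the handler in a reusable preallocated slot:

```cpp
// Hypothetical illustration of why the timer path is allocator-sensitive.
// TimeoutHandler stands in for the small handler object a timer wait needs.
#include <new>

struct TimeoutHandler {
  char payload[64];  // roughly handler-sized state
};

// Per-operation allocation: the default behavior without a custom
// handler allocator -- one new/delete per armed timeout.
inline void arm_with_heap_alloc() {
  auto* h = new TimeoutHandler{};
  delete h;  // handler destroyed when the wait completes or is canceled
}

// Slot reuse: construct the handler in a preallocated buffer instead.
// Returns the address used so a caller can verify the slot was recycled.
inline const void* arm_with_reused_slot(void* slot) {
  auto* h = new (slot) TimeoutHandler{};
  const void* where = h;
  h->~TimeoutHandler();  // no free(): the storage is reused next time
  return where;
}
```

Under this hypothesis the slot-reuse pattern removes the allocator from the hot path entirely, which would explain why tcmalloc narrows (but does not eliminate) the gap.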
benchmark run details:
env CC="/usr/bin/gcc" CXX="/usr/bin/g++" LDFLAGS="-Wl,--no-as-needed" \
  ./do_cmake.sh -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DDIAGNOSTICS_COLOR=always -DBOOST_J=$(nproc) -DALLOCATOR=libc \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SYSTEM_BOOST=OFF
time ionice nice cmake --build ./build -- radosgw -j $(nproc --ignore=1)

bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=65000" --debug_ms=0 --debug_rgw=1 -f
numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
2021/09/02 16:25:52 Loop: 0, Int: TOTAL, Dur(s): 13.3, Mode: GET, Ops: 5000, MB/s: 1501.59, IO/s: 375

bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=0" --debug_ms=0 --debug_rgw=1 -f
numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
2021/09/02 16:28:02 Loop: 0, Int: TOTAL, Dur(s): 10.5, Mode: GET, Ops: 5000, MB/s: 1906.18, IO/s: 477

# 375/477 = 21%

env CC="/usr/bin/gcc" CXX="/usr/bin/g++" LDFLAGS="-Wl,--no-as-needed" \
  ./do_cmake.sh -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DDIAGNOSTICS_COLOR=always -DBOOST_J=$(nproc) -DALLOCATOR=tcmalloc \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SYSTEM_BOOST=OFF
time ionice nice cmake --build ./build -- radosgw -j $(nproc --ignore=1)

bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=65000" --debug_ms=0 --debug_rgw=1 -f
numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
2021/09/02 16:16:48 Loop: 0, Int: TOTAL, Dur(s): 8.0, Mode: GET, Ops: 5000, MB/s: 2489.98, IO/s: 622

bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=0" --debug_ms=0 --debug_rgw=1 -f
numactl -N 1 -m 1 hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
2021/09/02 16:20:29 Loop: 0, Int: TOTAL, Dur(s): 7.8, Mode: GET, Ops: 5000, MB/s: 2548.63, IO/s: 637
Updated by Greg Farnum over 2 years ago
- Assignee changed from Greg Farnum to Mark Kogan
I think some stray keystrokes erroneously assigned this to me?
Updated by Mark Kogan over 2 years ago
- Assignee changed from Mark Kogan to Greg Farnum
adding more info following the continuing performance investigation:
the CPU load overhead and performance penalty incurred by enabling the `request_timeout_ms=65000`
varies depending on the number of threads (`--rgw_thread_pool_size`)
while the load is kept constant, 8 threads over 8 buckets (-t 8 -b 8):
hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
and radosgw started with `request_timeout_ms=65000` and varying `--rgw_thread_pool_size=X`
./bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=65000" --debug_ms=0 --debug_rgw=1 --rgw_thread_pool_size=X -f
the results are as follows (note the `IO/s:`):
radosgw: --rgw_thread_pool_size=8
hsbench: 2021/09/08 13:13:29 Loop: 0, Int: TOTAL, Dur(s): 10.3, Mode: GET, Ops: 5000, MB/s: 1932.80, IO/s: 483
radosgw: --rgw_thread_pool_size=512
hsbench: 2021/09/08 13:14:12 Loop: 0, Int: TOTAL, Dur(s): 15.7, Mode: GET, Ops: 5000, MB/s: 1271.79, IO/s: 318
radosgw: --rgw_thread_pool_size=1024
hsbench: 2021/09/08 13:14:53 Loop: 0, Int: TOTAL, Dur(s): 16.9, Mode: GET, Ops: 5000, MB/s: 1182.22, IO/s: 296
radosgw: --rgw_thread_pool_size=2048
hsbench: 2021/09/08 13:16:56 Loop: 0, Int: TOTAL, Dur(s): 17.0, Mode: GET, Ops: 5000, MB/s: 1178.23, IO/s: 295
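One possible contributor to this thread-count sensitivity (speculation consistent with the pthread_cond_wait observation in a later comment, not a confirmed diagnosis): asio's reactor guards its timer queue with shared locking, so every per-request timer arm/cancel from a larger thread pool contends on the same state. A minimal standalone sketch of that serialization effect, with invented names:

```cpp
// Illustration only: many worker threads funneling "timer operations"
// through one shared lock, the way a reactor-wide timer queue would.
#include <mutex>
#include <thread>
#include <vector>

long contended_count(int nthreads, int iters) {
  std::mutex m;   // stand-in for the shared timer-queue lock
  long total = 0;
  std::vector<std::thread> ts;
  for (int t = 0; t < nthreads; ++t)
    ts.emplace_back([&] {
      for (int i = 0; i < iters; ++i) {
        std::lock_guard<std::mutex> g(m);  // every "timer op" serializes here
        ++total;
      }
    });
  for (auto& t : ts) t.join();
  return total;
}
```

The work stays correct regardless of thread count, but wall-clock time grows with contention, which would match throughput falling as `--rgw_thread_pool_size` rises while the offered load stays fixed.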
Updated by Mark Kogan over 2 years ago
- Assignee changed from Greg Farnum to Mark Kogan
Updated by Mark Kogan over 2 years ago
- File 2021-09-12_19-44.png 2021-09-12_19-44.png added
Adding perf and gdbpmp analysis showing the CPU overhead of the boost net::steady_timer when the `request_timeout_ms` is not set to 0.
in 2021-09-12_19-44.png (attached high res)
the left side is `request_timeout_ms=0` and the right side is `request_timeout_ms=65000`
the `AsioFrontend` thread consumes ~9.5% on the left vs. ~23.5% on the right
filtering `/steady` in the perf data files
perf-timeout_0.data.xz : https://drive.google.com/file/d/1rmLNfL_KlDnUbSx9pOc8hN1rdChytKwU/view?usp=sharing
perf-timeout_65.data.xz : https://drive.google.com/file/d/1w8f3zr6ysTfhex_2xXJyYSl5Qw1eTpjP/view?usp=sharing
shows:
sudo perf report -i ./perf-timeout_0.data
Samples: 149K of event 'cycles', Event count (approx.): 107377669142, DSO: libradosgw.so.2.0.0

Children  Self   Command  Symbol
0.08%  0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::max
0.05%  0.01%  radosgw  [.] std::chrono::operator!=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.05%  0.01%  radosgw  [.] std::chrono::operator==<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.04%  0.01%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::expiry…
0.03%  0.02%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo…
0.02%  0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point
0.02%  0.02%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch
0.01%  0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::max@plt
0.01%  0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point@plt
0.01%  0.01%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::…
0.01%  0.00%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo…
0.00%  0.00%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo…
0.00%  0.00%  radosgw  [.] std::chrono::operator==<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.00%  0.00%  radosgw  [.] std::chrono::operator!=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.00%  0.00%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch@plt
0.00%  0.00%  radosgw  [.] std::chrono::_V2::steady_clock::now@plt
0.00%  0.00%  radosgw  [.] std::chrono::operator<=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.00%  0.00%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::expiry…
0.00%  0.00%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo…
0.00%  0.00%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::…
0.00%  0.00%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::basic_…
0.00%  0.00%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo…
0.00%  0.00%  radosgw  [.] boost::asio::use_service<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::s…
0.00%  0.00%  radosgw  [.] boost::asio::detail::service_registry::use_service<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::s…
0.00%  0.00%  radosgw  [.] boost::asio::detail::service_registry::create<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady…
0.00%  0.00%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::…

sudo perf report -i ./perf-timeout_65.data
Samples: 521K of event 'cycles', Event count (approx.): 161945408854, DSO: libradosgw.so.2.0.0

Children  Self   Command  Symbol
+ 2.24%  0.01%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::async_…
+ 2.22%  0.01%  radosgw  [.] boost::asio::async_initiate<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::timeout_handler<…
+ 2.22%  0.00%  radosgw  [.] boost::asio::async_result<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::timeout_handler<bo…
+ 2.18%  0.01%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::initia…
+ 2.15%  0.02%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::…
+ 1.29%  0.01%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::cancel
+ 1.28%  0.01%  radosgw  [.] boost::asio::detail::epoll_reactor::schedule_timer<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::c…
+ 1.26%  0.04%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::…
+ 1.15%  0.01%  radosgw  [.] boost::asio::detail::epoll_reactor::cancel_timer<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chr…
0.40%  0.02%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady…
0.27%  0.01%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady…
0.22%  0.03%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady…
0.20%  0.02%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady…
0.11%  0.02%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:…
0.10%  0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::max
0.10%  0.01%  radosgw  [.] boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> >::subtract
0.08%  0.01%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady…
0.08%  0.06%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo…
0.08%  0.02%  radosgw  [.] std::chrono::operator< <std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.08%  0.02%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::expiry
0.07%  0.01%  radosgw  [.] std::chrono::operator!=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.07%  0.07%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch
0.06%  0.02%  radosgw  [.] std::chrono::operator==<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.06%  0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:…
0.06%  0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:…
0.05%  0.01%  radosgw  [.] boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> >::less_than
0.05%  0.02%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point
0.05%  0.00%  radosgw  [.] boost::asio::basic_waitable_timer<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock>, boost::asio::executor>::expire…
0.04%  0.04%  radosgw  [.] __gnu_cxx::__normal_iterator<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_tr…
0.04%  0.01%  radosgw  [.] std::chrono::operator>=<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<…
0.04%  0.03%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady…
0.04%  0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:…
0.04%  0.00%  radosgw  [.] boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::…
0.04%  0.03%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:…
0.04%  0.01%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point
0.03%  0.01%  radosgw  [.] boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady…
0.03%  0.02%  radosgw  [.] __gnu_cxx::new_allocator<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits…
0.03%  0.01%  radosgw  [.] std::allocator_traits<std::allocator<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio:…
0.03%  0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:…
0.03%  0.02%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:…
0.03%  0.02%  radosgw  [.] std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_point@plt
0.02%  0.02%  radosgw  [.] __gnu_cxx::operator==<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<st…
0.02%  0.01%  radosgw  [.] std::vector<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono:…
0.02%  0.01%  radosgw  [.] std::chrono::operator-<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1…
0.02%  0.00%  radosgw  [.] boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> >::add
0.02%  0.02%  radosgw  [.] std::forward<boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono…
0.02%  0.02%  radosgw  [.] std::chrono::_V2::steady_clock::now@plt
0.01%  0.01%  radosgw  [.] boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, bo…
0.01%  0.01%  radosgw  [.] boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> >::to_posix_duration
...
Updated by Mark Kogan over 2 years ago
- File 2021-09-12_17-21.png 2021-09-12_17-21.png added
- File 2021-09-12_17-31.png 2021-09-12_17-31.png added
gdbpmp stack of the `AsioFrontend` thread
the left side is with `request_timeout_ms=0` and the right side is `request_timeout_ms=65000`
(call stack did not fit into a single screen page)
2021-09-12_17-21.png
<PgDn> 2021-09-12_17-31.png
(* in this test the boost::beast::tcp_stream was replaced with boost::beast::flat_stream, it did not affect the timer cpu usage issue)
Updated by Mark Kogan over 2 years ago
- File 2021-09-14_18-28.png 2021-09-14_18-28.png added
- File Screenshot_20210914_183339.png Screenshot_20210914_183339.png added
- File Screenshot_20210914_183148.png Screenshot_20210914_183148.png added
comparing flamegraphs, zooming in on the frontend,
where request_timeout_ms=65000 is on the left and request_timeout_ms=0 is on the right
(attached 2021-09-14_18-28.png)
the red square highlights the overhead of enabling the timeout
zooming into the left tower in the red square: Screenshot_20210914_183148.png
zooming into the right tower in the red square: Screenshot_20210914_183339.png
also, at the top flamegraph level, note that pthread_cond_wait() consumes significantly more CPU when request_timeout_ms=65000
2021-09-14_18-55.png
Updated by Mark Kogan over 2 years ago
- File 2021-09-14_18-55.png 2021-09-14_18-55.png added
Updated by Mark Kogan over 2 years ago
as discussed in bug scrub, a possibly related boost issue:
https://github.com/boostorg/beast/issues/2200 -- Boost Beast http occupy 100% cpu while sending big file #2200
Updated by Mark Kogan over 2 years ago
adding details about the above flame graph generation methodology:
hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m cxip -bp bp01
sudo pkill -9 radosgw

sudo perf record -e "cycles:u" -g -- ./bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=65000" --debug_ms=0 --debug_rgw=1 --rgw_thread_pool_size=1024 -f
hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
sudo perf script report flamegraph
mv -v ./flamegraph.html ./flamegraph-timeout_65.html

sudo perf record -e "cycles:u" -g -- ./bin/radosgw --nolockdep -c ./ceph.conf --log-file=./out/radosgw.8000.log --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid -n client.rgw.8000 --rgw_frontends="beast port=8000 request_timeout_ms=0" --debug_ms=0 --debug_rgw=1 --rgw_thread_pool_size=1024 -f
hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp bp01
sudo perf script report flamegraph
mv -v ./flamegraph.html ./flamegraph-timeout_0.html

firefox ./flamegraph-timeout_65.html ./flamegraph-timeout_0.html
Updated by Casey Bodley over 2 years ago
some suggestions:
- use ceph::coarse_mono_clock instead of std::chrono::steady_clock for cheaper sampling
- boost::beast::tcp_stream uses std::chrono::steady_clock directly, so we'd need to replace boost::beast::tcp_stream with manual timeouts (boost::asio::basic_waitable_timer)
- use explicit executor types instead of the type-erased boost::asio::executor
- applies to spawn::yield_context, boost::asio::tcp::socket, boost::beast::tcp_stream (or boost::asio::basic_waitable_timer)
- there's a heap allocation for this type erasure, but i think this only happens once per constructor (so once per connection per spawn/socket/stream/timer)
- every operation on these type-erased executors incurs some extra overhead from virtual function calls. not sure whether this overhead is measurable
- investigate io patterns. if requests are generating several small reads/writes, each call to StreamIO::write_data() or recv_body() will reset the timer
- resetting the timer means a) canceling the previous timeout handler, and b) allocating a new handler in the next call to timer.async_wait()
- a custom handler allocator may help here, but it needs to be thread-safe (even if the timer runs on a strand executor)
- maybe we can manage timeouts at a higher level, instead of resetting the timer for each stream i/o operation?
- use explicit executor types
- replace boost::beast::tcp_stream with boost::asio::basic_waitable_timer<ceph::coarse_mono_clock>
- add custom allocator for timer handlers, using atomics for thread safety
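That last suggestion might look roughly like the following one-slot recycling pool. This is a sketch under stated assumptions, not the code from the eventual PR; the class and member names are invented. An atomic flag makes claiming and releasing the slot thread-safe, as suggested above, while oversized or concurrent requests fall back to the plain heap:

```cpp
// Hypothetical per-connection handler pool: recycles one fixed slot for
// the recurring timer-wait handler instead of heap-allocating each time.
#include <atomic>
#include <cstddef>
#include <new>

class HandlerPool {
  static constexpr std::size_t slot_size = 128;  // assumed handler bound
  alignas(std::max_align_t) unsigned char slot_[slot_size];
  std::atomic<bool> in_use_{false};              // atomics for thread safety

 public:
  void* allocate(std::size_t n) {
    bool expected = false;
    // Fast path: claim the recycled slot if it is free and big enough.
    if (n <= slot_size && in_use_.compare_exchange_strong(expected, true))
      return slot_;
    return ::operator new(n);  // fallback: ordinary heap allocation
  }

  void deallocate(void* p) {
    if (p == slot_)
      in_use_.store(false);    // release the slot for the next timer op
    else
      ::operator delete(p);
  }
};
```

In asio this kind of pool would be wired in through the associated-allocator hook on the wait handler; the key property is that the steady-state arm/cancel cycle performs no malloc/free at all, which is exactly the cost that showed up only under the libc allocator earlier in this ticket.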
Updated by Casey Bodley over 2 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 43761
Updated by Mark Kogan over 2 years ago
Casey Bodley wrote:
some suggestions:
- use ceph::coarse_mono_clock instead of std::chrono::steady_clock for cheaper sampling
- boost::beast::tcp_stream uses std::chrono::steady_clock directly, so we'd need to replace boost::beast::tcp_stream with manual timeouts (boost::asio::basic_waitable_timer)
- use explicit executor types instead of the type-erased boost::asio::executor
- applies to spawn::yield_context, boost::asio::tcp::socket, boost::beast::tcp_stream (or boost::asio::basic_waitable_timer)
- there's a heap allocation for this type erasure, but i think this only happens once per constructor (so once per connection per spawn/socket/stream/timer)
- every operation on these type-erased executors incurs some extra overhead from virtual function calls. not sure whether this overhead is measurable

i'll prepare a PR with the following changes to measure separately:
- investigate io patterns. if requests are generating several small reads/writes, each call to StreamIO::write_data() or recv_body() will reset the timer
- resetting the timer means a) canceling the previous timeout handler, and b) allocating a new handler in the next call to timer.async_wait()
- a custom handler allocator may help here, but it needs to be thread-safe (even if the timer runs on a strand executor)
- maybe we can manage timeouts at a higher level, instead of resetting the timer for each stream i/o operation?
- use explicit executor types
- replace boost::beast::tcp_stream with boost::asio::basic_waitable_timer<ceph::coarse_mono_clock>
- add custom allocator for timer handlers, using atomics for thread safety
benchmarked performance of the `wip-beast-timeout-alloc` branch (without the 'rgw/beast: timer pool logs with DoutPrefixProvider' commit)
the results are very good; table of IO/s per branch and request_timeout_ms:
+--------------------------+-------------------------+--------+
| IO/s                     | wip-beast-timeout-alloc | master |
+--------------------------+-------------------------+--------+
| request_timeout_ms=65000 | 415                     | 375    |
+--------------------------+-------------------------+--------+
| request_timeout_ms=0     | 421                     | 412    |
+--------------------------+-------------------------+--------+
hsbench results in detail:
# benchmarking on o07 - rgw_thread_pool_size=2048 request_timeout_ms=65000 + all commits except "rgw/beast: timer pool logs with DoutPrefixProvider":
numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp b01b |& tee hsbench.log | stdbuf -oL -eL colrm 128 | ccze -Aonolookups
2021/11/03 11:37:30 Loop: 0, Int: TOTAL, Dur(s): 8.4, Mode: GET, Ops: 5000, MB/s: 2383.37, IO/s: 415

# benchmarking on o07 - rgw_thread_pool_size=2048 request_timeout_ms=0 + all commits except "rgw/beast: timer pool logs with DoutPrefixProvider":
numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp b01b |& tee hsbench.log | stdbuf -oL -eL colrm 128 | ccze -Aonolookups
2021/11/03 11:37:30 Loop: 0, Int: TOTAL, Dur(s): 8.4, Mode: GET, Ops: 5000, MB/s: 2383.37, IO/s: 421
for reference master baseline:
# benchmarking on o07 - master : rgw_thread_pool_size=2048 request_timeout_ms=65000
numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp b01b |& tee hsbench.log | stdbuf -oL -eL colrm 128 | ccze -Aonolookups
2021/11/03 11:37:30 Loop: 0, Int: TOTAL, Dur(s): 8.4, Mode: GET, Ops: 5000, MB/s: 2383.37, IO/s: 375

# benchmarking on o07 - master : rgw_thread_pool_size=2048 request_timeout_ms=0
numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4M -d -1 -t 8 -b 8 -n 5000 -m g -bp b01b |& tee hsbench.log | stdbuf -oL -eL colrm 128 | ccze -Aonolookups
2021/11/03 11:37:30 Loop: 0, Int: TOTAL, Dur(s): 8.4, Mode: GET, Ops: 5000, MB/s: 2383.37, IO/s: 412
Updated by Casey Bodley over 2 years ago
thank you for testing Mark! i'm so glad this helps
if I could ask one last favor, could you try another run without the "rgw/beast: add custom allocator for timeout handlers" commit? it adds some extra complexity, and i want to make sure that complexity is really worth it
Updated by Casey Bodley over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 2 years ago
- Copied to Backport #53271: pacific: beast frontend performance regressions added
Updated by Backport Bot over 2 years ago
- Copied to Backport #53272: octopus: beast frontend performance regressions added
Updated by Loïc Dachary over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".