Bug #65664
openCrash observed in boost::asio module related to stream.async_shutdown()
0%
Description
continuing from downstream BZ#2275284
call stack:
completing the missing callstack symbols using addr2line: "backtrace": [' "/lib64/libc.so.6(+0x54db0) [0x7fd314053db0]",' "/usr/bin/radosgw(+0x33b8ea) [0x55783b4ae8ea]",' 0x000000000033b8ea: boost::asio::detail::epoll_reactor::start_op(int, int, boost::asio::detail::epoll_reactor::descriptor_state*&, boost::asio::detail::reactor_op*, bool, bool) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/epoll_reactor.ipp:246:3 "/usr/bin/radosgw(+0x35ba27) [0x55783b4cea27]",' 0x000000000035ba27: boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >::operator()(boost::system::error_code, unsigned long, int) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/reactive_socket_service_base.hpp:419:13 "(boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1d2) [0x55783b4ef882]",' "/usr/bin/radosgw(+0x3807de) [0x55783b4f37de]",' 0x00000000003807de: boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>::operator()() at /usr/include/c++/11/bits/shared_ptr_base.h:1296:16 "/usr/bin/radosgw(+0x379910) [0x55783b4ec910]",' 0x0000000000379910: void boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul>::execute<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>&&) const at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.hpp:300:3 "(boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x6a6) [0x55783b4dda06]",' "/usr/bin/radosgw(+0xb8534e) [0x55783bcf834e]",' 0x0000000000b8534e: boost::asio::detail::thread_info_base::rethrow_pending_exception() at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/thread_info_base.hpp:228:5 (inlined by) boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:493:46 (inlined by) boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .constprop.0] [clone .isra.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:210:20 "/usr/bin/radosgw(+0x3cf04d) [0x55783b54204d]",' 0x00000000003cf04d: std::thread::_State_impl<std::thread::_Invoker<std::tuple<(anonymous namespace)::AsioFrontend::run()::{lambda()#2}> > >::_M_run() [clone .lto_priv.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/system/detail/error_code.hpp:305:13 "/lib64/libstdc++.so.6(+0xdb924) [0x7fd3143db924]",' "/lib64/libc.so.6(+0x9f802) [0x7fd31409e802]",' "/lib64/libc.so.6(+0x3f450) [0x7fd31403e450]"'
continuing from the last BZ comment:
(In reply to Mark Kogan from comment #15)
i think you have this part backwards. the call had previously been wrapped in a `if (!ec) {` block which means there was no error
errors here are common because of http keepalive. the server keeps trying to read more requests from the client until the client hangs up, where the server sees errors like ECONNRESET
Thanks Casey,
Suggesting that will check exactly which error is the 'normal' error
(ECONNRESET or other)
and add back the if to perform async_shutdown() during only the normal error
and no error
like for example:
if (!ec || ec == boost::asio::error::connection_reset) { ...
stream.async_shutdown() ...still interested in finding a root cause for the crash. are there really no rgw logs from qe? the dump leading up to the crash would be really valuable. @Tejas?
a note from https://www.openssl.org/docs/man1.1.1/man3/SSL_shutdown.html:
Note that SSL_shutdown() must not be called if a previous fatal error has occurred on a connection i.e. if SSL_get_error() has returned SSL_ERROR_SYSCALL or SSL_ERROR_SSL.
unrelated to the crash, but this is probably why it had the `if (!ec) {` condition. not all errors here would be fatal to the connection, though. for example, boost::asio::error::operation_aborted would indicate a read/write timeout on our end, but the connection would remain intact
ultimately we want to allow for ssl session reuse in all possible cases, but it would be useful to categorize which cases are really possible. part of the responsibility lies with the client to allow for clean shutdown before closing their end of the socket
for your `s_client --reconnect` reproducer in https://github.com/ceph/ceph/pull/55967, what error code leads to our call to async_shutdown()?
Updated by Mark Kogan 23 days ago · Edited
following testing with openssl s_client, s3cmd and Warp, the following ec's occur under normal conditions:
if (!ec || ec == ssl::error::stream_truncated || ec == http::error::end_of_stream) { // ssl shutdown (ignoring errors) stream.async_shutdown(yield[ec]); }
Updated by Casey Bodley 23 days ago
quoting https://www.boost.org/doc/libs/1_82_0/doc/html/boost_asio/reference/ssl__error__stream_errors.html:
stream_truncated: The underlying stream closed before the ssl stream gracefully shut down.
should we even try calling
async_shutdown()
in this case? do you see that from s_client
, and does calling async_shutdown()
allow for session id reuse?Updated by Mark Kogan 19 days ago · Edited
- Pull request ID set to 57155
ACK,
testing with:
echo "" | openssl s_client -connect localhost:8443 --reconnect -no_ticket -tls1_2 |& grep 'Session-ID:' Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
logging the `ec <<" / "<< ec.message()` it's
ec=beast.http:1 / end of stream ec=beast.http:1 / end of stream ec=beast.http:1 / end of stream ec=beast.http:1 / end of stream ec=beast.http:1 / end of stream ec=system:0 / Success
please see the PR which perform the async_shutdown() conditionally according to the above
note: in stress testing managed to repro the following `ec` errors, though they did not result in a crash (on the upstream main codebase)
ec=system:104 / Connection reset by peer ec=system:125 / Operation canceled ec=system:32 / Broken pipe ec=system:107 / Transport endpoint is not connected
Updated by Casey Bodley 17 days ago
- Tags set to beast ssl
- Backport set to quincy reef squid
Updated by Casey Bodley 17 days ago
- Related to Bug #65742: beast: revert changes to ssl async_shutdown() added
Updated by Casey Bodley 17 days ago
as discussed, we'll revert this for main/squid until we have a chance to validate the fix. the reverts are tracked in https://tracker.ceph.com/issues/65742
unfortunately, the quincy and reef backports of https://tracker.ceph.com/issues/64719 are already done
Updated by Mark Kogan 12 days ago · Edited
Updating:
Managed to repro the crash repeatedly and verify that the fix PR does resolve the issue.
Details:
Before the fix PR Reproduces consistently in start env during RGW shutdown under load
(which was performed in QEs tests also)
of a downstream ver ceph-18.2.0-189 (same ver that was reported by QE) branch
* (HEAD detached at ceph-18.2.0-189) cf8c965631c ceph-volume: default type to 'bluestore' in 'raw list'
Did not repro on main (albeit on a different machine which can affect the timing)
Reproduction occurs when performing RGW shut-down:
sudo pkill -2 radosgw # SIGINT
While under a high load of specific characteristics only - 900 concurrent GET req of 4KB size:
nice numactl -N 1 -m 1 -- warp get --host=127.0.0.1:8443 --tls=true --insecure --access-key=b2345678901234567890 --secret-key=b234567890123456789012345678901234567890 --concurrent=900 --obj.size=4K --duration=24h --objects=100000 --bucket="b01b-${RANDOM}" --noclear
The call-stack:
... -405> 2024-05-06T18:27:32.837+0000 7ffff7e0e040 -1 shutting down -12> 2024-05-06T18:27:32.842+0000 7ffec6093700 -1 *** Caught signal (Segmentation fault) ** in thread 7ffec6093700 thread_name:radosgw ceph version 18.2.0-603-gcf8c965631c (cf8c965631cc7e6281a5827dae32bd53b7a88a79) reef (stable) 1: /lib64/libpthread.so.0(+0x12cf0) [0x7ffff4573cf0] 2: (boost::asio::detail::reactive_socket_service_base::start_op(boost::asio::detail::reactive_socket_service_base::base_implementation_type&, int, boost::asio::detail::reactor_op*, bool, b ool, bool)+0x117) [0x555555a234d7] 3: (void boost::asio::detail::reactive_socket_service_base::async_receive<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip:: tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), b oost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >(boost::asi o::detail::reactive_socket_service_base::base_implementation_type&, boost::asio::mutable_buffers_1 const&, int, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio: :ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*) (), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >&, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const& )+0x1fc) [0x555555a3ce7c] 4: (boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::deta il::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >::operator()(boost::system::error_code, unsigned long, int)+0x5be) [0x555555a43c3e] 5: (boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_exec utor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_conte xt::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void *, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x17b) [0x555555a454bb] 6: (boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>::operator()()+0x85) [0x555555a57605] 7: (void boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul>::execute<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type <std::allocator<void>, 4ul> const, void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>&&) cons t+0x167) [0x555555a57827] 8: (void boost::asio::detail::strand_executor_service::do_execute<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, boost::asio::detail::binder2<boost::asio::s sl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, sp awn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::e rror_code, unsigned long>, std::allocator<void> >(std::shared_ptr<boost::asio::detail::strand_executor_service::strand_impl> const&, boost::asio::io_context::basic_executor_type<std::alloca tor<void>, 4ul> const&, boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std: :allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>&&, std::allocator<void> const&)+0x149) [0x555555a59709] 9: (boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_cont ext::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost: :asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::de tail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1bd) [0x555555a599fd] 10: ./bin/radosgw(+0x4d5059) [0x555555a29059] 11: ./bin/radosgw(+0x4d542a) [0x555555a2942a] 12: /lib64/libstdc++.so.6(+0xc2b23) [0x7ffff428eb23] 13: /lib64/libpthread.so.0(+0x81ca) [0x7ffff45691ca] 14: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
After the fix PR the seg fault does not repro anymore, and the openssl session reuse is working fine:
echo "" | openssl s_client -connect localhost:8443 --reconnect -no_ticket -tls1_2 |& grep 'Session-ID:' Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA
Updated by Casey Bodley 11 days ago
- Status changed from In Progress to Fix Under Review
Updated by Casey Bodley 9 days ago
- Status changed from Fix Under Review to Pending Backport
Updated by Casey Bodley 9 days ago
- Copied to Backport #65886: reef: Crash observed in boost::asio module related to stream.async_shutdown() added
Updated by Casey Bodley 9 days ago
- Copied to Backport #65887: quincy: Crash observed in boost::asio module related to stream.async_shutdown() added
Updated by Casey Bodley 9 days ago
- Copied to Backport #65888: squid: Crash observed in boost::asio module related to stream.async_shutdown() added
Updated by Casey Bodley 9 days ago
- Tags changed from beast ssl to beast ssl backport_processed