Bug #65664
Crash observed in boost::asio module related to stream.async_shutdown()
Status: open
Description
continuing from downstream BZ#2275284
call stack:
completing the missing callstack symbols using addr2line:
"backtrace": [
"/lib64/libc.so.6(+0x54db0) [0x7fd314053db0]",
"/usr/bin/radosgw(+0x33b8ea) [0x55783b4ae8ea]",
  0x000000000033b8ea: boost::asio::detail::epoll_reactor::start_op(int, int, boost::asio::detail::epoll_reactor::descriptor_state*&, boost::asio::detail::reactor_op*, bool, bool) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/epoll_reactor.ipp:246:3
"/usr/bin/radosgw(+0x35ba27) [0x55783b4cea27]",
  0x000000000035ba27: boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >::operator()(boost::system::error_code, unsigned long, int) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/reactive_socket_service_base.hpp:419:13
"(boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1d2) [0x55783b4ef882]",
"/usr/bin/radosgw(+0x3807de) [0x55783b4f37de]",
  0x00000000003807de: boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>::operator()() at /usr/include/c++/11/bits/shared_ptr_base.h:1296:16
"/usr/bin/radosgw(+0x379910) [0x55783b4ec910]",
  0x0000000000379910: void boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul>::execute<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>&&) const at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.hpp:300:3
"(boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x6a6) [0x55783b4dda06]",
"/usr/bin/radosgw(+0xb8534e) [0x55783bcf834e]",
  0x0000000000b8534e: boost::asio::detail::thread_info_base::rethrow_pending_exception() at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/thread_info_base.hpp:228:5
  (inlined by) boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:493:46
  (inlined by) boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .constprop.0] [clone .isra.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:210:20
"/usr/bin/radosgw(+0x3cf04d) [0x55783b54204d]",
  0x00000000003cf04d: std::thread::_State_impl<std::thread::_Invoker<std::tuple<(anonymous namespace)::AsioFrontend::run()::{lambda()#2}> > >::_M_run() [clone .lto_priv.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/system/detail/error_code.hpp:305:13
"/lib64/libstdc++.so.6(+0xdb924) [0x7fd3143db924]",
"/lib64/libc.so.6(+0x9f802) [0x7fd31409e802]",
"/lib64/libc.so.6(+0x3f450) [0x7fd31403e450]"
continuing from the last BZ comment:
(In reply to Mark Kogan from comment #15)
i think you have this part backwards. the call had previously been wrapped in an `if (!ec) {` block, which means it only ran when there was no error
errors here are common because of http keepalive. the server keeps trying to read more requests from the client until the client hangs up, at which point the server sees errors like ECONNRESET
Thanks Casey,
I suggest checking exactly which error is the 'normal' error (ECONNRESET or other) and adding back the `if`, so that async_shutdown() is performed only when there is no error or that normal error, for example:
if (!ec || ec == boost::asio::error::connection_reset) { ...
stream.async_shutdown() ...

still interested in finding a root cause for the crash. are there really no rgw logs from qe? the dump leading up to the crash would be really valuable. @Tejas?
a note from https://www.openssl.org/docs/man1.1.1/man3/SSL_shutdown.html:
Note that SSL_shutdown() must not be called if a previous fatal error has occurred on a connection i.e. if SSL_get_error() has returned SSL_ERROR_SYSCALL or SSL_ERROR_SSL.
unrelated to the crash, but this is probably why it had the `if (!ec) {` condition. not all errors here would be fatal to the connection, though. for example, boost::asio::error::operation_aborted would indicate a read/write timeout on our end, but the connection would remain intact
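The rule quoted from the SSL_shutdown() man page can be sketched as a small guard. This is an illustration only: the enum is a local stand-in for the results of SSL_get_error(), and `may_call_ssl_shutdown` is a hypothetical name that does not come from OpenSSL or rgw.

```cpp
// Local stand-ins for the results of SSL_get_error(). The real constants
// (SSL_ERROR_NONE, SSL_ERROR_WANT_READ, SSL_ERROR_SYSCALL, SSL_ERROR_SSL, ...)
// live in <openssl/ssl.h>; this enum only models them for illustration.
enum class LastSslError { none, want_read, want_write, zero_return, syscall, ssl };

// Per the man page note: SSL_shutdown() must not be called once a fatal error
// (SSL_ERROR_SYSCALL or SSL_ERROR_SSL) has been reported on the connection.
constexpr bool may_call_ssl_shutdown(LastSslError last) {
  return last != LastSslError::syscall && last != LastSslError::ssl;
}
```

A caller would record the last SSL_get_error() result per connection and consult the guard before attempting shutdown. How the asio error codes seen by rgw map onto these OpenSSL results is exactly the open question in this thread, so the sketch does not attempt that mapping.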
ultimately we want to allow for ssl session reuse in all possible cases, but it would be useful to categorize which cases are really possible. part of the responsibility lies with the client to allow for clean shutdown before closing their end of the socket
for your `s_client --reconnect` reproducer in https://github.com/ceph/ceph/pull/55967, what error code leads to our call to async_shutdown()?
Updated by Mark Kogan 9 days ago · Edited
following testing with openssl s_client, s3cmd and Warp, the following ec's occur under normal conditions:
if (!ec || ec == ssl::error::stream_truncated || ec == http::error::end_of_stream) {
  // ssl shutdown (ignoring errors)
  stream.async_shutdown(yield[ec]);
}
Updated by Casey Bodley 9 days ago
quoting https://www.boost.org/doc/libs/1_82_0/doc/html/boost_asio/reference/ssl__error__stream_errors.html:
stream_truncated: The underlying stream closed before the ssl stream gracefully shut down.
should we even try calling async_shutdown() in this case? do you see that from s_client, and does calling async_shutdown() allow for session id reuse?
Updated by Mark Kogan 4 days ago · Edited
- Pull request ID set to 57155
ACK,
testing with:
echo "" | openssl s_client -connect localhost:8443 --reconnect -no_ticket -tls1_2 |& grep 'Session-ID:'
Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
logging `ec << " / " << ec.message()` shows:
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=system:0 / Success
please see the PR, which performs the async_shutdown() conditionally according to the above
note: in stress testing managed to repro the following `ec` errors, though they did not result in a crash (on the upstream main codebase)
ec=system:104 / Connection reset by peer
ec=system:125 / Operation canceled
ec=system:32 / Broken pipe
ec=system:107 / Transport endpoint is not connected
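To make it explicit which of the observed codes would lead to a shutdown attempt under the condition proposed earlier in this thread, here is a Boost-free model of that predicate. It is a sketch under stated assumptions: the `Observed` enum and `attempts_shutdown` are hypothetical names, and the real code compares boost::system::error_code values rather than a local enum.

```cpp
// Stand-ins for the error codes reported in this thread. In the real code
// these are boost::system::error_code values, not a local enum.
enum class Observed {
  success,             // system:0 / Success
  end_of_stream,       // beast.http:1 / end of stream
  stream_truncated,    // ssl::error::stream_truncated
  connection_reset,    // system:104 / Connection reset by peer
  operation_canceled,  // system:125 / Operation canceled
  broken_pipe,         // system:32 / Broken pipe
  not_connected        // system:107 / Transport endpoint is not connected
};

// Mirrors the proposed condition: attempt the TLS shutdown only for success,
// end_of_stream and stream_truncated; every other code skips it.
constexpr bool attempts_shutdown(Observed ec) {
  return ec == Observed::success
      || ec == Observed::end_of_stream
      || ec == Observed::stream_truncated;
}
```

Under this model the four codes from the stress test all skip async_shutdown(). Whether stream_truncated belongs in the allowed set is still being questioned above, so treat its inclusion here as a reflection of the proposal, not a settled answer.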
Updated by Casey Bodley 3 days ago
- Tags set to beast ssl
- Backport set to quincy reef squid
Updated by Casey Bodley 3 days ago
- Related to Bug #65742: beast: revert changes to ssl async_shutdown() added
Updated by Casey Bodley 3 days ago
as discussed, we'll revert this for main/squid until we have a chance to validate the fix. the reverts are tracked in https://tracker.ceph.com/issues/65742
unfortunately, the quincy and reef backports of https://tracker.ceph.com/issues/64719 are already done