Bug #65664


Crash observed in boost::asio module related to stream.async_shutdown()

Added by Mark Kogan 9 days ago. Updated 3 days ago.

Status:
In Progress
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
beast ssl
Backport:
quincy reef squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

continuing from downstream BZ#2275284

call stack:

completing the missing callstack symbols using addr2line:

"backtrace": ['
"/lib64/libc.so.6(+0x54db0) [0x7fd314053db0]",'

"/usr/bin/radosgw(+0x33b8ea) [0x55783b4ae8ea]",'
0x000000000033b8ea: boost::asio::detail::epoll_reactor::start_op(int, int, boost::asio::detail::epoll_reactor::descriptor_state*&, boost::asio::detail::reactor_op*, bool, bool) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/epoll_reactor.ipp:246:3

"/usr/bin/radosgw(+0x35ba27) [0x55783b4cea27]",'
0x000000000035ba27: boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >::operator()(boost::system::error_code, unsigned long, int) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/reactive_socket_service_base.hpp:419:13

"(boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1d2) [0x55783b4ef882]",'

"/usr/bin/radosgw(+0x3807de) [0x55783b4f37de]",'
0x00000000003807de: boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>::operator()() at /usr/include/c++/11/bits/shared_ptr_base.h:1296:16

"/usr/bin/radosgw(+0x379910) [0x55783b4ec910]",'
0x0000000000379910: void boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul>::execute<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>&&) const at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.hpp:300:3

"(boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x6a6) [0x55783b4dda06]",'

"/usr/bin/radosgw(+0xb8534e) [0x55783bcf834e]",'
0x0000000000b8534e: boost::asio::detail::thread_info_base::rethrow_pending_exception() at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/thread_info_base.hpp:228:5
 (inlined by) boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:493:46
 (inlined by) boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .constprop.0] [clone .isra.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:210:20

"/usr/bin/radosgw(+0x3cf04d) [0x55783b54204d]",'
0x00000000003cf04d: std::thread::_State_impl<std::thread::_Invoker<std::tuple<(anonymous namespace)::AsioFrontend::run()::{lambda()#2}> > >::_M_run() [clone .lto_priv.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/system/detail/error_code.hpp:305:13

"/lib64/libstdc++.so.6(+0xdb924) [0x7fd3143db924]",'
"/lib64/libc.so.6(+0x9f802) [0x7fd31409e802]",'
"/lib64/libc.so.6(+0x3f450) [0x7fd31403e450]"'

continuing from the last BZ comment:

(In reply to Mark Kogan from comment #15)

i think you have this part backwards. the call had previously been wrapped in an `if (!ec) {` block, which means there was no error

errors here are common because of http keepalive: the server keeps trying to read more requests from the client until the client hangs up, at which point the server sees errors like ECONNRESET

Thanks Casey,
suggesting that we check exactly which error is the 'normal' error
(ECONNRESET or other)
and add back the `if` so that async_shutdown() is performed only on that normal error
or on no error, for example:

    if (!ec || ec == boost::asio::error::connection_reset) {
        stream.async_shutdown(yield[ec]);
    }

still interested in finding a root cause for the crash. are there really no rgw logs from qe? the dump leading up to the crash would be really valuable. @Tejas?

a note from https://www.openssl.org/docs/man1.1.1/man3/SSL_shutdown.html:

Note that SSL_shutdown() must not be called if a previous fatal error has occurred on a connection i.e. if SSL_get_error() has returned SSL_ERROR_SYSCALL or SSL_ERROR_SSL.

unrelated to the crash, but this is probably why it had the `if (!ec) {` condition. not all errors here would be fatal to the connection, though. for example, boost::asio::error::operation_aborted would indicate a read/write timeout on our end, but the connection would remain intact

ultimately we want to allow for ssl session reuse in all possible cases, but it would be useful to categorize which cases are really possible. part of the responsibility lies with the client to allow for clean shutdown before closing their end of the socket

for your `s_client --reconnect` reproducer in https://github.com/ceph/ceph/pull/55967, what error code leads to our call to async_shutdown()?


Related issues: 1 (1 open, 0 closed)

Related to rgw - Bug #65742: beast: revert changes to ssl async_shutdown() (Fix Under Review, Casey Bodley)

Actions #1

Updated by Mark Kogan 9 days ago · Edited

following testing with openssl s_client, s3cmd, and Warp, the following `ec` values occur under normal conditions:

    if (!ec || ec == ssl::error::stream_truncated || ec == http::error::end_of_stream) {    
        // ssl shutdown (ignoring errors)
        stream.async_shutdown(yield[ec]);
    }
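The idea behind this guard can be sketched with plain `std::error_code` values. The helper below is hypothetical and boost-free (the real condition uses the asio/beast error categories shown above, which this sketch cannot reference); it treats the clean case as shutdown-worthy and skips the transport-level errors reported later in this tracker:

```cpp
#include <system_error>

// Hypothetical sketch: decide whether attempting the TLS shutdown
// handshake is worthwhile for a given completion error. No error means
// the connection is still usable for a clean close_notify exchange;
// transport-level failures mean the shutdown can no longer be
// delivered and must be skipped.
bool should_attempt_ssl_shutdown(const std::error_code& ec) {
  if (!ec) {
    return true;  // clean end of the keepalive loop
  }
  // transport errors observed in stress testing; treated conservatively
  // here (a timeout/cancel may leave the connection intact, but this
  // sketch only shuts down on the unambiguous case)
  if (ec == std::errc::connection_reset ||   // ECONNRESET
      ec == std::errc::broken_pipe ||        // EPIPE
      ec == std::errc::not_connected ||      // ENOTCONN
      ec == std::errc::operation_canceled) { // ECANCELED
    return false;
  }
  // anything else needs case-by-case categorization; skip by default
  return false;
}
```

The `ec == std::errc::…` comparisons work because `std::errc` converts implicitly to `std::error_condition`, which compares against an `error_code` through its category.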

Actions #2

Updated by Casey Bodley 9 days ago

quoting https://www.boost.org/doc/libs/1_82_0/doc/html/boost_asio/reference/ssl__error__stream_errors.html:

stream_truncated: The underlying stream closed before the ssl stream gracefully shut down.

should we even try calling async_shutdown() in this case? do you see that from s_client, and does calling async_shutdown() allow for session id reuse?

Actions #3

Updated by Mark Kogan 4 days ago · Edited

  • Pull request ID set to 57155

ACK,

testing with:

echo "" | openssl s_client -connect localhost:8443 --reconnect -no_ticket -tls1_2 |& grep 'Session-ID:'
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F

logging `ec << " / " << ec.message()`, the output is:

ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=system:0 / Success          

please see the PR, which performs the async_shutdown() conditionally according to the above

note: in stress testing I managed to reproduce the following `ec` errors, though they did not result in a crash (on the upstream main codebase):

ec=system:104 / Connection reset by peer
ec=system:125 / Operation canceled
ec=system:32 / Broken pipe
ec=system:107 / Transport endpoint is not connected
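For cross-reference, these `system:<n>` values are plain POSIX errno numbers surfaced through the system error category (ECONNRESET=104, ECANCELED=125, EPIPE=32, ENOTCONN=107 on Linux/glibc). A minimal sketch with a hypothetical helper that renders an error the same way the logging above does:

```cpp
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical helper: format an error as "category:value / message",
// matching the `ec << " / " << ec.message()` log lines in this tracker.
std::string format_ec(const std::error_code& ec) {
  return std::string(ec.category().name()) + ":" +
         std::to_string(ec.value()) + " / " + ec.message();
}
```

For example, `format_ec(std::error_code(ECONNRESET, std::system_category()))` reproduces a line of the `ec=system:104 / Connection reset by peer` form (exact numbers and message text are platform-dependent).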

Actions #4

Updated by Casey Bodley 3 days ago

  • Tags set to beast ssl
  • Backport set to quincy reef squid
Actions #5

Updated by Casey Bodley 3 days ago

  • Related to Bug #65742: beast: revert changes to ssl async_shutdown() added
Actions #6

Updated by Casey Bodley 3 days ago

as discussed, we'll revert this for main/squid until we have a chance to validate the fix. the reverts are tracked in https://tracker.ceph.com/issues/65742

unfortunately, the quincy and reef backports of https://tracker.ceph.com/issues/64719 are already done
