Project

General

Profile

Actions

Bug #65664

open

Crash observed in boost::asio module related to stream.async_shutdown()

Added by Mark Kogan 23 days ago. Updated 9 days ago.

Status:
Pending Backport
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
beast ssl backport_processed
Backport:
quincy reef squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

continuing from downstream BZ#2275284

call stack:

completing the missing callstack symbols using addr2line:

"backtrace": ['
"/lib64/libc.so.6(+0x54db0) [0x7fd314053db0]",'

"/usr/bin/radosgw(+0x33b8ea) [0x55783b4ae8ea]",'
0x000000000033b8ea: boost::asio::detail::epoll_reactor::start_op(int, int, boost::asio::detail::epoll_reactor::descriptor_state*&, boost::asio::detail::reactor_op*, bool, bool) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/epoll_reactor.ipp:246:3

"/usr/bin/radosgw(+0x35ba27) [0x55783b4cea27]",'
0x000000000035ba27: boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >::operator()(boost::system::error_code, unsigned long, int) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/reactive_socket_service_base.hpp:419:13

"(boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1d2) [0x55783b4ef882]",'

"/usr/bin/radosgw(+0x3807de) [0x55783b4f37de]",'
0x00000000003807de: boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>::operator()() at /usr/include/c++/11/bits/shared_ptr_base.h:1296:16

"/usr/bin/radosgw(+0x379910) [0x55783b4ec910]",'
0x0000000000379910: void boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul>::execute<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>&&) const at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.hpp:300:3

"(boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x6a6) [0x55783b4dda06]",'

"/usr/bin/radosgw(+0xb8534e) [0x55783bcf834e]",'
0x0000000000b8534e: boost::asio::detail::thread_info_base::rethrow_pending_exception() at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/thread_info_base.hpp:228:5
 (inlined by) boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:493:46
 (inlined by) boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .constprop.0] [clone .isra.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:210:20

"/usr/bin/radosgw(+0x3cf04d) [0x55783b54204d]",'
0x00000000003cf04d: std::thread::_State_impl<std::thread::_Invoker<std::tuple<(anonymous namespace)::AsioFrontend::run()::{lambda()#2}> > >::_M_run() [clone .lto_priv.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/system/detail/error_code.hpp:305:13

"/lib64/libstdc++.so.6(+0xdb924) [0x7fd3143db924]",'
"/lib64/libc.so.6(+0x9f802) [0x7fd31409e802]",'
"/lib64/libc.so.6(+0x3f450) [0x7fd31403e450]"'

continuing from the last BZ comment:

(In reply to Mark Kogan from comment #15)

i think you have this part backwards. the call had previously been wrapped in a `if (!ec) {` block which means there was no error

errors here are common because of http keepalive. the server keeps trying to read more requests from the client until the client hangs up, where the server sees errors like ECONNRESET

Thanks Casey,
Suggesting that will check exactly which error is the 'normal' error
(ECONNRESET or other)
and add back the if to perform async_shutdown() during only the normal error
and no error
like for example:
if (!ec || ec == boost::asio::error::connection_reset) { ...
stream.async_shutdown() ...

still interested in finding a root cause for the crash. are there really no rgw logs from qe? the dump leading up to the crash would be really valuable. @Tejas?

a note from https://www.openssl.org/docs/man1.1.1/man3/SSL_shutdown.html:

Note that SSL_shutdown() must not be called if a previous fatal error has occurred on a connection i.e. if SSL_get_error() has returned SSL_ERROR_SYSCALL or SSL_ERROR_SSL.

unrelated to the crash, but this is probably why it had the `if (!ec) {` condition. not all errors here would be fatal to the connection, though. for example, boost::asio::error::operation_aborted would indicate a read/write timeout on our end, but the connection would remain intact

ultimately we want to allow for ssl session reuse in all possible cases, but it would be useful to categorize which cases are really possible. part of the responsibility lies with the client to allow for clean shutdown before closing their end of the socket

for your `s_client --reconnect` reproducer in https://github.com/ceph/ceph/pull/55967, what error code leads to our call to async_shutdown()?


Related issues 4 (3 open1 closed)

Related to rgw - Bug #65742: beast: revert changes to ssl async_shutdown()RejectedCasey Bodley

Actions
Copied to rgw - Backport #65886: reef: Crash observed in boost::asio module related to stream.async_shutdown()In ProgressMark KoganActions
Copied to rgw - Backport #65887: quincy: Crash observed in boost::asio module related to stream.async_shutdown()In ProgressMark KoganActions
Copied to rgw - Backport #65888: squid: Crash observed in boost::asio module related to stream.async_shutdown()In ProgressCasey BodleyActions
Actions #1

Updated by Mark Kogan 23 days ago · Edited

following testing with openssl s_client, s3cmd and Warp, the following ec's occur under normal conditions:

    if (!ec || ec == ssl::error::stream_truncated || ec == http::error::end_of_stream) {    
        // ssl shutdown (ignoring errors)
        stream.async_shutdown(yield[ec]);
    }

Actions #2

Updated by Casey Bodley 23 days ago

quoting https://www.boost.org/doc/libs/1_82_0/doc/html/boost_asio/reference/ssl__error__stream_errors.html:

stream_truncated: The underlying stream closed before the ssl stream gracefully shut down.

should we even try calling async_shutdown() in this case? do you see that from s_client, and does calling async_shutdown() allow for session id reuse?

Actions #3

Updated by Mark Kogan 18 days ago · Edited

  • Pull request ID set to 57155

ACK,

testing with:

echo "" | openssl s_client -connect localhost:8443 --reconnect -no_ticket -tls1_2 |& grep 'Session-ID:'
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F
    Session-ID: E16D5C7C5AF8E796407671764F550938DE44CC905CC8907BD267B3748B6EE72F

logging the `ec <<" / "<< ec.message()` it's

ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=beast.http:1 / end of stream
ec=system:0 / Success          

please see the PR which perform the async_shutdown() conditionally according to the above

note: in stress testing managed to repro the following `ec` errors, though they did not result in a crash (on the upstream main codebase)

ec=system:104 / Connection reset by peer
ec=system:125 / Operation canceled
ec=system:32 / Broken pipe
ec=system:107 / Transport endpoint is not connected

Actions #4

Updated by Casey Bodley 17 days ago

  • Tags set to beast ssl
  • Backport set to quincy reef squid
Actions #5

Updated by Casey Bodley 17 days ago

  • Related to Bug #65742: beast: revert changes to ssl async_shutdown() added
Actions #6

Updated by Casey Bodley 17 days ago

as discussed, we'll revert this for main/squid until we have a chance to validate the fix. the reverts are tracked in https://tracker.ceph.com/issues/65742

unfortunately, the quincy and reef backports of https://tracker.ceph.com/issues/64719 are already done

Actions #7

Updated by Mark Kogan 11 days ago · Edited

Updating:

Managed to repro the crash repeatedly and verify that the fix PR does resolve the issue.

Details:

Before the fix PR Reproduces consistently in start env during RGW shutdown under load
(which was performed in QEs tests also)
of a downstream ver ceph-18.2.0-189 (same ver that was reported by QE) branch
* (HEAD detached at ceph-18.2.0-189) cf8c965631c ceph-volume: default type to 'bluestore' in 'raw list'
Did not repro on main (albeit on a different machine which can affect the timing)

Reproduction occurs when performing RGW shut-down:

sudo pkill -2 radosgw  # SIGINT

While under a high load of specific characteristics only - 900 concurrent GET req of 4KB size:
nice numactl -N 1 -m 1 -- warp get --host=127.0.0.1:8443 --tls=true --insecure --access-key=b2345678901234567890 --secret-key=b234567890123456789012345678901234567890 --concurrent=900 --obj.size=4K --duration=24h --objects=100000 --bucket="b01b-${RANDOM}" --noclear

The call-stack:

...
  -405> 2024-05-06T18:27:32.837+0000 7ffff7e0e040 -1 shutting down
   -12> 2024-05-06T18:27:32.842+0000 7ffec6093700 -1 *** Caught signal (Segmentation fault) **
 in thread 7ffec6093700 thread_name:radosgw

 ceph version 18.2.0-603-gcf8c965631c (cf8c965631cc7e6281a5827dae32bd53b7a88a79) reef (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7ffff4573cf0]
 2: (boost::asio::detail::reactive_socket_service_base::start_op(boost::asio::detail::reactive_socket_service_base::base_implementation_type&, int, boost::asio::detail::reactor_op*, bool, b
ool, bool)+0x117) [0x555555a234d7]
 3: (void boost::asio::detail::reactive_socket_service_base::async_receive<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::
tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), b
oost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >(boost::asi
o::detail::reactive_socket_service_base::base_implementation_type&, boost::asio::mutable_buffers_1 const&, int, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio:
:ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)
(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >&, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const&
)+0x1fc) [0x555555a3ce7c]
 4: (boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::deta
il::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void>
>::operator()(boost::system::error_code, unsigned long, int)+0x5be) [0x555555a43c3e]
 5: (boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_exec
utor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_conte
xt::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void
*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x17b) [0x555555a454bb]
 6: (boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>::operator()()+0x85) [0x555555a57605]
 7: (void boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul>::execute<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type
<std::allocator<void>, 4ul> const, void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>&&) cons
t+0x167) [0x555555a57827]
 8: (void boost::asio::detail::strand_executor_service::do_execute<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, boost::asio::detail::binder2<boost::asio::s
sl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, sp
awn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::e
rror_code, unsigned long>, std::allocator<void> >(std::shared_ptr<boost::asio::detail::strand_executor_service::strand_impl> const&, boost::asio::io_context::basic_executor_type<std::alloca
tor<void>, 4ul> const&, boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std:
:allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>&&, std::allocator<void> const&)+0x149) [0x555555a59709]
 9: (boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_cont
ext::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost:
:asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::de
tail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1bd) [0x555555a599fd]
 10: ./bin/radosgw(+0x4d5059) [0x555555a29059]
 11: ./bin/radosgw(+0x4d542a) [0x555555a2942a]
 12: /lib64/libstdc++.so.6(+0xc2b23) [0x7ffff428eb23]
 13: /lib64/libpthread.so.0(+0x81ca) [0x7ffff45691ca]
 14: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

After the fix PR the seg fault does not repro anymore, and the openssl session reuse is working fine:

echo "" | openssl s_client -connect localhost:8443 --reconnect -no_ticket -tls1_2 |& grep 'Session-ID:'
    Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA
    Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA
    Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA
    Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA
    Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA
    Session-ID: C84271A7E7B7A28C62EAD82391171BEEC01EA9BC69C0DA8623A0E8E9A057EAEA

Actions #8

Updated by Casey Bodley 11 days ago

  • Status changed from In Progress to Fix Under Review
Actions #9

Updated by Casey Bodley 9 days ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by Casey Bodley 9 days ago

  • Copied to Backport #65886: reef: Crash observed in boost::asio module related to stream.async_shutdown() added
Actions #11

Updated by Casey Bodley 9 days ago

  • Copied to Backport #65887: quincy: Crash observed in boost::asio module related to stream.async_shutdown() added
Actions #12

Updated by Casey Bodley 9 days ago

  • Copied to Backport #65888: squid: Crash observed in boost::asio module related to stream.async_shutdown() added
Actions #13

Updated by Casey Bodley 9 days ago

  • Tags changed from beast ssl to beast ssl backport_processed
Actions

Also available in: Atom PDF