Bug #65337

rgw: Segmentation fault in rgw::notify::Manager during realm reload

Added by J. Eric Ivancich 27 days ago. Updated 13 days ago.

Status: Fix Under Review
Priority: Urgent
Target version:
% Done: 0%
Source:
Tags: notifications
Backport: squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Notice that rgw::notify::Manager::process_queue is embedded in the stack trace, which leads me to believe it's a watch/notify issue.

2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout: 0> 2024-04-04T16:21:07.188+0000 7f33fb55a640 -1 *** Caught signal (Segmentation fault) **
2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout: in thread 7f33fb55a640 thread_name:safe_timer
2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout:
2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout: ceph version 19.0.0-2638-g6c2eea20 (6c2eea201c76ffcac93b7a9835b2ecce3eee2d92) squid (dev)
2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f352f7be520]
2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout: 2: radosgw(+0x898dfc) [0x56275dcc0dfc]
2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout: 3: radosgw(+0x398015) [0x56275d7c0015]
2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout: 4: (void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor>, rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#7}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)+0x55) [0x56275dcc6315]
2024-04-04T16:21:07.214 INFO:tasks.rgw.client.1.smithi136.stdout: 5: make_fcontext()

Teuthology run: https://pulpito.ceph.com/ivancich-2024-04-04_14:54:02-rgw-wip-eric-testing-1-distro-default-smithi/7640017/
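
For readers unfamiliar with the mangled frames above: frames 4-5 are the entry glue of a stackful coroutine spawned by rgw::notify::Manager::process_queue. A hedged illustration of the general hazard, using boost::asio::spawn as a stand-in for the spawn library in the trace (illustrative code, not Ceph's):

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <chrono>
#include <memory>

// Illustrative only: a coroutine spawned from a member function captures
// `this`; if the owner is destroyed while the coroutine is suspended,
// every later resume touches freed memory.
struct Owner {
  boost::asio::io_context& io;
  int counter = 0;
  explicit Owner(boost::asio::io_context& io_) : io(io_) {}

  void start() {
    boost::asio::spawn(io, [this](boost::asio::yield_context yield) {
      boost::asio::steady_timer timer{io};
      timer.expires_after(std::chrono::milliseconds(10));
      timer.async_wait(yield);  // suspend; resumes later on the io_context
      ++counter;                // use-after-free if *this is already gone
    });
  }
};

int main() {
  boost::asio::io_context io;
  auto owner = std::make_unique<Owner>(io);
  owner->start();
  owner.reset();  // destroy the owner while its coroutine is suspended
  io.run();       // the coroutine resumes and writes into freed memory
}
```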

Actions #1

Updated by Casey Bodley 27 days ago

  • Assignee set to Yuval Lifshitz
  • Priority changed from Normal to Urgent

Notice that rgw::notify::Manager::process_queue is embedded in the stack trace, which leads me to believe it's a watch/notify issue.

this rgw::notify::Manager class is responsible for sending the notification events behind the s3 bucket notification feature, so it's not related to rados watch/notify

in thread 7f33fb55a640 thread_name:safe_timer

i see that this happened during the new rgw/notifications test case "test data path v2 persistent migration". that test issues radosgw-admin period commit commands, which trigger realm reloads, so i'm guessing this safe_timer thread corresponds to the SafeTimer member of class RGWRealmReloader
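
A minimal, self-contained sketch of that mechanism (hypothetical code, not the actual RGWRealmReloader/SafeTimer implementation): the reload callback runs on the timer's own dispatch thread, so when it pauses the frontends and tears down the store, any fault it triggers is reported under thread_name:safe_timer:

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

// Stand-in for Ceph's SafeTimer: a single worker thread (playing the role
// of the "safe_timer" thread) runs a callback after a delay unless the
// timer is shut down first.
class ReloadTimer {
  std::thread worker;
  std::mutex m;
  std::condition_variable cv;
  bool stopping = false;

public:
  void schedule(std::function<void()> fn, std::chrono::milliseconds delay) {
    worker = std::thread([this, fn = std::move(fn), delay] {
      std::unique_lock l{m};
      if (!cv.wait_for(l, delay, [this] { return stopping; })) {
        fn();  // timer fired: the "reload" runs on this thread
      }
    });
  }
  ~ReloadTimer() {
    { std::lock_guard l{m}; stopping = true; }
    cv.notify_all();
    if (worker.joinable()) worker.join();
  }
};

int main() {
  ReloadTimer timer;
  // a "period commit" notification arrives -> schedule the realm reload
  timer.schedule([] {
    std::cout << "rgw realm reloader: pausing frontends...\n"
              << "destroying the old store: notify coroutines still\n"
              << "referencing it would crash right here, in safe_timer\n"
              << "creating the new store, resuming frontends\n";
  }, std::chrono::milliseconds(50));
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```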

Actions #2

Updated by Casey Bodley 27 days ago

the rgw log corresponding to rgw.client.1 is flooded with curl errors of the form:

Couldn't connect to server req_data->error_buf=Failed to connect to localhost port 10474 after 4 ms: Connection refused

these errors continue to spam the log after the realm reloader starts shutting down the rados store

2024-04-04T16:21:07.020+0000 7f35287d5640 20 link_request req_data=0x56276354d0e0 req_data->id=5343654, curl_handle=0x5627652027a0
2024-04-04T16:21:07.020+0000 7f35287d5640 20 link_request req_data=0x5627633efa40 req_data->id=5343655, curl_handle=0x562765202f40
2024-04-04T16:21:07.020+0000 7f35287d5640 20 link_request req_data=0x5627633ef2c0 req_data->id=5343656, curl_handle=0x5627610358e0
2024-04-04T16:21:07.020+0000 7f35287d5640 20 link_request req_data=0x562760f252c0 req_data->id=5343657, curl_handle=0x562760c0d440
2024-04-04T16:21:07.020+0000 7f35287d5640 20 link_request req_data=0x562760f25860 req_data->id=5343658, curl_handle=0x5627633a2c00
2024-04-04T16:21:07.024+0000 7f33fad59640  4 rgw period pusher: No zones to update
2024-04-04T16:21:07.024+0000 7f33fad59640  4 rgw realm reloader: Notification on realm, reconfiguration scheduled
2024-04-04T16:21:07.024+0000 7f33fb55a640  1 rgw realm reloader: Pausing frontends for realm update...
2024-04-04T16:21:07.024+0000 7f33fb55a640  4 frontend pausing connections...
2024-04-04T16:21:07.024+0000 7f33fb55a640  4 frontend paused
2024-04-04T16:21:07.024+0000 7f33fb55a640  4 rgw period pusher: paused for realm update
2024-04-04T16:21:07.024+0000 7f33fb55a640  1 rgw realm reloader: Frontends paused
2024-04-04T16:21:07.024+0000 7f35287d5640 20 ERROR: msg->data.result=7 req_data->id=5343564 http_status=0
2024-04-04T16:21:07.024+0000 7f35287d5640 20 ERROR: curl error: Couldn't connect to server req_data->error_buf=Failed to connect to localhost port 10474 after 4 ms: Connection refused
2024-04-04T16:21:07.024+0000 7f35287d5640 20 ERROR: msg->data.result=7 req_data->id=5343565 http_status=0
2024-04-04T16:21:07.024+0000 7f35287d5640 20 ERROR: curl error: Couldn't connect to server req_data->error_buf=Failed to connect to localhost port 10474 after 4 ms: Connection refused

...
  -611> 2024-04-04T16:21:07.176+0000 7f33fed82640  5 rgw notify: WARNING: push entry marker: 0/136052 failed. error: -5 (will retry) for event with notification id: 'rbuiyj-14_notif', topic: 'rbuiyj-14_topic', endpoint: 'http://localhost:10474', bucket_owner: 'foo.client.0', bucket: 'rbuiyj-14', object: 'key-99', event type: 'ObjectRemoved:Delete'
  -610> 2024-04-04T16:21:07.176+0000 7f33fed82640  5 rgw notify: WARNING: push entry marker: 0/136623 failed. error: -5 (will retry) for event with notification id: 'rbuiyj-14_notif', topic: 'rbuiyj-14_topic', endpoint: 'http://localhost:10474', bucket_owner: 'foo.client.0', bucket: 'rbuiyj-14', object: 'key-93', event type: 'ObjectRemoved:Delete'
  -609> 2024-04-04T16:21:07.176+0000 7f33fed82640  5 rgw notify: WARNING: push entry marker: 0/137193 failed. error: -5 (will retry) for event with notification id: 'rbuiyj-14_notif', topic: 'rbuiyj-14_topic', endpoint: 'http://localhost:10474', bucket_owner: 'foo.client.0', bucket: 'rbuiyj-14', object: 'key-75', event type: 'ObjectRemoved:Delete'
  -608> 2024-04-04T16:21:07.176+0000 7f33fed82640  5 rgw notify: WARNING: push entry marker: 0/137763 failed. error: -5 (will retry) for event with notification id: 'rbuiyj-14_notif', topic: 'rbuiyj-14_topic', endpoint: 'http://localhost:10474', bucket_owner: 'foo.client.0', bucket: 'rbuiyj-14', object: 'key-88', event type: 'ObjectRemoved:Delete'
  -607> 2024-04-04T16:21:07.176+0000 7f33fb55a640 20 remove_watcher() i=5
  -606> 2024-04-04T16:21:07.180+0000 7f33fed82640 20 sending request to http://localhost:10474
  -605> 2024-04-04T16:21:07.180+0000 7f33fed82640 20 register_request mgr=0x562760a56000 req_data->id=5345459, curl_handle=0x56276407ef40
  -604> 2024-04-04T16:21:07.180+0000 7f35287d5640 20 link_request req_data=0x562760f25c20 req_data->id=5345459, curl_handle=0x56276407ef40
  -603> 2024-04-04T16:21:07.180+0000 7f33fed82640 20 sending request to http://localhost:10474

...
  -520> 2024-04-04T16:21:07.180+0000 7f33fed82640 20 register_request mgr=0x562760a56000 req_data->id=5345492, curl_handle=0x5627610089a0
  -519> 2024-04-04T16:21:07.180+0000 7f35287d5640 20 ERROR: msg->data.result=7 req_data->id=5345460 http_status=0
  -518> 2024-04-04T16:21:07.180+0000 7f35287d5640 20 ERROR: curl error: Couldn't connect to server req_data->error_buf=Failed to connect to localhost port 10474 after 0 ms: Connection refused
  -517> 2024-04-04T16:21:07.180+0000 7f35287d5640 20 ERROR: msg->data.result=7 req_data->id=5345461 http_status=0
  -516> 2024-04-04T16:21:07.180+0000 7f35287d5640 20 ERROR: curl error: Couldn't connect to server req_data->error_buf=Failed to connect to localhost port 10474 after 0 ms: Connection refused
  -515> 2024-04-04T16:21:07.180+0000 7f35287d5640 20 ERROR: msg->data.result=7 req_data->id=5345462 http_status=0
  -514> 2024-04-04T16:21:07.180+0000 7f33fed82640 20 sending request to http://localhost:10474
  -513> 2024-04-04T16:21:07.180+0000 7f35287d5640 20 ERROR: curl error: Couldn't connect to server req_data->error_buf=Failed to connect to localhost port 10474 after 0 ms: Connection refused

...
    -6> 2024-04-04T16:21:07.184+0000 7f33fed82640 20 register_request mgr=0x562760a56000 req_data->id=5345656, curl_handle=0x562763d82e80
    -5> 2024-04-04T16:21:07.184+0000 7f33fed82640 20 sending request to http://localhost:10474
    -4> 2024-04-04T16:21:07.184+0000 7f33fed82640 20 register_request mgr=0x562760a56000 req_data->id=5345657, curl_handle=0x56276337d3a0
    -3> 2024-04-04T16:21:07.184+0000 7f33fed82640 20 sending request to http://localhost:10474
    -2> 2024-04-04T16:21:07.184+0000 7f33fed82640 20 register_request mgr=0x562760a56000 req_data->id=5345658, curl_handle=0x56276347bf80
    -1> 2024-04-04T16:21:07.184+0000 7f35287d5640 20 unregister_request mgr=0x562760a56000 req_data->id=5345494, curl_handle=0
     0> 2024-04-04T16:21:07.188+0000 7f33fb55a640 -1 *** Caught signal (Segmentation fault) **
 in thread 7f33fb55a640 thread_name:safe_timer

 ceph version 19.0.0-2638-g6c2eea20 (6c2eea201c76ffcac93b7a9835b2ecce3eee2d92) squid (dev)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f352f7be520]
 2: radosgw(+0x898dfc) [0x56275dcc0dfc]
 3: radosgw(+0x398015) [0x56275d7c0015]
 4: (void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor>, rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#7}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)+0x55) [0x56275dcc6315]
 5: make_fcontext()

Actions #3

Updated by Casey Bodley 27 days ago

  • Subject changed from rgw: Segmentation fault in code related to watch/notify to rgw: Segmentation fault in rgw::notify::Manager during realm reload

Actions #4

Updated by Casey Bodley 22 days ago

i managed to reproduce under valgrind. this report of use-after-free looks relevant:

<error>
  <unique>0x2c8156</unique>
  <tid>1</tid>
  <kind>InvalidRead</kind>
  <what>Invalid read of size 4</what>
  <stack>
    <frame>
      <ip>0x65ABF74</ip>
      <obj>/usr/lib/x86_64-linux-gnu/libc.so.6</obj>
      <fn>pthread_mutex_lock@@GLIBC_2.2.5</fn>
      <dir>./nptl/./nptl</dir>
      <file>pthread_mutex_lock.c</file>
      <line>80</line>
    </frame>
    <frame>
      <ip>0x5DC54A</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x640BB4</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>void boost::asio::execution::detail::any_executor_base::execute_ex&lt;boost::asio::strand&lt;boost::asio::io_context::basic_executor_type&lt;std::allocator&lt;void&gt;, 0ul&gt; &gt; &gt;(boost::asio::execution::detail::any_executor_base const&amp;, boost::asio::detail::executor_function&amp;&amp;)</fn>
    </frame>
    <frame>
      <ip>0x780102</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x120BE39</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x77F9AA</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x9B6886</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>std::_Function_handler&lt;void (int), RGWPubSubKafkaEndpoint::send_to_completion_async(ceph::common::CephContext*, rgw_pubsub_s3_event const&amp;, optional_yield)::{lambda(int)#1}&gt;::_M_invoke(std::_Any_data const&amp;, int&amp;&amp;)</fn>
    </frame>
    <frame>
      <ip>0xAE5F4E</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>rgw::kafka::message_callback(rd_kafka_s*, rd_kafka_message_s const*, void*)</fn>
    </frame>
    <frame>
      <ip>0x5DE635E</ip>
      <obj>/usr/lib/x86_64-linux-gnu/librdkafka.so.1</obj>
    </frame>
    <frame>
      <ip>0x5E1CF79</ip>
      <obj>/usr/lib/x86_64-linux-gnu/librdkafka.so.1</obj>
    </frame>
    <frame>
      <ip>0x5DE1117</ip>
      <obj>/usr/lib/x86_64-linux-gnu/librdkafka.so.1</obj>
      <fn>rd_kafka_poll</fn>
    </frame>
    <frame>
      <ip>0x5DE4E8E</ip>
      <obj>/usr/lib/x86_64-linux-gnu/librdkafka.so.1</obj>
      <fn>rd_kafka_flush</fn>
    </frame>
    <frame>
      <ip>0xAE562B</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0xAE58EF</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0xAE5D77</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>rgw::kafka::shutdown()</fn>
    </frame>
    <frame>
      <ip>0x5C9E36</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>rgw::AppMain::shutdown(std::function&lt;void ()&gt;)</fn>
    </frame>
    <frame>
      <ip>0x58727D</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>main</fn>
    </frame>
  </stack>
  <auxwhat>Address 0x13d31cc60 is 16 bytes inside a block of size 40 free'd</auxwhat>
  <stack>
    <frame>
      <ip>0x484CD4F</ip>
      <obj>/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so</obj>
      <fn>operator delete[](void*, unsigned long)</fn>
    </frame>
    <frame>
      <ip>0x5D167C</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x9A1C3C</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x9CC69D</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>RGWRados::finalize()</fn>
    </frame>
    <frame>
      <ip>0x5C9C93</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>rgw::AppMain::shutdown(std::function&lt;void ()&gt;)</fn>
    </frame>
    <frame>
      <ip>0x58727D</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>main</fn>
    </frame>
  </stack>
  <auxwhat>Block was alloc'd at</auxwhat>
  <stack>
    <frame>
      <ip>0x484A2F3</ip>
      <obj>/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so</obj>
      <fn>operator new[](unsigned long)</fn>
    </frame>
    <frame>
      <ip>0x5E0389</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x9B0850</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x9B2EC2</ip>
      <obj>/usr/bin/radosgw</obj>
    </frame>
    <frame>
      <ip>0x9B2F34</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>void boost::context::detail::context_entry&lt;boost::context::detail::record&lt;boost::context::continuation, boost::context::basic_protected_fixedsize_stack&lt;boost::context::stack_traits&gt;, spawn::detail::spawn_helper&lt;boost::asio::executor_binder&lt;void (*)(), boost::asio::strand&lt;boost::asio::io_context::basic_executor_type&lt;std::allocator&lt;void&gt;, 0ul&gt; &gt; &gt;, rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RadosStore*, rgw::SiteConfig const&amp;)::{lambda(spawn::basic_yield_context&lt;boost::asio::executor_binder&lt;void (*)(), boost::asio::any_io_executor&gt; &gt;)#1}, boost::context::basic_protected_fixedsize_stack&lt;boost::context::stack_traits&gt; &gt;::operator()()::{lambda(boost::context::continuation&amp;&amp;)#1}&gt; &gt;(boost::context::detail::transfer_t)</fn>
    </frame>
    <frame>
      <ip>0x128C506</ip>
      <obj>/usr/bin/radosgw</obj>
      <fn>make_fcontext</fn>
    </frame>
  </stack>
</error>

from https://qa-proxy.ceph.com/teuthology/cbodley-2024-04-09_17:48:09-rgw:notifications-wip-yuval-63909-distro-default-smithi/7649345/remote/smithi148/log/valgrind/ceph.client.1.log.gz

Actions #5

Updated by Yuval Lifshitz 21 days ago

  • Tags set to notifications

the valgrind report indicates a crash during shutdown. when we shut down the kafka manager, we destroy all connections, and if any of these connections has pending requests, we flush them and wait (for some time) for the replies.
this is handled (among other things) in this PR: https://github.com/ceph/ceph/pull/56033
specifically: https://github.com/ceph/ceph/pull/56033/files#diff-925da66a25513280414fbc7daa5d76d7f6f5ea1d243163c3b53dd3052acb4b32L94
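
For context, the shutdown path in the valgrind stack bottoms out in librdkafka's flush, and the flush is what re-enters radosgw's delivery callbacks. A hedged sketch of that ordering (the rd_kafka_* calls are the real librdkafka C API; the wrapper function is hypothetical):

```cpp
#include <librdkafka/rdkafka.h>

// Hedged sketch: rd_kafka_flush() polls internally, which fires the
// delivery-report callback (rgw::kafka::message_callback in the trace
// above) for every still-pending message. Whatever state those callbacks
// capture -- here, completion handlers owned by the notify Manager --
// must still be alive, or each callback is a use-after-free.
void destroy_connection(rd_kafka_t* rk) {
  rd_kafka_flush(rk, 10 * 1000 /* timeout in ms */);
  // only after the flush (and all of its callbacks) completes is it safe
  // to release the handle
  rd_kafka_destroy(rk);
}
```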

however, this is a different issue than the realm reload crash.

Actions #6

Updated by Krunal Chheda 19 days ago

In our testing we are seeing the same crash; however, we do not see it during realm reload or shutdown.
It is just happening when a notification is being delivered to a deleted kafka endpoint.

rgw notify: WARNING: push entry marker: 0/823547 failed. error: -192 (will retry)

The logs are filled with retry messages as we have not set any ttl/max_retry.

#0  0x00007f72ed9ddf58 in pthread_getname_np () from /lib64/libpthread.so.0
#1  0x00007f72f105b928 in ceph::logging::Log::dump_recent() () from /usr/lib64/ceph/libceph-common.so.2
#2  0x0000557b127ab566 in handle_oneshot_fatal_signal(int) ()
#3  <signal handler called>
#4  0x0000557b121da5d8 in rgw::notify::Manager::tokens_waiter::token::~token() ()
#5  0x0000557b121e3a68 in rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#7}::operator()(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >) const ()
#6  0x0000557b121e3c85 in spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#7}, boost::context::basic_protected_fixedsize_stack<rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#7}::stack_traits> >::operator()()::{lambda(rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#7}::continuation&&)#1}::operator()(rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#7}::continuation) const ()
#7  0x0000557b121e3ec3 in void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#7}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t) ()
#8  0x0000557b12803a7f in make_fcontext ()
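
Frame #4 points at the destructor of an RAII token. A hypothetical reconstruction of that pattern (guessed from the symbol names, not the actual Ceph source) shows why the destructor is a natural crash site: it dereferences a shared waiter that may already have been destroyed along with the Manager:

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative-only sketch: names mirror the backtrace, code is guessed.
struct tokens_waiter {
  std::mutex m;
  std::condition_variable cv;
  int pending = 0;

  struct token {
    tokens_waiter& waiter;  // dangles if the waiter is destroyed first
    explicit token(tokens_waiter& w) : waiter(w) {
      std::lock_guard l{w.m};
      ++w.pending;
    }
    ~token() {  // frame #4 crashes here if `waiter` is freed memory
      std::lock_guard l{waiter.m};
      --waiter.pending;
      waiter.cv.notify_all();
    }
  };

  void wait_all() {  // shutdown should block here until all tokens return
    std::unique_lock l{m};
    cv.wait(l, [this] { return pending == 0; });
  }
};
```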

@Yuval Lifshitz do you think this is related to kafka destroying connections and flushing messages?

Actions #7

Updated by Krunal Chheda 19 days ago · Edited

@Yuval Lifshitz so the crash issue with kafka is all about conn->destroyed being called while publish_internal() might still be processing the connection?
is there any other race condition that could cause the crash?

Actions #8

Updated by Krunal Chheda 19 days ago

the crash during the realm reload is due to a connection being destroyed while it's in use.
we call `kafka::shutdown` during the realm reload, which destroys the connections asynchronously; if any events are being published via `publish_internal` at that time, and one of them refers to a connection that was erased, it can cause a crash.
so before clearing out the connections, we need to ensure all the messages in the queue have been processed by `publish_internal`, else we will always be susceptible to these crashes
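
A minimal sketch of that ordering (hypothetical names, not the actual fix): shutdown first stops intake, then waits for the queue to drain, and only then may the connections be erased:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

// Hypothetical sketch -- illustrative names, not the Ceph code.
class connection_queue {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::string> pending;  // messages publish_internal() will send
  bool stopped = false;

public:
  bool enqueue(std::string msg) {   // producers: publish()
    std::lock_guard l{m};
    if (stopped) return false;      // phase 1: shutdown rejects new work
    pending.push_back(std::move(msg));
    cv.notify_all();
    return true;
  }
  void mark_sent() {                // consumer: publish_internal(), per message
    std::lock_guard l{m};
    pending.pop_front();
    cv.notify_all();
  }
  void shutdown_and_drain() {
    std::unique_lock l{m};
    stopped = true;                 // 1. stop accepting new messages
    cv.wait(l, [this] { return pending.empty(); });  // 2. drain the queue
    // 3. only now is it safe to destroy/erase the connections:
    //    publish_internal() can no longer be holding one of them.
  }
};
```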

Actions #9

Updated by Casey Bodley 13 days ago

  • Status changed from New to Fix Under Review
  • Backport set to squid
  • Pull request ID set to 56979