Bug #63373


multisite: Deadlock in RGWDeleteMultiObj with default rgw_multi_obj_del_max_aio > 1

Added by Abhishek Lekshmanan 6 months ago. Updated about 1 month ago.

Status:
Pending Backport
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
fifo backport_processed
Backport:
reef squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On running a multisite setup with Reef (18.2.0), we run into a situation where
the client-facing RGWs (the ones with rgw_run_sync_thread = false) deadlock
after running a warp S3 benchmark with around 10 clients. The deadlock is
reproducible even when rgw_run_sync_thread = true is set. On inspecting
one of the RGWs with eu-stack, we see that all of the RGW threads are blocked at
`RGWDataChangesLog::add_entry`; we mainly see the following two patterns in the active threads:

#0  0x00007f9ba0a7d82d __lll_lock_wait
#1  0x00007f9ba0a76ad9 __pthread_mutex_lock
#2  0x000055d9caf70e4b LazyFIFO::lazy_init(DoutPrefixProvider const*, optional_yield)
#3  0x000055d9caf79968 RGWDataChangesFIFO::push(DoutPrefixProvider const*, int, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long...(=)
#4  0x000055d9caf6c938 RGWDataChangesLog::add_entry(DoutPrefixProvider const*, RGWBucketInfo const&, rgw::bucket_log_layout_generation const&, int, optional_yield)
#5  0x000055d9cac8cba5 add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield)
#6  0x000055d9cac9da60 RGWRados::Bucket::UpdateIndex::complete(...)
#8  0x000055d9cacc4e67 RGWRados::Object::Write::write_meta(...)

And threads in this second state:

#0  0x00007f9ba0a7a45c pthread_cond_wait@@GLIBC_2.3.2
#1  0x00007f9b9ff35860 std::condition_variable::wait(std::unique_lock<std::mutex>&)
#2  0x000055d9caf6c323 RGWDataChangesLog::add_entry(DoutPrefixProvider const*, RGWBucketInfo const&, rgw::bucket_log_layout_generation const&, int, optional_yield)
#3  0x000055d9cac8cba5 add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield)
#4  0x000055d9cac9da60 RGWRados::Bucket::UpdateIndex::complete()

A more detailed analysis via GDB of the lazy_init threads shows:

#0  0x00007f9ba0a7d82d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007f9ba0a76ad9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x000055d9caf70e4b in __gthread_mutex_lock (__mutex=0x55d9cc496de8)
    at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/x86_64-redhat-linux/bits/gthr-default.h:749
#3  std::mutex::lock (this=0x55d9cc496de8) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/std_mutex.h:100
#4  std::unique_lock<std::mutex>::lock (this=0x7f9a54644ec0) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_lock.h:139
#5  std::unique_lock<std::mutex>::unique_lock (__m=..., this=0x7f9a54644ec0) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_lock.h:69
#6  LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d897c000, y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
#7  0x000055d9caf79968 in LazyFIFO::push (y=..., bl=..., dpp=0x55d9d897c000, this=0x55d9cc496dc0)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:305
#8  RGWDataChangesFIFO::push (this=0x55d9cd28d300, dpp=0x55d9d897c000, index=40, bl=..., y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_datalog.cc:286
#9  0x000055d9caf6c938 in RGWDataChangesLog::add_entry (this=0x55d9cc546200, dpp=0x55d9d897c000, bucket_info=..., gen=..., 
    shard_id=<optimized out>, y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:197
#10 0x000055d9cac8cba5 in add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield) ()
    at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/stl_iterator.h:1083

Inspecting the mutex we see

(gdb) f 6
#6  LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d897c000, y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
264    /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h: No such file or directory.
(gdb) p this
$77 = (LazyFIFO * const) 0x55d9cc496dc0
(gdb) p this->m
$78 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 286845, __nusers = 1, __kind = 0, __spins = 0, 
        __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\000\000\000\000}`\004\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}

Looking at the thread corresponding to the owner of the mutex

(gdb) thread 261
[Switching to thread 261 (Thread 0x7f9b16bde700 (LWP 286845))]
#0  0x00007f9ba0a7a45c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
(gdb) bt
#0  0x00007f9ba0a7a45c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x00007f9b9ff35860 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from target:/lib64/libstdc++.so.6
#2  0x000055d9caf6c323 in ceph::common::RefCountedCond::wait (this=0x55d9d044d400)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/common/RefCountedObj.h:111
#3  RGWDataChangesLog::add_entry (this=0x55d9cc546200, dpp=<optimized out>, bucket_info=..., gen=..., shard_id=<optimized out>, y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_datalog.cc:689
#4  0x000055d9cac8cba5 in add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield) ()
    at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/stl_iterator.h:1083
#5  0x000055d9cac9dcd8 in RGWRados::Bucket::UpdateIndex::complete_del(DoutPrefixProvider const*, long, unsigned long, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >&, std::__cxx11::list<rgw_obj_index_key, std::allocator<rgw_obj_index_key> >*, optional_yield) () at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_rados.cc:6374
#6  0x000055d9cacb041b in RGWRados::Object::Delete::delete_obj(optional_yield, DoutPrefixProvider const*) ()
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_rados.cc:5347
#7  0x000055d9cad18942 in rgw::sal::RadosObject::RadosDeleteOp::delete_obj (this=0x55d9d0b39a00, dpp=0x55d9d003a6c0, y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_sal_rados.cc:2273
#8  0x000055d9caa3f859 in RGWDeleteMultiObj::handle_individual_object(rgw_obj_key const&, optional_yield, boost::asio::basic_deadline_timer<boost::posix_time::ptime, boost::asio::time_traits<boost::posix_time::ptime>, boost::asio::executor>*) ()
    at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:421
#9  0x000055d9caa404f0 in void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, RGWDeleteMultiObj::execute(optional_yield)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#3}, boost::context::basic_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t) () at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/common/async/yield_context.h:40
#10 0x000055d9cb2860af in make_fcontext ()
#11 0x0000000000000000 in ?? ()

Randomly sampling the threads blocked in lazy_init, the owner of the mutex still appears to be thread 286845 in every case:

(gdb) thread 596
[Switching to thread 596 (Thread 0x7f9a6f28f700 (LWP 287180))]
#0  0x00007f9ba0a7d82d in __lll_lock_wait () from target:/lib64/libpthread.so.0
(gdb) f 6
#6  LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d31dab00, y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
264    /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h: No such file or directory.
(gdb) p this->m
$79 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 286845, __nusers = 1, __kind = 0, __spins = 0, 
        __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\000\000\000\000}`\004\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
(gdb) thread 595
[Switching to thread 595 (Thread 0x7f9a6fa90700 (LWP 287179))]
#0  0x00007f9ba0a7d82d in __lll_lock_wait () from target:/lib64/libpthread.so.0
(gdb) f 6
#6  LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d680c200, y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
264    in /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h
(gdb) p this->m
$80 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 286845, __nusers = 1, __kind = 0, __spins = 0, 
        __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\000\000\000\000}`\004\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
(gdb) thread 589
[Switching to thread 589 (Thread 0x7f9a72a96700 (LWP 287173))]
#0  0x00007f9ba0a7d82d in __lll_lock_wait () from target:/lib64/libpthread.so.0
(gdb) f 6 
#6  LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d4462200, y=...)
    at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
264    in /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h
(gdb) p this->m
$81 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 286845, __nusers = 1, __kind = 0, __spins = 0, 
        __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\000\000\000\000}`\004\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}

Is this by any chance fixed in main already? What steps can we take to help debug this problem further?


Related issues 3 (1 open, 2 closed)

Has duplicate rgw - Bug #63528: rgw recursive delete deadlock (Duplicate)

Copied to rgw - Backport #64500: reef: multisite: Deadlock in RGWDeleteMultiObj with default rgw_multi_obj_del_max_aio > 1 (In Progress, Casey Bodley)
Copied to rgw - Backport #64501: squid: multisite: Deadlock in RGWDeleteMultiObj with default rgw_multi_obj_del_max_aio > 1 (Resolved, Casey Bodley)
Actions #1

Updated by Abhishek Lekshmanan 6 months ago

  • Description updated (diff)
Actions #2

Updated by Casey Bodley 6 months ago

  • Assignee set to Adam Emerson
  • Tags set to fifo
Actions #3

Updated by Adam Emerson 6 months ago

How reliably reproducible is this? I'm not sure how add_entry can be holding the LazyFIFO mutex when the mutex is private to LazyFIFO and acquired in lazy_init with a unique_lock, unless there's some memory corruption or something strange happening.

Actions #4

Updated by Casey Bodley 6 months ago

  • Has duplicate Bug #63528: rgw recursive delete deadlock added
Actions #5

Updated by Gabriel Adrian Samfira 6 months ago

In our setup this is easily reproduced by just running:

s3cmd rm -r s3://some-large-bucket

We also have multisite-sync enabled with the cloud-sync module. But you can repro this even without multi-site enabled by running a few warp commands in parallel and attempting to remove a large bucket (several thousand objects).

With multisite-sync enabled, any bulk delete operation will deadlock. At least it does for us.

To work around this, we set rgw_multi_obj_del_max_aio = 1. Removing multiple objects one at a time works well:

s3cmd rm s3://some-large-bucket/*

Actions #6

Updated by Enrico Bocchi 6 months ago

We also suffer from the same deletion problem with large buckets reported above; the signature is identical.

Actions #7

Updated by Casey Bodley 6 months ago

Another report in https://tracker.ceph.com/issues/63528.

This seems to be specific to the RGWDeleteMultiObj op when rgw_multi_obj_del_max_aio > 1: we spawn several coroutines to perform concurrent deletes, and those are blocking in RGWDataChangesLog.
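
For illustration, here is a minimal, self-contained sketch of that hazard using plain Boost.Asio (not Ceph code; the names are illustrative): one stackful coroutine suspends while still holding a std::mutex, and a second coroutine scheduled on the same io_context thread then blocks the OS thread trying to take the same mutex, so the first coroutine can never be resumed to release it.

#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <chrono>
#include <iostream>
#include <mutex>

int main() {
  boost::asio::io_context io;
  std::mutex m;

  // Coroutine A: takes the lock, then suspends on an async wait while holding it.
  boost::asio::spawn(io, [&](boost::asio::yield_context yield) {
    std::unique_lock<std::mutex> lock(m);
    boost::asio::steady_timer t(io, std::chrono::milliseconds(10));
    t.async_wait(yield);                 // suspends; the mutex stays locked
    std::cout << "A resumed\n";          // never reached
  });

  // Coroutine B: runs on the same (only) thread and parks it on the mutex,
  // so the timer completion that would resume A can never be dispatched.
  boost::asio::spawn(io, [&](boost::asio::yield_context) {
    std::lock_guard<std::mutex> lock(m); // blocks the OS thread itself
    std::cout << "B got the lock\n";     // never reached
  });

  io.run();                              // hangs: mutex held across a yield
}

With rgw_multi_obj_del_max_aio = 1 there is only one such deleter coroutine per request, which may be why that workaround avoids the hang.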

Actions #8

Updated by Casey Bodley 6 months ago

  • Subject changed from Deadlock in RGW while request processing to multisite: Deadlock in RGWDeleteMultiObj with default rgw_multi_obj_del_max_aio > 1
Actions #9

Updated by Adam Emerson 6 months ago

  • Status changed from New to Fix Under Review
  • Backport set to reef
  • Pull request ID set to 54505
Actions #10

Updated by Enrico Bocchi 5 months ago

Just built Reef 18.2.1 with the patch on top, but it does not seem to fix the problem.
Looking further...

Actions #11

Updated by Enrico Bocchi 5 months ago

The pattern leading to the deadlock in 18.2.1 with the patch has not changed:

#0  0x00007ff85bf6d45c pthread_cond_wait@@GLIBC_2.3.2
#1  0x00007ff85b428870 std::condition_variable::wait(std::unique_lock<std::mutex>&)
#2  0x000055f3444ece0b RGWDataChangesLog::add_entry(DoutPrefixProvider const*, RGWBucketInfo const&, rgw::bucket_log_layout_generation const&, int, optional_yield)
#3  0x000055f344203565 add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield)
#4  0x000055f344214698 RGWRados::Bucket::UpdateIndex::complete_del(DoutPrefixProvider const*, long, unsigned long, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000
l> > >&, std::__cxx11::list<rgw_obj_index_key, std::allocator<rgw_obj_index_key> >*, optional_yield)
#5  0x000055f344226e4a RGWRados::Object::Delete::delete_obj(optional_yield, DoutPrefixProvider const*)
#6  0x000055f344290c02 rgw::sal::RadosObject::RadosDeleteOp::delete_obj(DoutPrefixProvider const*, optional_yield)
#7  0x000055f343fb04e9 RGWDeleteMultiObj::handle_individual_object(rgw_obj_key const&, optional_yield, boost::asio::basic_deadline_timer<boost::posix_time::ptime, boost::asio::time_traits<boost::posix_time::ptime>, boost::asio::executor>*)
#8  0x000055f343fb1180 void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, RGWDeleteMultiObj::execute(optional_yield)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#3}, boost::context::basic_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)
#9  0x000055f344806b5f make_fcontext

When hitting the deadlock, other operations leading to add_datalog_entry on the same shard will be blocked as well.
This seems to be the case for the ones requiring RGWRados::Bucket::UpdateIndex::complete or RGWRados::Bucket::UpdateIndex::complete_del, so primarily PUTs and DELETEs. Reads (GET, HEAD) are not impacted.

An example with PUT:

#0  0x00007ff85bf6d45c pthread_cond_wait@@GLIBC_2.3.2
#1  0x00007ff85b428870 std::condition_variable::wait(std::unique_lock<std::mutex>&)
#2  0x000055f3444ece0b RGWDataChangesLog::add_entry(DoutPrefixProvider const*, RGWBucketInfo const&, rgw::bucket_log_layout_generation const&, int, optional_yield)
#3  0x000055f344203565 add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield)
#4  0x000055f344214420 RGWRados::Bucket::UpdateIndex::complete(DoutPrefixProvider const*, long, unsigned long, unsigned long, unsigned long, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*, RGWObjCategory, std::__cxx11::list<rgw_obj_index_key, std::allocator<rgw_obj_index_key> >*, optional_yield, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, bool)
#5  0x000055f34423ab5e RGWRados::Object::Write::_do_write_meta(DoutPrefixProvider const*, unsigned long, unsigned long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, bool, bool, void*, optional_yield)
#6  0x000055f34423bba7 RGWRados::Object::Write::write_meta(DoutPrefixProvider const*, unsigned long, unsigned long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, optional_yield)
#7  0x000055f3445bf9a6 rgw::putobj::AtomicObjectProcessor::complete(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, rgw_zone_set*, bool*, optional_yield)
#8  0x000055f343fd3a6a RGWPutObj::execute(optional_yield)
#9  0x000055f343dbfe9c rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)
#10 0x000055f343dc35b4 process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)
#11 0x000055f343cff106 void (anonymous namespace)::handle_connection<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >(boost::asio::io_context&, RGWProcessEnv&, boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >&, rgw::basic_timeout_timer<ceph::coarse_mono_clock, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, (anonymous namespace)::Connection>&, unsigned long, boost::beast::flat_static_buffer<65536ul>&, bool, ceph::async::SharedMutex<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >&, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::system::error_code&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >) [clone .constprop.0]
#12 0x000055f343cffdd4 void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#4}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)
#13 0x000055f344806b5f make_fcontext

Actions #12

Updated by Adam Emerson 5 months ago

We could just get rid of LazyFIFO. It speeds up the startup time of radosgw-admin, but isn't required for any kind of correctness.

Alternatively we could switch from a mutex to a strand, perhaps.
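
For what it's worth, a minimal sketch of the mutex-to-strand idea (plain Boost.Asio, illustrative names only, not a proposed patch): work that touches the shared state runs on one strand, so it is serialized without ever parking an OS thread on a lock, and a suspended coroutine does not stop the others from making progress.

#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <chrono>
#include <iostream>

int main() {
  boost::asio::io_context io;
  auto strand = boost::asio::make_strand(io);   // serializes handler execution
  int shared_state = 0;                         // state that would otherwise need a mutex

  for (int i = 0; i < 4; ++i) {
    boost::asio::spawn(strand, [&, i](boost::asio::yield_context yield) {
      // Handlers on the same strand never run concurrently, so this update is
      // safe without a lock (as long as it does not span a suspension point).
      ++shared_state;
      boost::asio::steady_timer t(io, std::chrono::milliseconds(1));
      t.async_wait(yield);                      // suspend without holding any lock
      std::cout << "coroutine " << i << " done, shared_state=" << shared_state << "\n";
    });
  }

  io.run();                                     // all four coroutines complete
}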

Actions #13

Updated by Mark Kogan 5 months ago

@Abhishek Hi, I attempted to repro on 18.2.1 and main multisite environments using warp, without success, using the warp command below:

git branch -vv
* (HEAD detached at v18.2.1) ef353a10cdb 18.2.1

warp delete --host=127.0.0.1:8000 --access-key=b2345678901234567890 --secret-key=b234567890123456789012345678901234567890 --concurrent $(numactl -N 1 -- nproc) --obj.size 4K --objects 500000 --batch 50 --duration=10m
warp: Benchmark data written to "warp-delete-2023-12-10[101336]-tUaK.csv.zst"                                                                                                                

----------------------------------------
Operation: PUT
* Average: 12.63 MiB/s, 3312.01 obj/s

Throughput, split into 150 x 1s:
 * Fastest: 42.4MiB/s, 11111.44 obj/s
 * 50% Median: 10.1MiB/s, 2654.35 obj/s
 * Slowest: 6.9MiB/s, 1819.18 obj/s

----------------------------------------
Operation: DELETE
* Average: 2961.04 obj/s

Throughput, split into 167 x 1s:
 * Fastest: 3176.53 obj/s
 * 50% Median: 2967.04 obj/s
 * Slowest: 2544.90 obj/s
warp: Cleanup Done.%   

ls -b1tr ./out/radosgw.8000*.asok | xargs -i sudo ./bin/ceph --admin-daemon {} config show | grep -E  "rgw_multi_obj_del_max_aio" 
    "rgw_multi_obj_del_max_aio": "16",

To help with the repro attempt, please paste the warp commands used to repro in comment #5, so I can use the same `--concurrent` & `--batch`.
Also, comment #5 states: `running a few warp commands in parallel` - how many in parallel, and were they operating on the same bucket?

Thanks

Actions #14

Updated by Casey Bodley 5 months ago

  • Status changed from Fix Under Review to In Progress
  • Pull request ID deleted (54505)
Actions #15

Updated by Enrico Bocchi 5 months ago

Hi Mark,

Interestingly, `warp delete` does not seem to suffer from this issue -- I have reproduced what you reported above.
However, the deadlock is easily reproducible with `s3cmd` and `aws s3` when `rgw_multi_obj_del_max_aio` is set to anything greater than 1.

They both (s3cmd, aws s3) implement multiple deletions by submitting a POST request with the list of objects to be deleted as the body in XML format. I'll dig deeper looking for differences between s3cmd/awscli and warp, as warp also seems to be using POST with /?delete.

Actions #16

Updated by Mark Kogan 4 months ago

Hi Enrico, thanks for detailing the reproducer.

In my checks it DOES repro on 18.2.x and current main (74ad7ff7)
with and without https://github.com/ceph/ceph/pull/54505 -- rgw/multisite: Fix deadlock by not holding mutex over coroutine

Setup: primary: 1 client RGW (:8000) + 3 syncing RGWs; secondary: 3 syncing RGWs + 1 client RGW.
Workload: 100,000 x 4 KB object PUTs and deletes on 1 bucket, executed against the primary client RGW.

s3cmd delete is stuck re-trying:

❯ s3cmd rm s3://b01b-000000000000/folder01/ --recursive
WARNING: Retrying failed request: /?delete (timed out)
WARNING: Waiting 3 sec...
WARNING: Retrying failed request: /?delete (timed out)
WARNING: Waiting 6 sec...
WARNING: Retrying failed request: /?delete (timed out)
WARNING: Waiting 9 sec...
WARNING: Retrying failed request: /?delete (timed out)
WARNING: Waiting 12 sec...
WARNING: Retrying failed request: /?delete (timed out)
WARNING: Waiting 15 sec...
ERROR: S3 Temporary Error: Request failed for: /?delete.  Please try again later.

Log excerpt with debug_rgw=20 & debug_ms=1; of note: grep -E "multi_object_delete|rgw.obj_remove":

cat ./out/radosgw.8000.log | grep -E "multi_object_delete|rgw.obj_remove" 

2023-12-17T13:28:31.431+0000 7fff6d2f2640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete get_obj_state: setting s->obj_tag to 34020929-7cd8-4648-94cc-2841b675c82a.5874.1147639405654198131
2023-12-17T13:28:31.431+0000 7fff6d2f2640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete get_obj_state: rctx=0x55555fbc8310 obj=b01b-000000000000:folder01/stage01_000000000031 state=0x55555bbd51e8 s->prefetch_data=0
2023-12-17T13:28:31.431+0000 7fff6d2f2640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete get_obj_state: rctx=0x55555fbc8310 obj=b01b-000000000000:folder01/stage01_000000000031 state=0x55555bbd51e8 s->prefetch_data=0
2023-12-17T13:28:31.431+0000 7fff6d2f2640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete  bucket index object: us-east.rgw.buckets.index:.dir.34020929-7cd8-4648-94cc-2841b675c82a.5878.1.10
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303357 11.26 11:65d34d1a:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000017:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555a715400 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303358 11.41 11:83939257:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000016:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x5555610a7c00 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303359 11.40 11:03de76b7:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000018:head [cmpxattr user.rgw.idtag (62) op 1 mode 1 in=76b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555d95e800 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303360 11.71 11:8f545594:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000019:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555a599000 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303361 11.5b 11:dacb4110:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000021:head [cmpxattr user.rgw.idtag (62) op 1 mode 1 in=76b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555df88400 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303362 11.75 11:af001210:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000020:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555a782400 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303363 11.7 11:e16bc8e0:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000022:head [cmpxattr user.rgw.idtag (62) op 1 mode 1 in=76b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555e18b400 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303364 11.37 11:ec399a60:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000023:head [cmpxattr user.rgw.idtag (62) op 1 mode 1 in=76b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555a715800 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303365 11.5e 11:7a983dbf:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000024:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555ba84400 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303366 11.21 11:841edaff:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000025:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x5555611cf800 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff6d2f2640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303367 11.41 11:82368580:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000027:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555a6cbc00 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff8f336640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303368 11.74 11:2ffbd600:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000026:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555dcab800 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff8f336640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303369 11.1 11:80ed7f99:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000030:head [cmpxattr user.rgw.idtag (62) op 1 mode 1 in=76b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555ad63800 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff8f336640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303370 11.d 11:b1a4a109:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000028:head [cmpxattr user.rgw.idtag (63) op 1 mode 1 in=77b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x555560a8a400 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff8f336640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303371 11.67 11:e7b8b0c6:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000031:head [cmpxattr user.rgw.idtag (62) op 1 mode 1 in=76b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555fca4000 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff8f336640  1 -- 172.21.5.103:0/1187303167 --> [v2:172.21.5.103:6802/528533,v1:172.21.5.103:6803/528533] -- osd_op(unknown.0.0:303372 11.2a 11:55cd6689:::34020929-7cd8-4648-94cc-2841b675c82a.5878.1_folder01%2fstage01_000000000029:head [cmpxattr user.rgw.idtag (62) op 1 mode 1 in=76b,call rgw.obj_remove in=40b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41) v8 -- 0x55555a714c00 con 0x555557ec5c00
2023-12-17T13:28:31.431+0000 7fff8f336640 10 req 11721858080688368411 0.005000011s s3:multi_object_delete cache get: name=us-east.rgw.log++bucket.sync-source-hints.b01b-000000000000 : hit (negative entry)
2023-12-17T13:28:31.431+0000 7fff8f336640 10 req 11721858080688368411 0.005000011s s3:multi_object_delete cache get: name=us-east.rgw.log++bucket.sync-target-hints.b01b-000000000000 : hit (negative entry)
2023-12-17T13:28:31.431+0000 7fff8f336640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete reflect(): flow manager (bucket=b01b-000000000000:34020929-7cd8-4648-94cc-2841b675c82a.5878.1): adding source pipe: {s={b=b01b-000000000000:34020929-7cd8-4648-94cc-2841b675c82a.5878.1,z=e4becb01-7ad3-494d-92c1-d53fc87d53ff,az=0},d={b=b01b-000000000000:34020929-7cd8-4648-94cc-2841b675c82a.5878.1,z=34020929-7cd8-4648-94cc-2841b675c82a,az=0}}
2023-12-17T13:28:31.431+0000 7fff8f336640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete reflect(): flow manager (bucket=b01b-000000000000:34020929-7cd8-4648-94cc-2841b675c82a.5878.1): adding dest pipe: {s={b=b01b-000000000000:34020929-7cd8-4648-94cc-2841b675c82a.5878.1,z=34020929-7cd8-4648-94cc-2841b675c82a,az=0},d={b=b01b-000000000000:34020929-7cd8-4648-94cc-2841b675c82a.5878.1,z=e4becb01-7ad3-494d-92c1-d53fc87d53ff,az=0}}
2023-12-17T13:28:31.431+0000 7fff8f336640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete reflect(): flow manager (bucket=): adding source pipe: {s={b=*,z=e4becb01-7ad3-494d-92c1-d53fc87d53ff,az=0},d={b=*,z=34020929-7cd8-4648-94cc-2841b675c82a,az=0}}
2023-12-17T13:28:31.431+0000 7fff8f336640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete reflect(): flow manager (bucket=): adding dest pipe: {s={b=*,z=34020929-7cd8-4648-94cc-2841b675c82a,az=0},d={b=*,z=e4becb01-7ad3-494d-92c1-d53fc87d53ff,az=0}}
2023-12-17T13:28:31.431+0000 7fff8f336640 10 req 11721858080688368411 0.005000011s s3:multi_object_delete chain_cache_entry: cache_locator=
2023-12-17T13:28:31.431+0000 7fff8f336640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete chain_cache_entry: couldn't find cache locator
2023-12-17T13:28:31.431+0000 7fff8f336640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete couldn't put bucket_sync_policy cache entry, might have raced with data changes
2023-12-17T13:28:31.431+0000 7fff8f336640 20 req 11721858080688368411 0.005000011s s3:multi_object_delete RGWDataChangesLog::add_entry() bucket.name=b01b-000000000000 shard_id=0 now=2023-12-17T13:28:31.432844+0000 cur_expiration=2023-12-17T13:22:54.300039+0000

Also noting that the multisite sync is stuck.

Actions #17

Updated by Enrico Bocchi 4 months ago

Hi Mark,


In my checks it DOES repro on 18.2.x and current main (74ad7ff7)
with and without https://github.com/ceph/ceph/pull/54505 -- rgw/multisite: Fix deadlock by not holding mutex over coroutine

This matches precisely our observations.

If you check the usage field of the bucket stats (watch "radosgw-admin bucket stats --bucket=b01b-000000000000 | jq -r .usage"), you will notice some progress on deletions is made immediately after the s3cmd rm --recursive is issued (it deletes a handful of objects). Then everything locks up, including the MS sync.
Subsequent retries after s3cmd times out won't bring any progress on deletions.
To bring things back, one has to restart all the RGWs -- both syncing and client.

The delete issue has been reported at https://tracker.ceph.com/issues/63528, which has then been marked as duplicate and folded here.
The workaround with rgw_multi_obj_del_max_aio = 1 still holds.

Actions #18

Updated by Adam Emerson 4 months ago

Actions #19

Updated by Mark Kogan 4 months ago

@Adam DC949 Emerson
still reproduces with the new version of https://github.com/ceph/ceph/pull/54505
(for reference commit 8fa844383f9c22e758f39ecdda74f70de054ad68)

At the time of writing, the most reliable repro flow (still not 100% reliable yet) is:

1/ vstart.sh ... -o rgw_multi_obj_del_max_aio=128
2/ hsbench put 100K objects to primary:

    ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 100000 -m cxip -bp b01b- -op 'folder01/stage01_'

3/ WHILE SYNCING: s3cmd delete some successfully:
   s3cmd rm --recursive s3://b01b-000000000000/folder01/
    ...
    delete: 's3://b01b-000000000000/folder01/stage01_000000007996'
    delete: 's3://b01b-000000000000/folder01/stage01_000000007997'
    delete: 's3://b01b-000000000000/folder01/stage01_000000007998'
    delete: 's3://b01b-000000000000/folder01/stage01_000000007999'

<ctrl-c> after about 3 seconds
4/ wait until caught up on secondary
   ./bin/radosgw-admin sync status ; ./bin/radosgw-admin bucket sync status --bucket=test-100m-1000000000000

5/ s3cmd to delete some more will get stuck (and sync status will also not converge to caught up)
   s3cmd rm --recursive s3://b01b-000000000000/folder01/
    WARNING: Retrying failed request: /?delete (timed out)
    WARNING: Waiting 3 sec...
    WARNING: Retrying failed request: /?delete (IncompleteRead(1693 bytes read))
    WARNING: Waiting 6 sec...
    ERROR: [Errno 111] Connection refused

Actions #20

Updated by Mark Kogan 4 months ago

Update: I have git-bisected the issue - it begins reproducing with this PR:
https://github.com/ceph/ceph/pull/49741 -- rgw: Remove (most) uses of null_yield in datalog #49741
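
For context, a rough non-Ceph analogue of what such a change implies (assuming optional_yield roughly means "suspend the calling coroutine if a yield context is present, otherwise block the thread"): forcing null_yield keeps the call synchronous, while passing a real yield context turns it into a suspension point - and a suspension point reached while a std::mutex is held is exactly the pattern in the stacks above.

#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <chrono>
#include <iostream>
#include <optional>
#include <thread>

// Hypothetical stand-in for the optional_yield idea: either block the calling
// thread, or suspend the calling coroutine, depending on whether a yield
// context was supplied.
void wait_a_bit(boost::asio::io_context& io,
                std::optional<boost::asio::yield_context> y) {
  if (y) {
    boost::asio::steady_timer t(io, std::chrono::milliseconds(5));
    t.async_wait(*y);                                           // suspends the coroutine
  } else {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));  // blocks the thread
  }
}

int main() {
  boost::asio::io_context io;
  wait_a_bit(io, std::nullopt);      // "null_yield"-style call: plain blocking
  boost::asio::spawn(io, [&](boost::asio::yield_context yield) {
    wait_a_bit(io, yield);           // yielding call: other handlers may run meanwhile
    std::cout << "coroutine done\n";
  });
  io.run();
}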

Actions #21

Updated by Christian Rohmann 3 months ago

We run a multisite replicated setup with 3 RGWs on each side and just started having issues after the upgrade from Quincy to Reef.
One RGW stopped serving regular traffic. After restarting it, we observed the same issue on another one.

The journal showed log entries like the following (with different buckets in each line):

Jan 23 13:52:48 server-02 systemd[1]: Started Ceph rados gateway.
Jan 23 13:59:57 server-02 radosgw[2764335]: 2024-01-23T13:59:57.565+0000 7ff779469640 -1 req 14897416852809496998 0.016000021s :get_bucket_index_log_status ERROR: rgw_read_bucket_full_sync_status() on pipe={s={somebucketname,z=401ad44b-67fb-4693-9412-bd6a4ddeafdb,az=0},d={somebucketname,z=06f9b7c7-6326-4a41-9115-d4d092cf74ce,az=0}} returned ret=-2
Jan 23 14:19:56 server-02 radosgw[2764335]: 2024-01-23T14:19:56.889+0000 7ff7d351d640 -1 req 2217699481258830340 0.023999969s :get_bucket_index_log_status ERROR: rgw_read_bucket_full_sync_status() on pipe={s={somebucketname,z=401ad44b-67fb-4693-9412-bd6a4ddeafdb,az=0},d={somebucketname,z=06f9b7c7-6326-4a41-9115-d4d092cf74ce,az=0}} returned ret=-2
Jan 23 14:39:56 server-02 radosgw[2764335]: 2024-01-23T14:39:56.661+0000 7ff76f455640 -1 req 14570640612393349190 0.016000040s :get_bucket_index_log_status ERROR: rgw_read_bucket_full_sync_status() on pipe={s={somebucketname,z=401ad44b-67fb-4693-9412-bd6a4ddeafdb,az=0},d={somebucketname,z=06f9b7c7-6326-4a41-9115-d4d092cf74ce,az=0}} returned ret=-2
Jan 23 14:59:57 server-02 radosgw[2764335]: 2024-01-23T14:59:57.346+0000 7ff76d451640 -1 req 1736920273144919839 0.028000010s :get_bucket_index_log_status ERROR: rgw_read_bucket_full_sync_status() on pipe={s={somebucketname,z=401ad44b-67fb-4693-9412-bd6a4ddeafdb,az=0},d={somebucketname,z=06f9b7c7-6326-4a41-9115-d4d092cf74ce,az=0}} returned ret=-2
Jan 23 14:59:57 server-02 radosgw[2764335]: 2024-01-23T14:59:57.510+0000 7ff7aa4cb640 -1 req 2256559195600881305 0.008000002s :get_bucket_index_log_status ERROR: rgw_read_bucket_full_sync_status() on pipe={s={somebucketname,z=401ad44b-67fb-4693-9412-bd6a4ddeafdb,az=0},d={somebucketname,z=06f9b7c7-6326-4a41-9115-d4d092cf74ce,az=0}} returned ret=-2
Jan 23 14:59:57 server-02 radosgw[2764335]: 2024-01-23T14:59:57.658+0000 7ff828dc8640 -1 req 2901937952550983620 0.020000005s :get_bucket_index_log_status ERROR: rgw_read_bucket_full_sync_status() on pipe={s={somebucketname,z=401ad44b-67fb-4693-9412-bd6a4ddeafdb,az=0},d={somebucketname,z=06f9b7c7-6326-4a41-9115-d4d092cf74ce,az=0}} returned ret=-2
Jan 23 15:19:57 server-02 radosgw[2764335]: 2024-01-23T15:19:57.431+0000 7ff75341d640 -1 req 17380475824622782515 0.012000011s :get_bucket_index_log_status ERROR: rgw_read_bucket_full_sync_status() on pipe={s={somebucketname,z=401ad44b-67fb-4693-9412-bd6a4ddeafdb,az=0},d={somebucketname,z=06f9b7c7-6326-4a41-9115-d4d092cf74ce,az=0}} returned ret=-2
Jan 23 15:19:57 server-02 radosgw[2764335]: 2024-01-23T15:19:57.511+0000 7ff777465640 -1 req 17877785926432276237 0.028000029s :get_bucket_index_log_status ERROR: rgw_read_bucket_full_sync_status() on pipe={s={somebucketname,z=401ad44b-67fb-4693-9412-bd6a4ddeafdb,az=0},d={somebucketname,z=06f9b7c7-6326-4a41-9115-d4d092cf74ce,az=0}} returned ret=-2

There was no more activity in ceph-client.rgw.*.log.

After cranking up the log level on this broken RGW instance, I saw:

[...]
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 run: stack=0x56471e95cf00 is io blocked
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x5647203a48c0:op=0x56471b611100:20RGWSimpleRadosLockCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x5647203a48c0:op=0x56471b611100:20RGWSimpleRadosLockCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x5647203a48c0:op=0x564723d68700:20RGWContinuousLeaseCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 cr:s=0x5647203a48c0:op=0x564723d68700:20RGWContinuousLeaseCR: couldn't lock fra-obst-az1.rgw.log:datalog.sync-status.shard.401ad44b-67fb-4693-9412-bd6a4ddeafdb.48:sync_lock: retcode=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x5647203a48c0:op=0x564723d68700:20RGWContinuousLeaseCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 15 stack 0x5647203a48c0 end
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: stack->operate() returned ret=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 run: stack=0x5647203a48c0 is done
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x564724b11e00:op=0x56473d1dfc00:20RGWSimpleRadosLockCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x564724b11e00:op=0x56473d1dfc00:20RGWSimpleRadosLockCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3ea00:op=0x56473f08c000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x564724b11e00:op=0x564723e2f100:20RGWContinuousLeaseCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 cr:s=0x564724b11e00:op=0x564723e2f100:20RGWContinuousLeaseCR: couldn't lock fra-obst-az1.rgw.log:datalog.sync-status.shard.401ad44b-67fb-4693-9412-bd6a4ddeafdb.61:sync_lock: retcode=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x564724b11e00:op=0x564723e2f100:20RGWContinuousLeaseCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 15 stack 0x564724b11e00 end
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: stack->operate() returned ret=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 run: stack=0x564724b11e00 is done
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3ea00:op=0x56473f08c000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 collect(): s=0x56471ec3ea00 stack=0x5647203a48c0 encountered error (r=-16), skipping next stacks
2024-01-23T16:18:39.483+0000 7fd1b9155640 10 collect() returned ret=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3fa40:op=0x564727a5c000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3ea00:op=0x56473f08c000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3ea00:op=0x56473f08c000:18RGWDataSyncShardCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3fa40:op=0x564727a5c000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 collect(): s=0x56471ec3fa40 stack=0x564724b11e00 encountered error (r=-16), skipping next stacks
2024-01-23T16:18:39.483+0000 7fd1b9155640 10 collect() returned ret=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3ea00:op=0x56471ec42a00:25RGWDataSyncShardControlCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 run: stack=0x56471ec3ea00 is io blocked
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3fa40:op=0x564727a5c000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3fa40:op=0x564727a5c000:18RGWDataSyncShardCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56472270e640:op=0x56473e64fc00:20RGWSimpleRadosLockCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56472270e640:op=0x56473e64fc00:20RGWSimpleRadosLockCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3fa40:op=0x56471ec54700:25RGWDataSyncShardControlCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 run: stack=0x56471ec3fa40 is io blocked
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56472270e640:op=0x564720dd1500:20RGWContinuousLeaseCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 cr:s=0x56472270e640:op=0x564720dd1500:20RGWContinuousLeaseCR: couldn't lock fra-obst-az1.rgw.log:datalog.sync-status.shard.401ad44b-67fb-4693-9412-bd6a4ddeafdb.57:sync_lock: retcode=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56472270e640:op=0x564720dd1500:20RGWContinuousLeaseCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 15 stack 0x56472270e640 end
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: stack->operate() returned ret=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 run: stack=0x56472270e640 is done
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3f540:op=0x564720d10000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3f540:op=0x564720d10000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 collect(): s=0x56471ec3f540 stack=0x56472270e640 encountered error (r=-16), skipping next stacks
2024-01-23T16:18:39.483+0000 7fd1b9155640 10 collect() returned ret=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3f540:op=0x564720d10000:18RGWDataSyncShardCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3f540:op=0x564720d10000:18RGWDataSyncShardCR: operate() returned r=-16
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 rgw rados thread: cr:s=0x56471ec3f540:op=0x56471ec4ca00:25RGWDataSyncShardControlCR: operate()
2024-01-23T16:18:39.483+0000 7fd1b9155640 20 run: stack=0x56471ec3f540 is io blocked
[...]

I then used gdb to get at least a backtrace of the thread that apparently was the only one "working":
(If I should have done anything else / more, please kindly let me know).

[Switching to thread 606 (Thread 0x7fd0ad565640 (LWP 3470489))]
#0  0x00007fd1dde91117 in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fd1dde91117 in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd1dde93a41 in pthread_cond_wait () from target:/lib/x86_64-linux-gnu/libc.so.6
#2  0x00005647189ea36b in RGWDataChangesLog::add_entry(DoutPrefixProvider const*, RGWBucketInfo const&, rgw::bucket_log_layout_generation const&, int, optional_yield) ()
#3  0x00005647187f18a9 in add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield) ()
#4  0x000056471882150d in RGWRados::Bucket::UpdateIndex::complete(DoutPrefixProvider const*, long, unsigned long, unsigned long, unsigned long, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*, RGWObjCategory, std::__cxx11::list<rgw_obj_index_key, std::allocator<rgw_obj_index_key> >*, optional_yield, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, bool) ()
#5  0x00005647188147ad in RGWRados::Object::Write::_do_write_meta(DoutPrefixProvider const*, unsigned long, unsigned long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, bool, bool, void*, optional_yield) ()
#6  0x000056471881590f in RGWRados::Object::Write::write_meta(DoutPrefixProvider const*, unsigned long, unsigned long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, optional_yield) ()
#7  0x00005647187e9e77 in rgw::putobj::AtomicObjectProcessor::complete(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, rgw_zone_set*, bool*, optional_yield) ()
#8  0x000056471865a041 in RGWPutObj::execute(optional_yield) ()
#9  0x000056471850b71c in rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool) ()
#10 0x000056471850fcd2 in process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*) ()
#11 0x000056471844b5db in ?? ()
#12 0x0000564718474ce9 in ?? ()
#13 0x0000564718f4ff5f in make_fcontext ()
#14 0x0000000000000000 in ?? ()

If I may ask ...

0) Am I hitting this very bug here?
1) I suppose this debug info is already known? Or could I provide anything that would help with analyzing / fixing the issue?
2) Is there any workaround available for clusters that are already running Reef?
3) If not, should this not lead to an announcement informing users that multisite sync is broken and recommending that they hold off on upgrading?

Actions #22

Updated by Christian Rohmann 3 months ago

Enrico Bocchi wrote:

The delete issue has been reported at https://tracker.ceph.com/issues/63528, which has then been marked as duplicate and folded here.
The workaround with rgw_multi_obj_del_max_aio = 1 still holds.

Sorry, I lost track a little of the current state of this issue and of the fact that the workaround (^^) still seems to be valid / working.
To add yet another data point: we applied this config and things no longer seem to freeze up.

Actions #23

Updated by Christian Rohmann 3 months ago

Could this issue be tracked and considered for 18.2.2 - https://tracker.ceph.com/versions/678 ?

Actions #24

Updated by Casey Bodley 3 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 55522

I've prepared a partial revert of https://github.com/ceph/ceph/pull/49741 in https://github.com/ceph/ceph/pull/55522, along with a multisite test case that was able to reproduce the deadlock reliably.

Actions #25

Updated by Casey Bodley 2 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Assignee changed from Adam Emerson to Casey Bodley
  • Backport changed from reef to reef squid
Actions #26

Updated by Backport Bot 2 months ago

  • Copied to Backport #64500: reef: multisite: Deadlock in RGWDeleteMultiObj with default rgw_multi_obj_del_max_aio > 1 added
Actions #27

Updated by Backport Bot 2 months ago

  • Copied to Backport #64501: squid: multisite: Deadlock in RGWDeleteMultiObj with default rgw_multi_obj_del_max_aio > 1 added
Actions #28

Updated by Backport Bot 2 months ago

  • Tags changed from fifo to fifo backport_processed
Actions #29

Updated by Christian Rohmann about 1 month ago

Does it make sense to also add something to the release notes for folks (like me) who set

rgw_multi_obj_del_max_aio=1

to mitigate the issue? It would likely also make sense to recommend removing this setting again.
