Bug #41511

closed

civetweb threads using 100% of CPU

Added by Vladimir Brik almost 5 years ago. Updated over 4 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, the radosgw process on those machines starts consuming 100% of multiple CPU cores (I've seen between 1 and 5), seemingly indefinitely, even though the affected machine(s) are not being used for data transfers (nothing in the radosgw logs, a couple of KB/s of network traffic).

This situation can affect any of our rados gateways, but I haven't seen more than two affected concurrently (probably because radosgw processes have been restarted frequently recently).

I don't see anything obvious in the logs. perf top says the CPU is consumed by the radosgw shared object in the symbol get_obj_data::flush, which, if I interpret things correctly, is called from a symbol with a long name containing the substring "boost9intrusive9list_impl".

This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s ssl_certificate=/etc/ceph/rgw.crt error_log_file=/var/log/ceph/civetweb.error.log

(the error log file does not exist)

Here are the fruits of my attempts to capture the call graph using perf and gdbpmp:
https://icecube.wisc.edu/~vbrik/perf.data
https://icecube.wisc.edu/~vbrik/gdbpmp.data


Related issues 1 (0 open, 1 closed)

Is duplicate of rgw - Backport #39660: nautilus: rgw: Segfault during request processing (Resolved)
Actions #1

Updated by J. Eric Ivancich almost 5 years ago

I did some initial analysis of the profiling data that Vladimir provided. Of the 5078 threads, 5073 were essentially sleeping. The five active threads were in call stacks resulting from a GetObj request, specifically in one of:

  • get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
  • ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JEEEEUlRKT_RKT0_E_EEvSH

Note: get_obj_data::flush does call _ZN5boost9intrusive9list_impl...

Actions #2

Updated by J. Eric Ivancich almost 5 years ago

Vladimir, would you clear something up? In the description here you state that once a CPU is consumed 100% by this, it stays that way "seemingly indefinitely".

In your August 21, 2019 email you state: "This situation can affect any number of our rados gateways, lasts from few hours to few days and stops if radosgw process is restarted or on its own."

Can you explain more as to whether it self-resolves or not (i.e., only by restarting an rgw)?

Actions #3

Updated by Vladimir Brik almost 5 years ago

Eric Ivancich wrote:

Vladimir, would you clear something up? In the description here you state that once a CPU is consumed 100% by this, it stays that way "seemingly indefinitely".

In your August 21, 2019 email you state: "This situation can affect any number of our rados gateways, lasts from few hours to few days and stops if radosgw process is restarted or on its own."

Can you explain more as to whether it self-resolves or not (i.e., only by restarting an rgw)?

I can't find solid evidence to indicate that the problem ever self-resolved. Unfortunately, it's hard for me to be sure.

When I wrote the email I thought that in some instances the problem self-resolved. However, I reviewed our monitoring and it looks like what I stated in the email is probably incorrect. It looks like in the cases when I thought the problem self-resolved, either the affected machine was rebooted or radosgw service was restarted.

12 days is the longest stretch of time where radosgw thread(s) consumed 100% of CPU core(s) before the service was restarted or machine rebooted.

We only really started to use radosgw heavily in August, so there aren't a lot of data points for this, unfortunately.

Actions #4

Updated by J. Eric Ivancich almost 5 years ago

Vladimir Brik wrote:

Eric Ivancich wrote:

Can you explain more as to whether it self-resolves or not (i.e., only by restarting an rgw)?

I can't find solid evidence to indicate that the problem ever self-resolved. Unfortunately, it's hard for me to be sure.

When I wrote the email I thought that in some instances the problem self-resolved. However, I reviewed our monitoring and it looks like what I stated in the email is probably incorrect. It looks like in the cases when I thought the problem self-resolved, either the affected machine was rebooted or radosgw service was restarted.

cbodley thought it might be an infinite loop, and infinite loops tend not to self-resolve. It's now sounding as though his hunch is correct....

Actions #5

Updated by Casey Bodley almost 5 years ago

potentially related to https://tracker.ceph.com/issues/39660, which also involves get_obj_data::flush()

Actions #6

Updated by Arnaud Lefebvre over 4 years ago

Hello,

It seems we are also experiencing this issue, but with the beast frontend. We were using Ceph v13.2.x before upgrading to v14.2.1 and then v14.2.2 a few weeks ago. We didn't have any problems with the v13.2.x release line; we only started noticing this issue when load on the cluster increased.

Here's a gdb stack trace of where the process is, captured by attaching multiple times:

(gdb) bt
#0  0x000055ca3f3e9ff3 in get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&) ()
#1  0x000055ca3f3baacc in RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*) ()
#2  0x000055ca3f3bac84 in _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*) ()
#3  0x000055ca3f3cdab1 in RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*) ()
#4  0x000055ca3f3d1088 in RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*) ()
#5  0x000055ca3f35aefc in RGWGetObj::execute() ()
#6  0x000055ca3f12842b in rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool) ()
#7  0x000055ca3f12ae14 in process_request(RGWRados*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*) ()
#8  0x000055ca3f0888ba in void (anonymous namespace)::handle_connection<boost::asio::basic_stream_socket<boost::asio::ip::tcp> >(RGWProcessEnv&, boost::asio::basic_stream_socket<boost::asio::ip::tcp>&, boost::beast::basic_flat_buffer<std::allocator<char> >&, bool, ceph::async::SharedMutex<boost::asio::io_context::executor_type>&, rgw::dmclock::Scheduler*, boost::system::error_code&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) [clone .constprop.4035] ()
#9  0x000055ca3f089523 in void boost::coroutines::detail::trampoline_push_void<boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::executor_type> >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::{lambda(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)#3}>&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> > >(boost::context::detail::transfer_t) ()
#10 0x00007f97b77821ef in make_fcontext () from target:/usr/x86_64-pc-linux-gnu/lib/libboost_context.so.1.69.0
#11 0x000055ca3f758d10 in vtable for boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::executor_type> >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::{lambda(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)#3}>&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> > ()
#12 0x00007f9700000026 in ?? ()
#13 0x0000000000000000 in ?? ()

When strace'ing that process, nothing is logged, so it does not appear to be making any system calls.

We may be hitting the same crash as described in https://tracker.ceph.com/issues/39660, but I don't have a useful coredump with a stack trace yet; I'll try to get one.

Each time it happens we restart the radosgw process to resolve the issue.

Let me know if I can get you anything else useful.

Actions #7

Updated by Paul Emmerich over 4 years ago

I've also encountered this hang on a 14.2.2 radosgw; we are using beast, and only one core got stuck at 100% CPU load.

perf showed it was stuck in a loop at the beginning of get_obj_data::flush:

38: mov (%rdx),%rdx
    cmp %rdx,%rcx
    je 70
    mov -0x8(%rdx),%eax
    test %eax,%eax
    jns 38

Unfortunately I couldn't install debug symbols, and I had to restart the process to get it working again before I could debug this further. But it definitely looks like this issue.

Actions #8

Updated by Casey Bodley over 4 years ago

  • Is duplicate of Backport #39660: nautilus: rgw: Segfault during request processing added
Actions #9

Updated by Casey Bodley over 4 years ago

  • Status changed from New to Duplicate

This is due to a data race in get_obj_data(), and we have a fix tracked in https://tracker.ceph.com/issues/39660. Closing this as a duplicate.
