Bug #57782


[mon] high cpu usage by fn_monstore thread

Added by Deepika Upadhyay over 1 year ago. Updated about 1 month ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observed high CPU usage by the ms_dispatch and fn_monstore threads (99-100% in top). The Ceph deployment was with Rook.

We ran gdb and perf tooling to better understand the origin, and found:

   100.00%     0.00%  ms_dispatch  ceph-mon             [.] Monitor::handle_command
            |
            ---Monitor::handle_command
               PaxosService::dispatch
               OSDMonitor::prepare_update
               OSDMonitor::prepare_command
               OSDMonitor::prepare_command_impl
               OSDMonitor::prepare_new_pool
               CrushTester::test_with_fork
               |          
                --99.81%--__libc_close (inlined)
                          |          
                          |--33.42%--entry_SYSCALL_64_after_hwframe
                          |          do_syscall_64
                          |          |          
                          |          |--22.22%--syscall_enter_from_user_mode
                          |          |          
                          |          |--6.06%--__x64_sys_close
                          |          |          |          
                          |          |           --4.57%--close_fd
                          |          |                     |          
                          |          |                      --3.56%--pick_file
                          |          |                                |          
                          |          |                                 --2.72%--_raw_spin_lock
                          |          |                                           |          
                          |          |                                            --1.38%--preempt_count_add
                          |          |          
                          |          |--2.48%--syscall_exit_to_user_mode
                          |          |          |          
                          |          |           --2.11%--syscall_exit_to_user_mode_prepare
                          |          |                     |          
                          |          |                      --0.62%--__audit_syscall_exit
                          |          |          
                          |           --1.84%--syscall_trace_enter.constprop.0
                          |                     |          
                          |                      --0.76%--__audit_syscall_entry
                          |          
                          |--1.79%--__pthread_enable_asynccancel
                          |          
                           --1.24%--__pthread_disable_asynccancel

CrushTester ends up in __libc_close, which consumes 99% of the CPU, instead of completing the forked smoke test:

            ---Monitor::handle_command
               PaxosService::dispatch
               OSDMonitor::prepare_update
               OSDMonitor::prepare_command
               OSDMonitor::prepare_command_impl
               OSDMonitor::prepare_new_pool
               CrushTester::test_with_fork
               __libc_close (inlined)
               entry_SYSCALL_64_after_hwframe
               do_syscall_64
               syscall_enter_from_user_mode

From the mon logs, termination can be observed:

2022-09-15T14:29:06.155+0000 7fe451263700 10 -- [v2:10.6.125.48:3300/0,v1:10.6.125.48:6789/0] >> 10.6.137.218:0/260641845 conn(0x5638020d9400 msgr2=0x5637fa16f600 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 96 remaining bytes 0
2022-09-15T14:29:06.209+0000 7fe458271700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2022-09-15T14:29:06.209+0000 7fe458271700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***

There has been no assert failure or anything of the sort; adding it here to see if anyone has feedback or has hit this issue so far.


Related issues 2 (2 open, 0 closed)

Related to RADOS - Bug #46266: Monitor crashed in creating pool in CrushTester::test_with_fork() (Need More Info)

Related to RADOS - Feature #58168: extra debugs for: [mon] high cpu usage by fn_monstore thread (Pending Backport)

Actions #1

Updated by Radoslaw Zarzynski over 1 year ago

  • Related to Bug #46266: Monitor crashed in creating pool in CrushTester::test_with_fork() added
Actions #2

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Need More Info

It looks like we're burning CPU in close(2). The only call site I can spot is in write_data_set_to_csv. Let's analyze the path to it from test_with_fork.

int CrushTester::test_with_fork(int timeout)
{
  ostringstream sink;
  int r = fork_function(timeout, sink, [&]() {
      return test();
    });
  if (r == -ETIMEDOUT) {
    err << "timed out during smoke test (" << timeout << " seconds)";
  }
  return r;
}

int CrushTester::test()
{
  // ...
  for (int r = min_rule; r < crush.get_max_rules() && r <= max_rule; r++) {
    // ...
    for (int nr = min_rep; nr <= max_rep; nr++) {
      // ...
      if (output_csv)
        write_data_set_to_csv(output_data_file_name+rule_tag,tester_data);
    }

What if those loops' conditions are wrong? Unfortunately, this code entirely lacks debug output, so verifying this hypothesis is problematic. How about adding some debugs to the code?

Actions #3

Updated by Deepika Upadhyay over 1 year ago

Hey Radek,

Makes sense. I created a debug branch https://github.com/ceph/ceph-ci/pull/new/wip-crush-debug and might have packages available tomorrow. I can rerun in either the failing env or some teuthology test so that we can understand this better; will update soon.

Actions #4

Updated by Radoslaw Zarzynski over 1 year ago

NOT A FIX (extra debugs): https://github.com/ceph/ceph/pull/48513

Actions #5

Updated by Deepika Upadhyay over 1 year ago

  • Pull request ID set to 48513
Actions #6

Updated by Deepika Upadhyay over 1 year ago

  • Pull request ID deleted (48513)
Actions #7

Updated by Yuri Weinstein over 1 year ago

Radoslaw Zarzynski wrote:

NOT A FIX (extra debugs): https://github.com/ceph/ceph/pull/48513

merged

Actions #8

Updated by Radoslaw Zarzynski over 1 year ago

  • Related to Feature #58168: extra debugs for: [mon] high cpu usage by fn_monstore thread added
Actions #9

Updated by Lucian Petrut about 1 year ago

We hit the same issue while creating pools:

#0  0x00007fd77d91b5a0 in close@plt () from /usr/lib64/ceph/libceph-common.so.2
#1  0x00007fd77df6c26b in CrushTester::test_with_fork(int) () from /usr/lib64/ceph/libceph-common.so.2
#2  0x000055beecf7825a in OSDMonitor::prepare_new_pool(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned long, OSDMonitor::FastReadType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::ostream*) ()
    at /usr/src/debug/ceph-17.2.5-0.el8.x86_64/src/mon/OSDMonitor.cc:8058
#3  0x000055beecf97315 in OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&) () at /usr/src/debug/ceph-17.2.5-0.el8.x86_64/src/mon/OSDMonitor.cc:13100
#4  0x000055beecf9accd in OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>) () at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/atomic_base.h:377
#5  0x000055beecf9b31b in OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>) () at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/atomic_base.h:377
#6  0x000055beecf0ac0e in PaxosService::dispatch (this=0x55beef035000, op=...) at /usr/src/debug/ceph-17.2.5-0.el8.x86_64/src/mon/PaxosService.cc:96

The fd close loop hangs and takes 100% CPU. We were able to work around it by disabling the crush smoke test: mon_osd_crush_smoke_test = false.

In our case, it only occurs when running ceph-mon in a container on top of Flatcar; the same ceph-mon container runs fine on top of an Ubuntu host.
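One quick way to compare the two host environments is to check the fd limit the mon process actually sees inside each container, since any fork-time fd-cleanup loop is bounded by it. A minimal standalone check (nofile_soft_limit is an illustrative helper, not Ceph code):

```cpp
#include <sys/resource.h>
#include <cstdio>

// Return the soft RLIMIT_NOFILE of the current process, or 0 on error.
// A fork-time "close all inherited fds" loop is typically bounded by this
// value, so a limit in the billions makes that loop extremely expensive.
unsigned long long nofile_soft_limit() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
    return 0;
  return (unsigned long long)rl.rlim_cur;
}
```

Comparing the printed value in both containers (or simply `cat /proc/<pid>/limits`) would show whether the fd limit differs between the two hosts.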

Actions #10

Updated by Radoslaw Zarzynski about 1 year ago

In the upcoming v17.2.6 we'll have the extra debugs (see https://github.com/ceph/ceph/pull/50406). Would you mind retesting there? The alternative would be to make a custom build.

Actions #11

Updated by Peter Goron 11 months ago

Hello,

Reporting the same issue with a ceph-16.2.10 cluster deployed on top of Kubernetes (rook-1.9.12) when adding a new pool to an existing cluster.
The ceph-mon process spins at 100% CPU.

(gdb) thread apply all bt

Thread 1 (Thread 0x7fcc3f68f700 (LWP 2216616)):
#0  0x00007fcc4af16a77 in close () from target:/lib64/libpthread.so.0
#1  0x00007fcc4d799da4 in CrushTester::test_with_fork(int) () from target:/usr/lib64/ceph/libceph-common.so.2
#2  0x000055d49cf23dfc in OSDMonitor::prepare_new_pool(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned long, OSDMonitor::FastReadType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::ostream*) ()
#3  0x000055d49cf52403 in OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&) ()
#4  0x000055d49cf575d4 in OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>) ()
#5  0x000055d49cf5b8b3 in OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>) ()
#6  0x000055d49cedbecd in PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>) ()
#7  0x000055d49cdc1834 in Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>) ()
#8  0x000055d49cdc6300 in Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>) ()
#9  0x000055d49cdc7650 in Monitor::_ms_dispatch(Message*) ()
#10 0x000055d49cdf654c in Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
#11 0x00007fcc4d421baa in DispatchQueue::entry() () from target:/usr/lib64/ceph/libceph-common.so.2
#12 0x00007fcc4d4d3891 in DispatchQueue::DispatchThread::entry() () from target:/usr/lib64/ceph/libceph-common.so.2
#13 0x00007fcc4af0d17a in start_thread () from target:/lib64/libpthread.so.0
#14 0x00007fcc49c3bdc3 in clone () from target:/lib64/libc.so.6

strace extract:

strace: Process 2216616 attached
close(661150481)                        = -1 EBADF (Bad file descriptor)
close(661150482)                        = -1 EBADF (Bad file descriptor)
close(661150483)                        = -1 EBADF (Bad file descriptor)
close(661150484)                        = -1 EBADF (Bad file descriptor)
close(661150485)                        = -1 EBADF (Bad file descriptor)
close(661150486)                        = -1 EBADF (Bad file descriptor)
close(661150487)                        = -1 EBADF (Bad file descriptor)
close(661150488)                        = -1 EBADF (Bad file descriptor)
close(661150489)                        = -1 EBADF (Bad file descriptor)
close(661150490)                        = -1 EBADF (Bad file descriptor)
close(661150491)                        = -1 EBADF (Bad file descriptor)
close(661150492)                        = -1 EBADF (Bad file descriptor)
close(661150493)                        = -1 EBADF (Bad file descriptor)
close(661150494)                        = -1 EBADF (Bad file descriptor)
close(661150495)                        = -1 EBADF (Bad file descriptor)
close(661150496)                        = -1 EBADF (Bad file descriptor)
close(661150497)                        = -1 EBADF (Bad file descriptor)
...

I tried to reproduce the smoke test with crushtool -i /tmp/crushmap --test --min_x 0 --max_x 50 --rule 8 --show-mappings, without success so far.

https://github.com/ceph/ceph/blob/v16.2.10/src/mon/OSDMonitor.cc#L7986

  if (g_conf()->mon_osd_crush_smoke_test) {
    CrushWrapper newcrush;
    _get_pending_crush(newcrush);
    ostringstream err;
    CrushTester tester(newcrush, err);
    tester.set_min_x(0);
    tester.set_max_x(50);
    tester.set_rule(crush_rule);
    auto start = ceph::coarse_mono_clock::now();
    r = tester.test_with_fork(g_conf()->mon_lease);
    ...

I don't understand how the close syscall can be invoked, given that CrushTester CSV output is disabled by default and not enabled by the smoke test. I would bet on some kind of memory corruption.
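For what it's worth, the close(2) calls don't have to come from the CSV writer at all: the fork helper behind test_with_fork closes every inherited descriptor in the child before running the smoke test, and that cleanup loop is bounded by the process's fd limit. A simplified sketch of that pattern (an illustration, not the actual Ceph code; run_forked is a made-up name):

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>

// Run fn() in a forked child with all inherited fds (except stdio) closed,
// and return the child's exit status. The cleanup loop is bounded by
// sysconf(_SC_OPEN_MAX), i.e. RLIMIT_NOFILE: if the container runtime sets
// that to ~1e9, the child issues ~1e9 close() calls, almost all of which
// fail with EBADF -- matching the strace output above.
int run_forked(int (*fn)(), long maxfd = -1) {
  if (maxfd < 0)
    maxfd = sysconf(_SC_OPEN_MAX);
  pid_t pid = fork();
  if (pid < 0)
    return -errno;
  if (pid == 0) {
    for (long fd = 0; fd < maxfd; fd++) {
      if (fd == STDIN_FILENO || fd == STDOUT_FILENO || fd == STDERR_FILENO)
        continue;
      close(fd);  // returns EBADF for fds that were never open
    }
    _exit(fn());
  }
  int status = 0;
  waitpid(pid, &status, 0);
  return WEXITSTATUS(status);
}
```

With a sane fd limit this loop is cheap; with a limit in the billions the child spends essentially all its time in close(), which would explain 100% CPU without any memory corruption.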

Actions #12

Updated by Radoslaw Zarzynski 10 months ago

Would you mind trying with a fresh quincy (has some extra debugs)? Custom pacific build also might be a way.

Actions #13

Updated by Peter Goron 10 months ago

Hi Radoslaw,
I will try to reproduce first. We have disabled mon_osd_crush_smoke_test to create our pools.
If, after re-enabling mon_osd_crush_smoke_test, the crushmap is still in a state that triggers the problem, then I can try to do a custom build with your logs (for various reasons we are stuck on 16.2.10 for now).

It will take a bit of time because I will have to create a custom build chain (we need to rebuild RPMs and integrate non-official RPM sources into our Docker images).

Actions #14

Updated by Peter Goron about 1 month ago

Hi,

Thanks to this article https://blog.palark.com/sre-troubleshooting-ceph-systemd-containerd/, I think the root cause is clear.
In our case:
  • mon is running in a container in kube.
  • the newer containerd version allows a very high number of file descriptors.
  • the fork code tries to close all parent file descriptors (inlined, which is why it doesn't appear explicitly in the stack trace).

The author of the article is proposing a fix:
https://github.com/ceph/ceph/pull/50622/files

[root@kube-worker0033 ceph]# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph           1  1.9  1.4 1588672 979500 ?      Ssl  Mar12  32:21 ceph-mon --fsid=615c945a-4d04-4e06-95cc-7200ae7589b2 --keyring=/etc/ceph/keyring-store/keyring --log-to
root       88365  0.2  0.0  14268  3428 pts/0    Ss   16:08   0:00 bash
root       88387  0.0  0.0  46756  3436 pts/0    R+   16:08   0:00 ps aux

[root@kube-worker0033 ceph]# cat /proc/1/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1073741816           1073741816           files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       255812               255812               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
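For context, the general technique for avoiding a per-fd loop bounded by a limit like the one above is either close_range(2) (Linux >= 5.9) or walking /proc/self/fd, which lists only the descriptors that are actually open. A sketch of that technique (close_fds_from is a hypothetical helper; the actual PR may differ in detail):

```cpp
#include <dirent.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <fcntl.h>   // F_DUPFD, used in the usage example below
#include <cstdlib>

// Close every fd >= lowfd without iterating the whole RLIMIT_NOFILE range.
void close_fds_from(unsigned int lowfd) {
#ifdef SYS_close_range
  // close_range(2): one syscall no matter how high the fd limit is.
  if (syscall(SYS_close_range, lowfd, ~0U, 0) == 0)
    return;
#endif
  // Fallback: /proc/self/fd lists only fds that are actually open, so this
  // loop is proportional to the number of open fds, not to the rlimit.
  DIR *d = opendir("/proc/self/fd");
  if (!d)
    return;
  int skip = dirfd(d);  // don't close the directory stream's own fd
  struct dirent *de;
  while ((de = readdir(d)) != nullptr) {
    if (de->d_name[0] < '0' || de->d_name[0] > '9')
      continue;  // skips "." and ".."
    int fd = std::atoi(de->d_name);
    if (fd >= (int)lowfd && fd != skip)
      close(fd);
  }
  closedir(d);
}
```

For example, after `int hi = fcntl(1, F_DUPFD, 200);` a call to `close_fds_from(200)` closes `hi` while leaving stdout untouched. Either path keeps the child's fd cleanup cheap even with a 1073741816 fd limit.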
Actions #15

Updated by Radoslaw Zarzynski about 1 month ago

  • Status changed from Need More Info to Fix Under Review
  • Pull request ID set to 50622

The fix awaits QA.
