Project

General

Profile

Bug #57782

[mon] high cpu usage by fn_monstore thread

Added by Deepika Upadhyay 4 months ago. Updated about 2 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observed high cpu usage by ms_dispatch and fn_monstore thread (amounting to 100-99% in top) Ceph [ deployment was with rook ]

We ran gdb and perf tooling to understand better what might be the origin and found to be :

   100.00%     0.00%  ms_dispatch  ceph-mon             [.] Monitor::handle_command
            |
            ---Monitor::handle_command
               PaxosService::dispatch
               OSDMonitor::prepare_update
               OSDMonitor::prepare_command
               OSDMonitor::prepare_command_impl
               OSDMonitor::prepare_new_pool
               CrushTester::test_with_fork
               |          
                --99.81%--__libc_close (inlined)
                          |          
                          |--33.42%--entry_SYSCALL_64_after_hwframe
                          |          do_syscall_64
                          |          |          
                          |          |--22.22%--syscall_enter_from_user_mode
                          |          |          
                          |          |--6.06%--__x64_sys_close
                          |          |          |          
                          |          |           --4.57%--close_fd
                          |          |                     |          
                          |          |                      --3.56%--pick_file
                          |          |                                |          
                          |          |                                 --2.72%--_raw_spin_lock
                          |          |                                           |          
                          |          |                                            --1.38%--preempt_count_add
                          |          |          
                          |          |--2.48%--syscall_exit_to_user_mode
                          |          |          |          
                          |          |           --2.11%--syscall_exit_to_user_mode_prepare
                          |          |                     |          
                          |          |                      --0.62%--__audit_syscall_exit
                          |          |          
                          |           --1.84%--syscall_trace_enter.constprop.0
                          |                     |          
                          |                      --0.76%--__audit_syscall_entry
                          |          
                          |--1.79%--__pthread_enable_asynccancel
                          |          
                           --1.24%--__pthread_disable_asynccancel

CrushTester leading to libc_close consuming 99% of the cpu insteading of
creating a fork:

            ---Monitor::handle_command
               PaxosService::dispatch
               OSDMonitor::prepare_update
               OSDMonitor::prepare_command
               OSDMonitor::prepare_command_impl
               OSDMonitor::prepare_new_pool
               CrushTester::test_with_fork
               __libc_close (inlined)
               entry_SYSCALL_64_after_hwframe
               do_syscall_64
               syscall_enter_from_user_mode

from mon logs termination can be observed

2022-09-15T14:29:06.155+0000 7fe451263700 10 -- [v2:10.6.125.48:3300/0,v1:10.6.125.48:6789/0] >> 10.6.137.218:0/260641845 conn(0x5638020d9400 msgr2=0x5637fa16f600 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 96 remaining bytes 0
2022-09-15T14:29:06.209+0000 7fe458271700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2022-09-15T14:29:06.209+0000 7fe458271700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***

there has been no assert failure or anything of sorts, adding it here to see if anyone has any feedback/hit this issue so far.


Related issues

Related to RADOS - Bug #46266: Monitor crashed in creating pool in CrushTester::test_with_fork() Need More Info
Related to RADOS - Feature #58168: extra debugs for: [mon] high cpu usage by fn_monstore thread Pending Backport

History

#1 Updated by Radoslaw Zarzynski 4 months ago

  • Related to Bug #46266: Monitor crashed in creating pool in CrushTester::test_with_fork() added

#2 Updated by Radoslaw Zarzynski 4 months ago

  • Status changed from New to Need More Info

It looks we're burning CPU in close(2). The single call site I can spot is in write_data_set_to_csv. Let's analyze the path to it from test_with_fork.

int CrushTester::test_with_fork(int timeout)
{
  ostringstream sink;
  int r = fork_function(timeout, sink, [&]() {
      return test();
    });
  if (r == -ETIMEDOUT) {
    err << "timed out during smoke test (" << timeout << " seconds)";
  }
  return r;
}

int CrushTester::test()
{
  // ...
  for (int r = min_rule; r < crush.get_max_rules() && r <= max_rule; r++) {
    // ...
    for (int nr = min_rep; nr <= max_rep; nr++) {
      // ...
      if (output_csv)
        write_data_set_to_csv(output_data_file_name+rule_tag,tester_data);
    }

What those loops' conditions are wrong? Unfortunately, this code entirely lacks debugs, so verifying this hypothesis is problematic. How about putting them to code?

#3 Updated by Deepika Upadhyay 4 months ago

Hey Radek,

makes sense, I created a debug branch https://github.com/ceph/ceph-ci/pull/new/wip-crush-debug and might have packages tomorrow available, can rerun in either failing env or some teuthology test so that we can understand this better, will update soon.

#4 Updated by Radoslaw Zarzynski 4 months ago

NOT A FIX (extra debugs): https://github.com/ceph/ceph/pull/48513

#5 Updated by Deepika Upadhyay 3 months ago

  • Pull request ID set to 48513

#6 Updated by Deepika Upadhyay 3 months ago

  • Pull request ID deleted (48513)

#7 Updated by Yuri Weinstein about 2 months ago

Radoslaw Zarzynski wrote:

NOT A FIX (extra debugs): https://github.com/ceph/ceph/pull/48513

merged

#8 Updated by Radoslaw Zarzynski about 2 months ago

  • Related to Feature #58168: extra debugs for: [mon] high cpu usage by fn_monstore thread added

Also available in: Atom PDF