Bug #57782: [mon] high cpu usage by fn_monstore thread
Status: Open
Description
We observed high CPU usage by the ms_dispatch and fn_monstore threads (amounting to 99-100% in top). The Ceph deployment was with Rook.
We ran gdb and perf to better understand what the origin might be, and found:
100.00% 0.00% ms_dispatch ceph-mon [.] Monitor::handle_command
        |
        ---Monitor::handle_command
           PaxosService::dispatch
           OSDMonitor::prepare_update
           OSDMonitor::prepare_command
           OSDMonitor::prepare_command_impl
           OSDMonitor::prepare_new_pool
           CrushTester::test_with_fork
           |
            --99.81%--__libc_close (inlined)
              |
              |--33.42%--entry_SYSCALL_64_after_hwframe
              |          do_syscall_64
              |          |
              |          |--22.22%--syscall_enter_from_user_mode
              |          |
              |          |--6.06%--__x64_sys_close
              |          |          |
              |          |           --4.57%--close_fd
              |          |                     |
              |          |                      --3.56%--pick_file
              |          |                                |
              |          |                                 --2.72%--_raw_spin_lock
              |          |                                           |
              |          |                                            --1.38%--preempt_count_add
              |          |
              |          |--2.48%--syscall_exit_to_user_mode
              |          |          |
              |          |           --2.11%--syscall_exit_to_user_mode_prepare
              |          |                     |
              |          |                      --0.62%--__audit_syscall_exit
              |          |
              |           --1.84%--syscall_trace_enter.constprop.0
              |                     |
              |                      --0.76%--__audit_syscall_entry
              |
              |--1.79%--__pthread_enable_asynccancel
              |
               --1.24%--__pthread_disable_asynccancel
CrushTester leads to __libc_close consuming 99% of the CPU instead of creating a fork:
---Monitor::handle_command
   PaxosService::dispatch
   OSDMonitor::prepare_update
   OSDMonitor::prepare_command
   OSDMonitor::prepare_command_impl
   OSDMonitor::prepare_new_pool
   CrushTester::test_with_fork
   __libc_close (inlined)
   entry_SYSCALL_64_after_hwframe
   do_syscall_64
   syscall_enter_from_user_mode
From the mon logs, termination can be observed:
2022-09-15T14:29:06.155+0000 7fe451263700 10 -- [v2:10.6.125.48:3300/0,v1:10.6.125.48:6789/0] >> 10.6.137.218:0/260641845 conn(0x5638020d9400 msgr2=0x5637fa16f600 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 96 remaining bytes 0
2022-09-15T14:29:06.209+0000 7fe458271700 -1 received signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2022-09-15T14:29:06.209+0000 7fe458271700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
There has been no assert failure or anything of the sort; adding it here to see if anyone has any feedback or has hit this issue so far.
Updated by Radoslaw Zarzynski over 1 year ago
- Related to Bug #46266: Monitor crashed in creating pool in CrushTester::test_with_fork() added
Updated by Radoslaw Zarzynski over 1 year ago
- Status changed from New to Need More Info
It looks like we're burning CPU in close(2). The single call site I can spot is in write_data_set_to_csv. Let's analyze the path to it from test_with_fork.
int CrushTester::test_with_fork(int timeout)
{
  ostringstream sink;
  int r = fork_function(timeout, sink, [&]() {
    return test();
  });
  if (r == -ETIMEDOUT) {
    err << "timed out during smoke test (" << timeout << " seconds)";
  }
  return r;
}

int CrushTester::test()
{
  // ...
  for (int r = min_rule; r < crush.get_max_rules() && r <= max_rule; r++) {
    // ...
    for (int nr = min_rep; nr <= max_rep; nr++) {
      // ...
      if (output_csv)
        write_data_set_to_csv(output_data_file_name+rule_tag,tester_data);
    }
What if those loops' conditions are wrong? Unfortunately, this code entirely lacks debug output, so verifying this hypothesis is problematic. How about adding some to the code?
Updated by Deepika Upadhyay over 1 year ago
Hey Radek,
makes sense, I created a debug branch https://github.com/ceph/ceph-ci/pull/new/wip-crush-debug and might have packages available tomorrow. I can rerun in either the failing env or some teuthology test so that we can understand this better; will update soon.
Updated by Radoslaw Zarzynski over 1 year ago
NOT A FIX (extra debugs): https://github.com/ceph/ceph/pull/48513
Updated by Yuri Weinstein over 1 year ago
Updated by Radoslaw Zarzynski over 1 year ago
- Related to Feature #58168: extra debugs for: [mon] high cpu usage by fn_monstore thread added
Updated by Lucian Petrut about 1 year ago
We hit the same issue while creating pools:
#0 0x00007fd77d91b5a0 in close@plt () from /usr/lib64/ceph/libceph-common.so.2
#1 0x00007fd77df6c26b in CrushTester::test_with_fork(int) () from /usr/lib64/ceph/libceph-common.so.2
#2 0x000055beecf7825a in OSDMonitor::prepare_new_pool(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned long, OSDMonitor::FastReadType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::ostream*) () at /usr/src/debug/ceph-17.2.5-0.el8.x86_64/src/mon/OSDMonitor.cc:8058
#3 0x000055beecf97315 in OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&) () at /usr/src/debug/ceph-17.2.5-0.el8.x86_64/src/mon/OSDMonitor.cc:13100
#4 0x000055beecf9accd in OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>) () at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/atomic_base.h:377
#5 0x000055beecf9b31b in OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>) () at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/atomic_base.h:377
#6 0x000055beecf0ac0e in PaxosService::dispatch (this=0x55beef035000, op=...) at /usr/src/debug/ceph-17.2.5-0.el8.x86_64/src/mon/PaxosService.cc:96
The fd close loop hangs, taking 100% CPU. We were able to work around it by disabling the crush smoke test: mon_osd_crush_smoke_test = false.
In our case, it only occurs when running ceph-mon in a container on top of Flatcar; the same ceph-mon container runs fine on top of an Ubuntu host.
Updated by Radoslaw Zarzynski about 1 year ago
In the upcoming v17.2.6 we'll have the extra debugs (see https://github.com/ceph/ceph/pull/50406). Would you mind retrying there? The alternative would be to make a custom build.
Updated by Peter Goron 11 months ago
Hello,
Reporting the same issue with a ceph-16.2.10 cluster deployed on top of Kubernetes (rook-1.9.12) when adding a new pool to an existing cluster.
The ceph-mon subprocess is using 100% CPU.
(gdb) thread apply all bt
Thread 1 (Thread 0x7fcc3f68f700 (LWP 2216616)):
#0 0x00007fcc4af16a77 in close () from target:/lib64/libpthread.so.0
#1 0x00007fcc4d799da4 in CrushTester::test_with_fork(int) () from target:/usr/lib64/ceph/libceph-common.so.2
#2 0x000055d49cf23dfc in OSDMonitor::prepare_new_pool(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned long, OSDMonitor::FastReadType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::ostream*) ()
#3 0x000055d49cf52403 in OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&) ()
#4 0x000055d49cf575d4 in OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>) ()
#5 0x000055d49cf5b8b3 in OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>) ()
#6 0x000055d49cedbecd in PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>) ()
#7 0x000055d49cdc1834 in Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>) ()
#8 0x000055d49cdc6300 in Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>) ()
#9 0x000055d49cdc7650 in Monitor::_ms_dispatch(Message*) ()
#10 0x000055d49cdf654c in Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
#11 0x00007fcc4d421baa in DispatchQueue::entry() () from target:/usr/lib64/ceph/libceph-common.so.2
#12 0x00007fcc4d4d3891 in DispatchQueue::DispatchThread::entry() () from target:/usr/lib64/ceph/libceph-common.so.2
#13 0x00007fcc4af0d17a in start_thread () from target:/lib64/libpthread.so.0
#14 0x00007fcc49c3bdc3 in clone () from target:/lib64/libc.so.6
strace extract:
strace: Process 2216616 attached
close(661150481) = -1 EBADF (Bad file descriptor)
close(661150482) = -1 EBADF (Bad file descriptor)
close(661150483) = -1 EBADF (Bad file descriptor)
close(661150484) = -1 EBADF (Bad file descriptor)
close(661150485) = -1 EBADF (Bad file descriptor)
close(661150486) = -1 EBADF (Bad file descriptor)
close(661150487) = -1 EBADF (Bad file descriptor)
close(661150488) = -1 EBADF (Bad file descriptor)
close(661150489) = -1 EBADF (Bad file descriptor)
close(661150490) = -1 EBADF (Bad file descriptor)
close(661150491) = -1 EBADF (Bad file descriptor)
close(661150492) = -1 EBADF (Bad file descriptor)
close(661150493) = -1 EBADF (Bad file descriptor)
close(661150494) = -1 EBADF (Bad file descriptor)
close(661150495) = -1 EBADF (Bad file descriptor)
close(661150496) = -1 EBADF (Bad file descriptor)
close(661150497) = -1 EBADF (Bad file descriptor)
...
I tried to reproduce the smoke test with crushtool -i /tmp/crushmap --test --min_x 0 --max_x 50 --rule 8 --show-mappings, without success so far.
https://github.com/ceph/ceph/blob/v16.2.10/src/mon/OSDMonitor.cc#L7986
if (g_conf()->mon_osd_crush_smoke_test) {
  CrushWrapper newcrush;
  _get_pending_crush(newcrush);
  ostringstream err;
  CrushTester tester(newcrush, err);
  tester.set_min_x(0);
  tester.set_max_x(50);
  tester.set_rule(crush_rule);
  auto start = ceph::coarse_mono_clock::now();
  r = tester.test_with_fork(g_conf()->mon_lease);
  ...
I don't understand how the "close" syscall can be invoked, given that CrushTester csv output is disabled by default and not enabled by the smoke test. I would bet on some kind of memory corruption.
Updated by Radoslaw Zarzynski 10 months ago
Would you mind trying with a fresh quincy (it has some extra debugs)? A custom pacific build might also be a way.
Updated by Peter Goron 10 months ago
Hi Radoslaw,
I will try to reproduce first. We have disabled mon_osd_crush_smoke_test to create our pools.
If, after re-enabling mon_osd_crush_smoke_test, the crushmap is still in a state that triggers the problem, then I can try to do a custom build with your debug logs (for various reasons we are stuck on 16.2.10 for now).
It will take a bit of time, because I will have to create a custom build chain (we need to rebuild RPMs and integrate non-official RPM sources into our Docker images).
Updated by Peter Goron about 2 months ago
Hi,
Thanks to this article https://blog.palark.com/sre-troubleshooting-ceph-systemd-containerd/, I think the root cause is clear. In our case:
- mon is running in a container in kube.
- the new containerd version allows a very high number of file descriptors
- the fork code tries to close all parent file descriptors (inlined, which is why it doesn't appear explicitly in the stack trace)
The author of the article proposes a fix:
https://github.com/ceph/ceph/pull/50622/files
[root@kube-worker0033 ceph]# ps aux
USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START  TIME COMMAND
ceph         1  1.9  1.4 1588672 979500 ?     Ssl  Mar12 32:21 ceph-mon --fsid=615c945a-4d04-4e06-95cc-7200ae7589b2 --keyring=/etc/ceph/keyring-store/keyring --log-to
root     88365  0.2  0.0   14268   3428 pts/0 Ss   16:08  0:00 bash
root     88387  0.0  0.0   46756   3436 pts/0 R+   16:08  0:00 ps aux
[root@kube-worker0033 ceph]# cat /proc/1/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1073741816           1073741816           files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       255812               255812               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Updated by Radoslaw Zarzynski about 2 months ago
- Status changed from Need More Info to Fix Under Review
- Pull request ID set to 50622
The fix awaits QA.