Bug #64103
mon: ms_dispatch thread taking 100% CPU for ~2 minutes
Description
Issue
During the start-up phase of a minimal vstart cluster (v18.2.1: 1 mon, 1 mgr, 3 memstore OSDs) inside a container (Fedora 38, Docker CE 24.0.7, build afdd53b), upon receiving the `ceph osd pool create rbd...` command, ceph-mon's ms_dispatch thread starts consuming 100% CPU for ~120 seconds. After that it produces a second ~120-second 100% CPU burst, and then the cluster returns to normal.
Running strace -kk -f -p $(pidof -s ceph-mon)
inside the container during the 100% CPU phase produces the following dump:
1034 close(64692) = -1 EBADF (Bad file descriptor)
> /usr/lib64/libc.so.6(__close+0x3b) [0x13f3cb]
> /usr/lib64/ceph/libceph-common.so.2(CrushTester::test_with_fork(ceph::common::CephContext*, int)+0x3c2) [0x6222b2]
> /usr/bin/ceph-mon(OSDMonitor::prepare_new_pool(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int,
unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, float, std::__cxx11::basic_string<char, std::c
har_traits<char>, std::allocator<char> > const&, unsigned int, unsigned long, OSDMonitor::FastReadType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, std::ostream*)+0x532) [0x47a8a2]
> /usr/bin/ceph-mon(OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&)+0x1af07) [0x4a0167]
> /usr/bin/ceph-mon(OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x314) [0x484a14]
> /usr/bin/ceph-mon(OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x14a) [0x44c17a]
> /usr/bin/ceph-mon(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x81a) [0x422d6a]
> /usr/bin/ceph-mon(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x23ef) [0x32c0bf]
> /usr/bin/ceph-mon(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x9e7) [0x3380e7]
> /usr/bin/ceph-mon(Monitor::_ms_dispatch(Message*)+0x7b0) [0x339390]
> /usr/bin/ceph-mon(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x4d) [0x2f3e8d]
> /usr/lib64/ceph/libceph-common.so.2(DispatchQueue::entry()+0x539) [0x358cc9]
> /usr/lib64/ceph/libceph-common.so.2(DispatchQueue::DispatchThread::entry()+0x10) [0x3e9750]
> /usr/lib64/libc.so.6(start_thread+0x2d1) [0xa1911]
> /usr/lib64/libc.so.6(__clone+0x43) [0x3f313]
1034 close(64693) = -1 EBADF (Bad file descriptor)
close() succeeds for file descriptors 0 through 46, but from 47 onwards it returns an EBADF error.
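This pattern matches a post-fork cleanup loop in CrushTester::test_with_fork: the child walks every candidate descriptor up to the RLIMIT_NOFILE soft limit and close()s it, getting EBADF for every fd that was never open. With open files set to unlimited (as under Docker CE, see the ulimit output below), that bound is effectively in the billions, which plausibly explains the ~120 s of busy looping. A minimal sketch of the mechanism (illustrative only, not Ceph's actual code):

```python
import errno
import os
import resource

# Soft limit a close-all-fds cleanup loop would iterate up to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft limit:", soft)

# close() on a descriptor that is not open fails with EBADF,
# exactly as seen in the strace dump above.
fd = os.open("/dev/null", os.O_RDONLY)
os.close(fd)              # first close succeeds
try:
    os.close(fd)          # second close: fd is no longer valid
except OSError as e:
    print("stale close ->", errno.errorcode[e.errno])
```

With `ulimit -n` at 1024 the loop issues roughly a thousand close() calls; with it unlimited, the kernel-clamped limit makes the same loop run for minutes.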
ceph-mon logs during that CPU peak look like:
2024-01-18T00:54:22.254+0000 7f1dbd151640 1 --2- [v2:192.168.13.2:10000/0,v1:192.168.13.2:10001/0] >> conn(0x558310bde000 0x558310fad080 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2024-01-18T00:54:22.254+0000 7f1dbd151640 10 mon.a@0(leader) e1 handle_auth_request con 0x558310bde000 (start) method 2 payload 22
2024-01-18T00:54:22.254+0000 7f1dbd151640 10 mon.a@0(leader).auth v6 _assign_global_id 4156 (max 14096)
2024-01-18T00:54:22.254+0000 7f1dbd151640 10 mon.a@0(leader) e1 handle_auth_request con 0x558310bde000 (more) method 2 payload 36
2024-01-18T00:54:22.254+0000 7f1dbd151640 10 mon.a@0(leader) e1 ms_handle_fast_authentication session 0x558310c0f200 con 0x558310bde000 addr - MonSession(unknown.0 is open , features 0x0 (unknown))
2024-01-18T00:54:22.255+0000 7f1dbd151640 1 --2- [v2:192.168.13.2:10000/0,v1:192.168.13.2:10001/0] >> 192.168.13.2:0/2799323384 conn(0x558310bde000 0x558310fad080 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 crypto rx=0x5583117633b0 tx=0x558311788960 comp rx=0 tx=0).ready entity=client.? client_cookie=e1596264e13397d6 server_cookie=0 in_seq=0 out_seq=0
2024-01-18T00:54:24.731+0000 7f1db8147640 1 -- 192.168.13.2:0/94 --> [v2:192.168.13.2:6800/263,v1:192.168.13.2:6801/263] -- mgrreport(mon.a +0-0 packed 1150) v9 -- 0x558310c0d180 con 0x558310ee5c00
2024-01-18T00:54:29.731+0000 7f1db8147640 1 -- 192.168.13.2:0/94 --> [v2:192.168.13.2:6800/263,v1:192.168.13.2:6801/263] -- mgrreport(mon.a +0-0 packed 1150) v9 -- 0x558310c0d500 con 0x558310ee5c00
Environment
Reproduced with Docker CE v24.0.7 (but not with Moby-engine Docker v24.0.7, the Fedora default), on both Fedora and Photon OS (VMware's OS for VMs).
ulimit output inside the Docker CE environment:
ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 126697
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) unlimited
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
ulimit output inside the Moby-engine environment (note that open files is 1024 here, vs unlimited under Docker CE):
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127006
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The vstart cluster is created and launched as in the ceph-nvmeof repo. It is RPM-based (v18.2.1 packages) but uses vstart to configure and launch the services:
git clone https://github.com/ceph/ceph-nvmeof.git
cd ceph-nvmeof
make pull SVC=ceph
make up SVC=ceph
# In a different terminal
make shell SVC=ceph
top
Workaround
Setting `ulimit -n 1024` in the container is enough to "fix" this issue.
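The same cap can be applied programmatically via setrlimit(2), which is what the shell's `ulimit -n 1024` does under the hood. A minimal sketch (the value 1024 mirrors the Moby-engine default shown above):

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("before:", soft)

# Lower only the soft limit, as `ulimit -n 1024` would. Children of
# this process (e.g. a ceph-mon started from this shell) inherit it,
# so a close-all-fds loop stays bounded at 1024 iterations.
new_soft = 1024 if hard == resource.RLIM_INFINITY else min(1024, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print("after:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

Equivalently, Docker's `--ulimit nofile=1024:1024` option on `docker run` applies the cap to the whole container.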
Updated by Ernesto Puerta 4 months ago
- Description updated (diff)
- Component(RADOS) deleted (Messenger)