Bug #64103
Updated by Ernesto Puerta 4 months ago
h3. Issue
During the start-up phase of a minimal vstart cluster (v18.2.1, 1 mon, 1 mgr, 3 memstore OSDs) inside a container (Fedora 38, Docker CE 24.0.7, build afdd53b), upon receiving the @ceph osd pool create rbd ...@ command, ceph-mon's ms_dispatch threads start consuming 100% CPU for ~120 seconds. A second ~100% CPU spike follows, and then the cluster returns to normal.
Running @strace -kk -f -p $(pidof -s ceph-mon)@ inside the container during the 100% CPU phase produces the following dump:
<pre><code class="text">
1034 close(64692) = -1 EBADF (Bad file descriptor)
> /usr/lib64/libc.so.6(__close+0x3b) [0x13f3cb]
> /usr/lib64/ceph/libceph-common.so.2(CrushTester::test_with_fork(ceph::common::CephContext*, int)+0x3c2) [0x6222b2]
> /usr/bin/ceph-mon(OSDMonitor::prepare_new_pool(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int,
unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, float, std::__cxx11::basic_string<char, std::c
har_traits<char>, std::allocator<char> > const&, unsigned int, unsigned long, OSDMonitor::FastReadType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, std::ostream*)+0x532) [0x47a8a2]
> /usr/bin/ceph-mon(OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&)+0x1af07) [0x4a0167]
> /usr/bin/ceph-mon(OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x314) [0x484a14]
> /usr/bin/ceph-mon(OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x14a) [0x44c17a]
> /usr/bin/ceph-mon(PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x81a) [0x422d6a]
> /usr/bin/ceph-mon(Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x23ef) [0x32c0bf]
> /usr/bin/ceph-mon(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x9e7) [0x3380e7]
> /usr/bin/ceph-mon(Monitor::_ms_dispatch(Message*)+0x7b0) [0x339390]
> /usr/bin/ceph-mon(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x4d) [0x2f3e8d]
> /usr/lib64/ceph/libceph-common.so.2(DispatchQueue::entry()+0x539) [0x358cc9]
> /usr/lib64/ceph/libceph-common.so.2(DispatchQueue::DispatchThread::entry()+0x10) [0x3e9750]
> /usr/lib64/libc.so.6(start_thread+0x2d1) [0xa1911]
> /usr/lib64/libc.so.6(__clone+0x43) [0x3f313]
1034 close(64693) = -1 EBADF (Bad file descriptor)
</code></pre>
For file descriptors 0 through 46 the @close()@ succeeds, but from 47 onwards every call returns @EBADF@.
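The stack trace points at @CrushTester::test_with_fork@, which (per the strace above) issues @close()@ on ever-increasing descriptor numbers in the forked child. The cost of that close-all-descriptors pattern scales with @RLIMIT_NOFILE@, not with the number of descriptors actually open. A minimal sketch of the pattern (not the actual Ceph code; the 65536 cap is only there so the sketch finishes quickly):
<pre><code class="cpp">
#include <cassert>
#include <cerrno>
#include <sys/resource.h>
#include <unistd.h>

int main() {
    struct rlimit rl;
    int rc = getrlimit(RLIMIT_NOFILE, &rl);
    assert(rc == 0);

    // Under Docker CE, "open files" is unlimited, so a close-all loop
    // bounded by this limit makes an enormous number of syscalls per
    // fork. Capped here so the sketch finishes instantly.
    rlim_t limit = (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > 65536)
                       ? 65536 : rl.rlim_cur;

    long ebadf = 0;
    for (rlim_t fd = 3; fd < limit; ++fd) {
        // Almost all of these fds were never open, so close() fails
        // with EBADF -- exactly the pattern seen in the strace dump.
        if (close((int)fd) == -1 && errno == EBADF)
            ++ebadf;
    }
    assert(ebadf > 0);
    return 0;
}
</code></pre>
With an unlimited descriptor limit, the loop bound degenerates into a huge number, which would account for the ~120-second 100% CPU phase per fork.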
The ceph-mon log during that CPU peak looks like this:
<pre><code class="text">
2024-01-18T00:54:22.254+0000 7f1dbd151640 1 --2- [v2:192.168.13.2:10000/0,v1:192.168.13.2:10001/0] >> conn(0x558310bde000 0x558310fad080 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2024-01-18T00:54:22.254+0000 7f1dbd151640 10 mon.a@0(leader) e1 handle_auth_request con 0x558310bde000 (start) method 2 payload 22
2024-01-18T00:54:22.254+0000 7f1dbd151640 10 mon.a@0(leader).auth v6 _assign_global_id 4156 (max 14096)
2024-01-18T00:54:22.254+0000 7f1dbd151640 10 mon.a@0(leader) e1 handle_auth_request con 0x558310bde000 (more) method 2 payload 36
2024-01-18T00:54:22.254+0000 7f1dbd151640 10 mon.a@0(leader) e1 ms_handle_fast_authentication session 0x558310c0f200 con 0x558310bde000 addr - MonSession(unknown.0 is open , features 0x0 (unknown))
2024-01-18T00:54:22.255+0000 7f1dbd151640 1 --2- [v2:192.168.13.2:10000/0,v1:192.168.13.2:10001/0] >> 192.168.13.2:0/2799323384 conn(0x558310bde000 0x558310fad080 secure :-1 s=READY pgs=1 cs=0 l=1 rev1=1 crypto rx=0x5583117633b0 tx=0x558311788960 comp rx=0 tx=0).ready entity=client.? client_cookie=e1596264e13397d6 server_cookie=0 in_seq=0 out_seq=0
2024-01-18T00:54:24.731+0000 7f1db8147640 1 -- 192.168.13.2:0/94 --> [v2:192.168.13.2:6800/263,v1:192.168.13.2:6801/263] -- mgrreport(mon.a +0-0 packed 1150) v9 -- 0x558310c0d180 con 0x558310ee5c00
2024-01-18T00:54:29.731+0000 7f1db8147640 1 -- 192.168.13.2:0/94 --> [v2:192.168.13.2:6800/263,v1:192.168.13.2:6801/263] -- mgrreport(mon.a +0-0 packed 1150) v9 -- 0x558310c0d500 con 0x558310ee5c00
</code></pre>
h3. Environment
Reproduced with "Docker CE v24.0.7":https://docs.docker.com/engine/install/fedora/ (but not with moby-engine Docker v24.0.7, the Fedora default), on both Fedora and Photon OS (VMware's minimal Linux for cloud workloads).
Ulimit inside the Docker CE environment:
<pre><code>
$ ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 126697
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) unlimited
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
</code></pre>
Ulimit inside the moby-engine environment:
<pre><code>
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127006
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
</code></pre>
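The only material difference between the two outputs is the open-files limit: unlimited under Docker CE vs. 1024 under moby-engine. What the process actually sees can be checked with @getrlimit()@ (a sketch; the exact values depend on the kernel and container runtime):
<pre><code class="cpp">
#include <cstdio>
#include <sys/resource.h>

int main() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    // Under Docker CE's default the soft limit is "unlimited"
    // (RLIM_INFINITY) or a very large number; under moby-engine it is
    // 1024. Any loop bounded by this value scales accordingly.
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("RLIMIT_NOFILE soft=unlimited\n");
    else
        printf("RLIMIT_NOFILE soft=%llu\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}
</code></pre>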
The vstart cluster is created and launched as in "the ceph-nvmeof repo":https://github.com/ceph/ceph-nvmeof. It is RPM-based (v18.2.1 packages) but "uses vstart for the configuration and launch of the services":https://github.com/ceph/ceph-nvmeof/blob/devel/Dockerfile.ceph#L62-L93.
<pre><code class="text">
git clone https://github.com/ceph/ceph-nvmeof.git
cd ceph-nvmeof
make pull SVC=ceph
make up SVC=ceph
# In a different terminal
make shell SVC=ceph
top
</code></pre>
h3. Workaround
Setting @ulimit -n 1024@ in the container (the moby-engine value shown above) is enough to "fix" this issue.
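The shell workaround amounts to capping the soft @RLIMIT_NOFILE@ in the launching process before the daemons fork; a minimal sketch of the same fix done programmatically (lowering the soft limit requires no privileges):
<pre><code class="cpp">
#include <cassert>
#include <sys/resource.h>

int main() {
    // Equivalent of `ulimit -n 1024`: cap the soft open-files limit so
    // any close-all-fds loop in a forked child stays cheap.
    struct rlimit rl;
    int rc = getrlimit(RLIMIT_NOFILE, &rl);
    assert(rc == 0);

    rl.rlim_cur = (rl.rlim_max == RLIM_INFINITY || rl.rlim_max > 1024)
                      ? 1024 : rl.rlim_max;
    rc = setrlimit(RLIMIT_NOFILE, &rl);
    assert(rc == 0);

    // Verify the new soft limit took effect.
    struct rlimit check;
    rc = getrlimit(RLIMIT_NOFILE, &check);
    assert(rc == 0 && check.rlim_cur <= 1024);
    return 0;
}
</code></pre>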