Bug #44363
Using the RDMA protocol stack, ceph-mon reports "too many open files" after exceeding the 65536 limit, causing link failures
Description
When using the RDMA protocol stack, ceph-mon reports "too many open files" once it exceeds the 65536 open-files limit, causing link failures.
Does ceph-mon fail to close the file descriptor when a link is disconnected? If so, constantly establishing new links opens new descriptors, and they accumulate until the open-files limit is exhausted and accept fails.
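One way to check whether descriptors are leaking is to watch the daemon's fd count grow while links churn. A minimal sketch using /proc (the `pidof ceph-mon` lookup is an assumption; the snippet below counts its own shell's fds only so it runs standalone):

```shell
# Count the open file descriptors of a process via /proc/<pid>/fd.
# To target the monitor, use: pid=$(pidof ceph-mon)
pid=$$
fd_count=$(ls /proc/"$pid"/fd | wc -l)
echo "pid $pid has $fd_count open fds"
```

Sampling this repeatedly while clients reconnect should show a roughly stable count; a count that rises monotonically toward `ulimit -n` with no corresponding traffic growth suggests sockets are not being closed on disconnect.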
2020-03-01T04:33:48.356+0800 ffff93b94080 1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer accept open file descriptions limit reached sd = 42 errno
2020-03-01T04:33:53.644+0800 ffff94395080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:33:56.080+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:33:57.100+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:02.880+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:04.664+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:06.528+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:06.540+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:13.560+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:25.280+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:25.868+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:39.980+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:48.868+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:08.640+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:23.288+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:28.896+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:32.596+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:57.720+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:36:15.776+0800 ffff94395080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor -24 (24) Too many open files accept open file descriptions limit reached sd = 42 errno
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor -24 (24) Too many open files accept open file descriptions limit reached sd = 42 errno
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor -24 (24) Too many open files accept open file descriptions limit reached sd = 42 errno
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor -24 (24) Too many open files accept open file descriptions limit reached sd = 42 errno
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor -24 (24) Too many open files Proccessor accept has encountered enough error numbers, just do ceph_abort().
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor -
/home/chunsong/ceph/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread ffff8eb8a080 time 2020-03-01T04:36:50.000291+0800
/home/chunsong/ceph/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")
ceph version 15.1.0-16-g9bfc37687b (9bfc37687b6645bb38e1a9f3da81148c09c19a28) octopus (rc)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe4) [0xffff98b3a924]
2: (Processor::accept()+0x96c) [0xffff98de861c]
3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x51c) [0xffff98e38d54]
4: (()+0x5ac080) [0xffff98e41080]
5: (std::function<void ()>::operator()() const+0x34) [0xaaaadc818780]
6: (void std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&)+0x1c) [0xaaaadcb3bd24]
7: (std::__invoke_result<std::function<void ()>>::type std::__invoke<std::function<void ()>>(std::function<void ()>&&)+0x38) [0xaaaadcb3bca8]
8: (void std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>)+0x20) [0xaaaadcb3bc30]
9: (std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()()+0x28) [0xaaaadcb3bbe8]
10: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run()+0x18) [0xaaaadcb3bbb4]
11: (()+0xc9ed4) [0xffff985a7ed4]
12: (()+0x7088) [0xffff986fe088]
2020-03-01T04:36:50.004+0800 ffff8eb8a080 -1 /home/chunsong/ceph/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread ffff8eb8a080 time 2020-03-01T04:36:50.000291+0800
/home/chunsong/ceph/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")
ceph version 15.1.0-16-g9bfc37687b (9bfc37687b6645bb38e1a9f3da81148c09c19a28) octopus (rc)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe4) [0xffff98b3a924]
2: (Processor::accept()+0x96c) [0xffff98de861c]
3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x51c) [0xffff98e38d54]
4: (()+0x5ac080) [0xffff98e41080]
5: (std::function<void ()>::operator()() const+0x34) [0xaaaadc818780]
6: (void std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&)+0x1c) [0xaaaadcb3bd24]
7: (std::__invoke_result<std::function<void ()>>::type std::__invoke<std::function<void ()>>(std::function<void ()>&&)+0x38) [0xaaaadcb3bca8]
8: (void std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>)+0x20) [0xaaaadcb3bc30]
9: (std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()()+0x28) [0xaaaadcb3bbe8]
10: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run()+0x18) [0xaaaadcb3bbb4]
11: (()+0xc9ed4) [0xffff985a7ed4]
12: (()+0x7088) [0xffff986fe088]
*** Caught signal (Aborted) **
in thread ffff8eb8a080 thread_name:msgr-worker-0
ceph version 15.1.0-16-g9bfc37687b (9bfc37687b6645bb38e1a9f3da81148c09c19a28) octopus (rc)
1: (__kernel_rt_sigreturn()+0) [0xffffa17f75c0]
2: (raise()+0xb0) [0xffff982d94d8]
terminate called after throwing an instance of 'std::runtime_error'
what(): random_device::random_device(const std::string&)
root@node1:~/ssd#
[1]+ Aborted (core dumped) /usr/bin/ceph-mon -f --cluster ceph --id node1 --setuser ceph --setgroup ceph (wd: /MLNX_OFED_LINUX-4.7-1.0.0.1-ubuntu18.04-aarch64)
(wd now: ~/ssd)
root@node1:/ssd# ulimit -a
core file size (blocks, -c) 10240
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2057959
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 2057959
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
root@node1:~/ssd# ceph-conf -D | grep ms_max_accept_failures
ms_max_accept_failures = 4
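`ms_max_accept_failures = 4` means the async messenger gives up and calls `ceph_abort()` after four consecutive accept() failures, which matches the four `Too many open files` Processor lines above. As a workaround for the symptom (not a fix for the underlying descriptor leak), the soft open-files limit can be inspected and raised before launching ceph-mon; a sketch, assuming the hard limit permits it:

```shell
# Show the current soft and hard open-files limits.
ulimit -Sn
ulimit -Hn
# Raise the soft limit up to the hard limit in this shell; a daemon
# started from the same shell inherits the new limit. (Any explicit
# value chosen here must not exceed `ulimit -Hn`.)
ulimit -Sn "$(ulimit -Hn)" 2>/dev/null || true
ulimit -Sn
```

This only delays the abort if descriptors keep leaking; the fd count will still climb toward the new limit.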
History
#1 Updated by Neha Ojha almost 4 years ago
- Project changed from bluestore to Messengers