Bug #44363

With the RDMA protocol stack, ceph-mon exceeds the 65536 open-file limit, causing connection failures

Added by chunsong feng about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

When using the RDMA protocol stack, ceph-mon exceeds its limit of 65536 open files, which causes connections to fail.
Does ceph-mon not close its file descriptors when a connection is dropped? If every new connection opens new descriptors that are never released, the process will eventually hit the open-file limit and accept() will fail.
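One way to check whether descriptors are really leaking is to sample the monitor's open descriptor count over time and compare it with its limit. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name fd_usage is hypothetical and not part of Ceph.

```python
import os


def fd_usage(pid="self"):
    """Return (open_fd_count, soft_limit) for a process, read from /proc.

    Hypothetical helper for illustration only, not part of Ceph.
    soft_limit is None when the limit is reported as "unlimited".
    """
    # Each entry in /proc/<pid>/fd is one open descriptor. For pid="self"
    # the count can include the descriptor used to read this directory.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))

    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            # Expected shape: "Max open files  <soft>  <hard>  files"
            if line.startswith("Max open files"):
                value = line.split()[3]
                soft_limit = None if value == "unlimited" else int(value)
                break
    return open_fds, soft_limit
```

Calling fd_usage with the ceph-mon PID every few seconds while clients reconnect should show whether the count keeps climbing towards the limit. Reading another process's /proc entries normally requires root or the same user as that process.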

2020-03-01T04:33:48.356+0800 ffff93b94080 1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:33:53.644+0800 ffff94395080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:33:56.080+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:33:57.100+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:02.880+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:04.664+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:06.528+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:06.540+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:13.560+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:25.280+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:25.868+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:39.980+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:34:48.868+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:08.640+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:23.288+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:28.896+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:32.596+0800 ffff93b94080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:35:57.720+0800 ffff8eb8a080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:36:15.776+0800 ffff94395080 -1 Infiniband recv_cm_meta got error -104: (104) Connection reset by peer
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor - accept open file descriptions limit reached sd = 42 errno 24 (24) Too many open files
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor - accept open file descriptions limit reached sd = 42 errno 24 (24) Too many open files
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor - accept open file descriptions limit reached sd = 42 errno 24 (24) Too many open files
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor - accept open file descriptions limit reached sd = 42 errno 24 (24) Too many open files
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor - accept open file descriptions limit reached sd = 42 errno 24 (24) Too many open files
2020-03-01T04:36:49.996+0800 ffff8eb8a080 -1 Processor - Proccessor accept has encountered enough error numbers, just do ceph_abort().
/home/chunsong/ceph/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread ffff8eb8a080 time 2020-03-01T04:36:50.000291+0800
/home/chunsong/ceph/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")
ceph version 15.1.0-16-g9bfc37687b (9bfc37687b6645bb38e1a9f3da81148c09c19a28) octopus (rc)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe4) [0xffff98b3a924]
2: (Processor::accept()+0x96c) [0xffff98de861c]
3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x51c) [0xffff98e38d54]
4: (()+0x5ac080) [0xffff98e41080]
5: (std::function<void ()>::operator()() const+0x34) [0xaaaadc818780]
6: (void std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&)+0x1c) [0xaaaadcb3bd24]
7: (std::__invoke_result<std::function<void ()>>::type std::__invoke<std::function<void ()>>(std::function<void ()>&&)+0x38) [0xaaaadcb3bca8]
8: (void std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>)+0x20) [0xaaaadcb3bc30]
9: (std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()()+0x28) [0xaaaadcb3bbe8]
10: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run()+0x18) [0xaaaadcb3bbb4]
11: (()+0xc9ed4) [0xffff985a7ed4]
12: (()+0x7088) [0xffff986fe088]
2020-03-01T04:36:50.004+0800 ffff8eb8a080 -1 /home/chunsong/ceph/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread ffff8eb8a080 time 2020-03-01T04:36:50.000291+0800
/home/chunsong/ceph/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")

ceph version 15.1.0-16-g9bfc37687b (9bfc37687b6645bb38e1a9f3da81148c09c19a28) octopus (rc)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe4) [0xffff98b3a924]
2: (Processor::accept()+0x96c) [0xffff98de861c]
3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x51c) [0xffff98e38d54]
4: (()+0x5ac080) [0xffff98e41080]
5: (std::function<void ()>::operator()() const+0x34) [0xaaaadc818780]
6: (void std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&)+0x1c) [0xaaaadcb3bd24]
7: (std::__invoke_result<std::function<void ()>>::type std::__invoke<std::function<void ()>>(std::function<void ()>&&)+0x38) [0xaaaadcb3bca8]
8: (void std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>)+0x20) [0xaaaadcb3bc30]
9: (std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()()+0x28) [0xaaaadcb3bbe8]
10: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run()+0x18) [0xaaaadcb3bbb4]
11: (()+0xc9ed4) [0xffff985a7ed4]
12: (()+0x7088) [0xffff986fe088]
*** Caught signal (Aborted) **
in thread ffff8eb8a080 thread_name:msgr-worker-0
ceph version 15.1.0-16-g9bfc37687b (9bfc37687b6645bb38e1a9f3da81148c09c19a28) octopus (rc)
1: (__kernel_rt_sigreturn()+0) [0xffffa17f75c0]
2: (raise()+0xb0) [0xffff982d94d8]
terminate called after throwing an instance of 'std::runtime_error'
what(): random_device::random_device(const std::string&)
root@node1:~/ssd#
[1]+ Aborted (core dumped) /usr/bin/ceph-mon -f --cluster ceph --id node1 --setuser ceph --setgroup ceph (wd: /MLNX_OFED_LINUX-4.7-1.0.0.1-ubuntu18.04-aarch64)
(wd now: ~/ssd)
root@node1:~/ssd# ulimit -a
core file size (blocks, -c) 10240
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2057959
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 2057959
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
root@node1:~/ssd# ceph-conf -D | grep ms_max_accept_failures
ms_max_accept_failures = 4
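
The log above shows five "Too many open files" errors followed by the abort, with ms_max_accept_failures set to 4. That is consistent with a counter of consecutive accept failures that triggers the abort once it exceeds the configured value. The sketch below is a simplified model of that behaviour under this assumption, not the actual Processor::accept() code; the names run_accept_loop and AcceptAborted are hypothetical, and resetting the counter on a successful accept is also assumed.

```python
import errno


class AcceptAborted(RuntimeError):
    """Stands in for ceph_abort() in this simplified model (hypothetical)."""


def run_accept_loop(results, max_accept_failures=4):
    """Simulate an accept loop over a sequence of accept() outcomes.

    results: iterable of errno values, with 0 meaning a successful accept.
    Returns the number of accepted connections. Raises AcceptAborted once
    the number of consecutive failures exceeds max_accept_failures.
    Assumption: the counter resets after every successful accept.
    """
    consecutive_failures = 0
    accepted = 0
    for err in results:
        if err == 0:
            accepted += 1
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures > max_accept_failures:
            raise AcceptAborted(
                f"{consecutive_failures} consecutive accept failures, "
                f"last error: {errno.errorcode.get(err, err)}"
            )
    return accepted
```

In this model, four EMFILE results in a row are tolerated and the fifth raises AcceptAborted, which matches the five error lines logged before the abort. Raising ms_max_accept_failures or the open-file limit would only delay the abort if descriptors are in fact never released.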
