Bug #42583
openmsgr2 sometimes crashes during shutdown
Description
I'm running a large number of radosgw-admin processes and occasionally see crashes during shutdown. It's very rare: I've seen about 3 crashes in roughly 1 million runs.
Log, filtered to the crashing thread plus the last few messages:
-86> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): get_auth_request con 0x5617d998ea40 auth_method 0
-85> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): get_auth_request method 2 preferred_modes [2,1]
-84> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): _init_auth method 2
(...)
-80> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): handle_auth_reply_more payload 9
-79> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): handle_auth_reply_more payload_len 9
-78> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
(...)
-72> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): handle_auth_done global_id 69640371 payload 931
-71> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient: _finish_hunting 0
-70> 2019-10-31 12:09:11.121 7fb4d5089700 1 monclient: found mon.noname-a
-69> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient: _send_mon_message to mon.noname-a at v2:[XXXX]:3300/0
(...)
-9> 2019-10-31 12:09:11.147 7fb495fe3700 1 RGWAsyncRadosProcessor::m_tp worker finish
-8> 2019-10-31 12:09:11.147 7fb4957e2700 1 RGWAsyncRadosProcessor::m_tp worker finish
-7> 2019-10-31 12:09:11.148 7fb4e884a6c0 5 asok(0x5617d97dc910) unregister_commands sync trace active
-6> 2019-10-31 12:09:11.148 7fb4e884a6c0 5 asok(0x5617d97dc910) unregister_commands sync trace active_short
-5> 2019-10-31 12:09:11.148 7fb4e884a6c0 5 asok(0x5617d97dc910) unregister_commands sync trace history
-4> 2019-10-31 12:09:11.148 7fb4e884a6c0 5 asok(0x5617d97dc910) unregister_commands sync trace show
-3> 2019-10-31 12:09:11.148 7fb4e884a6c0 5 asok(0x5617d97dc910) unregister_command cr dump
-2> 2019-10-31 12:09:14.948 7fb4beffd700 -1 RGWWatcher::handle_error cookie 94660434821984 err (107) Transport endpoint is not connected
-1> 2019-10-31 12:09:15.972 7fb4cd7fa700 10 monclient: tick
0> 2019-10-31 12:09:19.410 7fb4d5089700 -1 *** Caught signal (Aborted) **
in thread 7fb4d5089700 thread_name:msgr-worker-2

ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
1: (()+0xf5d0) [0x7fb4de6575d0]
2: (gsignal()+0x37) [0x7fb4db53d2c7]
3: (abort()+0x148) [0x7fb4db53e9b8]
4: (()+0x78e17) [0x7fb4db57fe17]
5: (()+0x81609) [0x7fb4db588609]
6: (()+0x8dd8a) [0x7fb4e8192d8a]
7: (()+0x8cc52) [0x7fb4e8191c52]
8: (()+0xc029a) [0x7fb4e81c529a]
9: (()+0xc788c) [0x7fb4e81cc88c]
10: (()+0xc8ad3) [0x7fb4e81cdad3]
11: (()+0xe3ff2) [0x7fb4e81e8ff2]
12: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x676) [0x7fb4deca6cb6]
13: (ProtocolV2::handle_message()+0x9b6) [0x7fb4ded98de6]
14: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x7fb4dedac950]
15: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x178) [0x7fb4dedacbb8]
16: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x7fb4ded91ca4]
17: (AsyncConnection::process()+0x186) [0x7fb4ded5f696]
18: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x7bd) [0x7fb4dedb5a7d]
19: (()+0x5567e5) [0x7fb4dedba7e5]
20: (()+0x7ebb6f) [0x7fb4df04fb6f]
21: (()+0x7dd5) [0x7fb4de64fdd5]
22: (clone()+0x6d) [0x7fb4db60502d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
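For anyone who wants to reduce a full crash dump the same way, here is a minimal Python sketch of that filtering. It assumes each recent-events entry has the layout seen in the log above (index, date, time, thread id, level, message). The names `ENTRY` and `filter_by_thread` are hypothetical and not part of any Ceph tooling.

```python
import re

# Assumed entry layout, taken from the log above:
#   <index>> <date> <time> <thread id> <level> <message>
ENTRY = re.compile(
    r"^\s*(-?\d+)>\s+(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d+)"
    r"\s+([0-9a-f]+)\s+(-?\d+)\s+(.*)$"
)


def filter_by_thread(lines, thread_id, tail=10):
    """Keep entries logged by thread_id plus the last `tail` entries of any thread.

    Lines that do not match the assumed layout (backtrace frames, blank
    lines) are skipped.
    """
    parsed = []
    for line in lines:
        match = ENTRY.match(line)
        if match:
            parsed.append((match.group(4), line.rstrip("\n")))
    cutoff = len(parsed) - tail
    return [
        text
        for position, (tid, text) in enumerate(parsed)
        if tid == thread_id or position >= cutoff
    ]
```

Calling `filter_by_thread(open(path), "7fb4d5089700")` on a saved log would keep the msgr-worker-2 entries and the final ten messages, where `path` is whatever file the dump was written to.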
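Running CentOS 7.6 in an IPv6-only setup. Going by the timestamps, there are about 4.5 seconds between the "Transport endpoint is not connected" error and the crash. The only thing logged in between is a monclient tick, followed by about 3.5 seconds of nothing before the crash.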
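The gaps can be checked directly from the log timestamps. This is a minimal Python sketch using the three timestamps of the last entries in the log above. The helper name `gap_seconds` is hypothetical.

```python
from datetime import datetime

# Timestamp format used by the log entries above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def gap_seconds(earlier, later):
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds()


error_ts = "2019-10-31 12:09:14.948"  # RGWWatcher::handle_error (entry -2)
tick_ts = "2019-10-31 12:09:15.972"   # monclient: tick (entry -1)
crash_ts = "2019-10-31 12:09:19.410"  # Caught signal (Aborted) (entry 0)

print(round(gap_seconds(error_ts, crash_ts), 3))  # 4.462
print(round(gap_seconds(tick_ts, crash_ts), 3))   # 3.438
```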
I've seen this crash 3 times. It only happens during shutdown, and the "Transport endpoint is not connected" message is always present when it crashes (although I sometimes get that error without a crash).
Unfortunately this is not my system, and I can't install debug symbols on it at the moment: it got Ceph from mirrorlist.centos.org, which doesn't seem to have the debuginfo package.