Bug #42583

open

msgr2 sometimes crashes during shutdown

Added by Paul Emmerich over 4 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm running a large number of radosgw-admin processes and I'm seeing occasional crashes during shutdown. It's very rare (roughly 3 crashes in 1 million runs?).

Log, filtered to the crashing thread, plus the last few messages:

   -86> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): get_auth_request con 0x5617d998ea40 auth_method 0
   -85> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): get_auth_request method 2 preferred_modes [2,1]
   -84> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): _init_auth method 2
(...)
   -80> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): handle_auth_reply_more payload 9
   -79> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): handle_auth_reply_more payload_len 9
   -78> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): handle_auth_reply_more responding with 36 bytes
(...)
   -72> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient(hunting): handle_auth_done global_id 69640371 payload 931
   -71> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient: _finish_hunting 0
   -70> 2019-10-31 12:09:11.121 7fb4d5089700  1 monclient: found mon.noname-a
   -69> 2019-10-31 12:09:11.121 7fb4d5089700 10 monclient: _send_mon_message to mon.noname-a at v2:[XXXX]:3300/0
(...)
    -9> 2019-10-31 12:09:11.147 7fb495fe3700  1 RGWAsyncRadosProcessor::m_tp worker finish
    -8> 2019-10-31 12:09:11.147 7fb4957e2700  1 RGWAsyncRadosProcessor::m_tp worker finish
    -7> 2019-10-31 12:09:11.148 7fb4e884a6c0  5 asok(0x5617d97dc910) unregister_commands sync trace active
    -6> 2019-10-31 12:09:11.148 7fb4e884a6c0  5 asok(0x5617d97dc910) unregister_commands sync trace active_short
    -5> 2019-10-31 12:09:11.148 7fb4e884a6c0  5 asok(0x5617d97dc910) unregister_commands sync trace history
    -4> 2019-10-31 12:09:11.148 7fb4e884a6c0  5 asok(0x5617d97dc910) unregister_commands sync trace show
    -3> 2019-10-31 12:09:11.148 7fb4e884a6c0  5 asok(0x5617d97dc910) unregister_command cr dump
    -2> 2019-10-31 12:09:14.948 7fb4beffd700 -1 RGWWatcher::handle_error cookie 94660434821984 err (107) Transport endpoint is not connected
    -1> 2019-10-31 12:09:15.972 7fb4cd7fa700 10 monclient: tick
     0> 2019-10-31 12:09:19.410 7fb4d5089700 -1 *** Caught signal (Aborted) **
 in thread 7fb4d5089700 thread_name:msgr-worker-2

 ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
 1: (()+0xf5d0) [0x7fb4de6575d0]
 2: (gsignal()+0x37) [0x7fb4db53d2c7]
 3: (abort()+0x148) [0x7fb4db53e9b8]
 4: (()+0x78e17) [0x7fb4db57fe17]
 5: (()+0x81609) [0x7fb4db588609]
 6: (()+0x8dd8a) [0x7fb4e8192d8a]
 7: (()+0x8cc52) [0x7fb4e8191c52]
 8: (()+0xc029a) [0x7fb4e81c529a]
 9: (()+0xc788c) [0x7fb4e81cc88c]
 10: (()+0xc8ad3) [0x7fb4e81cdad3]
 11: (()+0xe3ff2) [0x7fb4e81e8ff2]
 12: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x676) [0x7fb4deca6cb6]
 13: (ProtocolV2::handle_message()+0x9b6) [0x7fb4ded98de6]
 14: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x7fb4dedac950]
 15: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x178) [0x7fb4dedacbb8]
 16: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x7fb4ded91ca4]
 17: (AsyncConnection::process()+0x186) [0x7fb4ded5f696]
 18: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x7bd) [0x7fb4dedb5a7d]
 19: (()+0x5567e5) [0x7fb4dedba7e5]
 20: (()+0x7ebb6f) [0x7fb4df04fb6f]
 21: (()+0x7dd5) [0x7fb4de64fdd5]
 22: (clone()+0x6d) [0x7fb4db60502d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
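Until the executable or debug symbols are available (see the NOTE above), the frames can at least be catalogued mechanically. Below is a minimal, hypothetical helper (not part of Ceph; the regex is an assumption based on the frame format printed above) that splits each frame into symbol, in-object offset, and absolute address, so the anonymous `(()+0x...)` frames can later be fed to `addr2line` or `objdump` against the matching binaries:

```python
import re

# Matches frames like:
#   " 8: (()+0xc029a) [0x7fb4e81c529a]"
#   " 13: (ProtocolV2::handle_message()+0x9b6) [0x7fb4ded98de6]"
# Assumed format: "<idx>: (<symbol>+<offset>) [<absolute address>]".
FRAME_RE = re.compile(
    r"^\s*(?P<idx>\d+):\s+\((?P<sym>.*?)\+(?P<off>0x[0-9a-f]+)\)\s+\[(?P<addr>0x[0-9a-f]+)\]$"
)

def parse_frame(line):
    """Return (index, symbol or None, offset, absolute address), or None.

    An empty "()" symbol (a frame with no exported name) is reported as
    None; its offset can still be resolved with addr2line once the
    containing shared object is identified from /proc maps or a core.
    """
    m = FRAME_RE.match(line)
    if not m:
        return None
    sym = m.group("sym")
    if sym == "()":
        sym = None  # anonymous frame: only the offset is usable
    return (int(m.group("idx")), sym,
            int(m.group("off"), 16), int(m.group("addr"), 16))

if __name__ == "__main__":
    for f in (" 8: (()+0xc029a) [0x7fb4e81c529a]",
              " 13: (ProtocolV2::handle_message()+0x9b6) [0x7fb4ded98de6]"):
        print(parse_frame(f))
```

This only structures the data; actually symbolizing the anonymous frames still requires the original binaries and their debuginfo, as the NOTE says.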

Running CentOS 7.6 in an IPv6-only setup. It looks like there are about 3.5 seconds of inactivity between the "Transport endpoint is not connected" error and the crash.

I've seen this crash 3 times. It only happens during shutdown, and the "Transport endpoint is not connected" message is always present when it crashes. (But I sometimes get that error without a crash.)

Unfortunately it's not my system, and I can't install debug symbols on it at the moment: it got Ceph from mirrorlist.centos.org, which doesn't seem to carry the debuginfo package?

Actions #1

Updated by Brad Hubbard over 4 years ago

Possibly related to #42026?
