Bug #47499

Simultaneous MDS and OSD crashes when replying to a client

Added by Enrico Bocchi over 3 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Regression:
No
Severity:
3 - minor

Description

We observed 4 MDSes and 2 OSDs segfaulting simultaneously while replying to a single client. All six backtraces report a connection problem when talking to that client, together with the error bad tag 50.

We have the core dumps and can inspect anything needed to help debug.
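
For context on the bad tag 50 message: in the msgr v1 wire protocol every frame on an established connection starts with a single CEPH_MSGR_TAG_* byte, and 50 (0x32) does not correspond to any known tag, so the reader has lost sync with the byte stream. Below is a rough sketch of that check; it is illustrative only, not the actual AsyncConnection::process() code, and the tag names are stand-ins without the real constant values.

#include <cstdint>
#include <iostream>

// Stand-ins for the CEPH_MSGR_TAG_* constants (illustrative, real values omitted).
enum class Tag : uint8_t { Msg, Ack, Keepalive, Close };

// Simplified sketch of the tag dispatch that produces the ".process bad tag N" line.
bool known_tag(uint8_t byte) {
  switch (static_cast<Tag>(byte)) {
  case Tag::Msg:        // read envelope + payload
  case Tag::Ack:        // advance ack sequence
  case Tag::Keepalive:  // refresh keepalive timer
  case Tag::Close:      // peer closed
    return true;
  default:
    std::cerr << "process bad tag " << unsigned(byte) << "\n";
    return false;       // the connection is faulted and torn down
  }
}

int main() { return known_tag(50) ? 0 : 1; }  // 50 is the value seen in the MDS log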

From an MDS log:

2020-09-10 13:22:47.693801 7fad2ce4b700  0 -- 188.184.103.149:6800/4048826873 >> 137.138.62.184:0/470651361 conn(0x55c50234b000 :6800 s=STATE_OPEN pgs=670662 cs=1 l=0).process bad tag 50
2020-09-10 13:22:47.727865 7fad2d64c700  0 -- 188.184.103.149:6800/4048826873 >> 137.138.62.184:0/470651361 conn(0x55c515d50000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2020-09-10 13:22:47.748509 7fad2d64c700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fad2d64c700 thread_name:msgr-worker-1

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (()+0x5dcb31) [0x55c4d3a11b31]
 2: (()+0xf630) [0x7fad2ff20630]
 3: (ceph::buffer::ptr::copy_out(unsigned int, unsigned int, char*) const+0x90) [0x55c4d3a14000]
 4: (ceph::buffer::list::iterator_impl<false>::copy(unsigned int, char*)+0x4f) [0x55c4d3a1bbcf]
 5: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, std::unique_ptr<AuthAuthorizerChallenge, std::default_delete<AuthAuthorizerChallenge> >*, ceph::buffer::list&)+0x115) [0x55c4d3ccb455]
 6: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*, std::unique_ptr<AuthAuthorizerChallenge, std::default_delete<AuthAuthorizerChallenge> >*)+0x216) [0x55c4d3cc01a6]
 7: (MDSDaemon::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&, std::unique_ptr<AuthAuthorizerChallenge, std::default_delete<AuthAuthorizerChallenge> >*)+0x189) [0x55c4d36fce99]
 8: (AsyncConnection::handle_connect_msg(ceph_msg_connect&, ceph::buffer::list&, ceph::buffer::list&)+0x4ba) [0x55c4d3d8c61a]
 9: (AsyncConnection::_process_connection()+0x1e97) [0x55c4d3d921b7]
 10: (AsyncConnection::process()+0x768) [0x55c4d3d95bb8]
 11: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x55c4d3aec289]
 12: (()+0x6b9e4e) [0x55c4d3aeee4e]
 13: (()+0xb5070) [0x7fad2f890070]
 14: (()+0x7ea5) [0x7fad2ff18ea5]
 15: (clone()+0x6d) [0x7fad2eff38dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

From the core dump:

#0  0x00007fad2ff204fb in raise () from /lib64/libpthread.so.0
#1  0x000055c4d3a11c06 in reraise_fatal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  maybe_inline_memcpy (inline_len=8, l=1, src=0x229475be4cbf96d9, dest=0x7fad2d64976a) at /usr/src/debug/ceph-12.2.12/src/include/inline_memory.h:40
#5  ceph::buffer::ptr::copy_out (this=<optimized out>, o=<optimized out>, l=<optimized out>, dest=0x7fad2d64976a "\364.\255\177") at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:1039
#6  0x000055c4d3a1bbcf in ceph::buffer::list::iterator_impl<false>::copy (this=0x7fad2d649bb0, len=<optimized out>, dest=0x7fad2d64976a "\364.\255\177")
    at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:1248
#7  0x000055c4d3ccb455 in decode_raw<unsigned char> (p=..., t=@0x7fad2d64976a: 244 '\364') at /usr/src/debug/ceph-12.2.12/src/include/encoding.h:74
#8  decode (p=..., v=@0x7fad2d64976a: 244 '\364') at /usr/src/debug/ceph-12.2.12/src/include/encoding.h:81
#9  cephx_verify_authorizer (cct=cct@entry=0x55c4dd89e1c0, keys=keys@entry=0x55c4ddb01130, indata=..., ticket_info=..., challenge=challenge@entry=0x55c50234c5f0, reply_bl=...)
    at /usr/src/debug/ceph-12.2.12/src/auth/cephx/CephxProtocol.cc:407
#10 0x000055c4d3cc01a6 in CephxAuthorizeHandler::verify_authorizer (this=<optimized out>, cct=0x55c4dd89e1c0, keys=0x55c4ddb01130, authorizer_data=..., authorizer_reply=..., entity_name=..., 
    global_id=@0x7fad2d649ec8: 0, caps_info=..., session_key=..., auid=0x0, challenge=0x55c50234c5f0) at /usr/src/debug/ceph-12.2.12/src/auth/cephx/CephxAuthorizeHandler.cc:26
#11 0x000055c4d36fce99 in MDSDaemon::ms_verify_authorizer (this=0x55c4ddb04000, con=0x55c50234b000, peer_type=<optimized out>, protocol=<optimized out>, authorizer_data=..., authorizer_reply=..., 
    is_valid=@0x7fad2d64a33f: false, session_key=..., challenge=0x55c50234c5f0) at /usr/src/debug/ceph-12.2.12/src/mds/MDSDaemon.cc:1338
#12 0x000055c4d3d8c61a in ms_deliver_verify_authorizer (challenge=0x55c50234c5f0, session_key=..., isvalid=@0x7fad2d64a33f: false, authorizer_reply=..., authorizer=..., protocol=2, peer_type=8, 
    con=0x55c50234b000, this=0x55c4dda02000) at /usr/src/debug/ceph-12.2.12/src/msg/Messenger.h:814
#13 verify_authorizer (challenge=0x55c50234c5f0, session_key=..., isvalid=@0x7fad2d64a33f: false, auth_reply=..., auth=..., protocol=2, peer_type=8, con=0x55c50234b000, this=0x55c4dda02000)
    at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncMessenger.h:390
#14 AsyncConnection::handle_connect_msg (this=this@entry=0x55c50234b000, connect=..., authorizer_bl=..., authorizer_reply=...) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1539
#15 0x000055c4d3d921b7 in AsyncConnection::_process_connection (this=this@entry=0x55c50234b000) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1341
#16 0x000055c4d3d95bb8 in AsyncConnection::process (this=0x55c50234b000) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:844
#17 0x000055c4d3aec289 in EventCenter::process_events (this=this@entry=0x55c4dd896c80, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, 
    working_dur=working_dur@entry=0x7fad2d64b0f0) at /usr/src/debug/ceph-12.2.12/src/msg/async/Event.cc:411
#18 0x000055c4d3aeee4e in NetworkStack::__lambda4::operator() (__closure=0x55c4dd8800d0) at /usr/src/debug/ceph-12.2.12/src/msg/async/Stack.cc:51
#19 0x00007fad2f890070 in ?? () from /lib64/libstdc++.so.6
#20 0x00007fad2ff18ea5 in start_thread () from /lib64/libpthread.so.0
#21 0x00007fad2eff38dd in clone () from /lib64/libc.so.6

Investigating with gdb, the crash appears to be caused by a bogus source address at:

#5  ceph::buffer::ptr::copy_out (this=<optimized out>, o=<optimized out>, l=<optimized out>, dest=0x7fad2d64976a "\364.\255\177") at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:1039
1039        maybe_inline_memcpy(dest, src, l, 8);

(gdb) p src
$4 = 0x229475be4cbf96d9 <Address 0x229475be4cbf96d9 out of bounds>
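
For what it's worth, here is a minimal sketch (simplified stand-ins, not the real ceph::buffer classes) of what frame #5 does: copy_out() computes src from the backing raw buffer's base pointer plus an offset and memcpy's it into the caller's stack variable, so a freed or clobbered raw buffer shows up exactly like this, as a wild src inside maybe_inline_memcpy.

#include <cstring>

struct raw {                 // stand-in for ceph::buffer::raw
  char*    data;
  unsigned len;
};

struct ptr {                 // stand-in for ceph::buffer::ptr
  raw*     _raw;
  unsigned _off, _len;

  void copy_out(unsigned o, unsigned l, char* dest) const {
    const char* src = _raw->data + _off + o;  // wild pointer if _raw is stale
    std::memcpy(dest, src, l);                // Ceph: maybe_inline_memcpy(dest, src, l, 8)
  }
};

int main() {
  char backing[16] = {0x02};                  // e.g. the cephx struct_v byte being decoded
  raw  r{backing, sizeof(backing)};
  ptr  p{&r, 0, sizeof(backing)};

  char v = 0;
  p.copy_out(0, 1, &v);                       // fine while `backing` is alive; a freed or
                                              // overwritten raw buffer gives the wild src
                                              // seen in the core (0x229475be4cbf96d9)
  return v == 0x02 ? 0 : 1;
}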

#1

Updated by Dan van der Ster over 3 years ago

It just happened again on a different MDS with a different client, and I found something in common: in all the crashes, the last messenger log line in the crashing thread before the segfault is ".handle_connect_msg stop myself to swap existing".

cephflax-mds-31fab05c63.cern.ch:
2020-09-10 13:22:47.584943 7f766c5f7700 -1 *** Caught signal (Segmentation fault) **
    -7> 2020-09-10 13:22:47.580041 7f766c5f7700  1 -- 188.184.102.153:6800/1608200344 >> 137.138.62.184:0/378194776 conn(0x557b67d36000 :6800 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
     0> 2020-09-10 13:22:47.584943 7f766c5f7700 -1 *** Caught signal (Segmentation fault) **

cephflax-mds-4180953653.cern.ch:
2020-09-10 13:22:47.748509 7fad2d64c700 -1 *** Caught signal (Segmentation fault) **
    -8> 2020-09-10 13:22:47.727895 7fad2d64c700  1 -- 188.184.103.149:6800/4048826873 >> 137.138.62.184:0/470651361 conn(0x55c515d50000 :6800 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
     0> 2020-09-10 13:22:47.748509 7fad2d64c700 -1 *** Caught signal (Segmentation fault) **

cephflax-mds-ab1cb9e2f4.cern.ch:
2020-09-10 13:23:52.877931 7f54670f9700 -1 *** Caught signal (Segmentation fault) **
   -10> 2020-09-10 13:23:52.867246 7f54670f9700  1 -- 137.138.159.168:6800/1504920787 >> 137.138.62.184:0/470651361 conn(0x55f59d8cd000 :6800 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
     0> 2020-09-10 13:23:52.877931 7f54670f9700 -1 *** Caught signal (Segmentation fault) **

cephflax-mds-cb1958edb2.cern.ch:
2020-09-18 18:08:55.219653 7fdd41529700 -1 *** Caught signal (Segmentation fault) **
   -10> 2020-09-18 18:08:55.210610 7fdd41529700  1 -- 188.184.103.143:6800/2237929101 >> 137.138.123.117:0/1719469613 conn(0x563d9439a800 :6800 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
     0> 2020-09-18 18:08:55.219653 7fdd41529700 -1 *** Caught signal (Segmentation fault) **
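
One way to read this (just a hypothesis): the reconnect handling that logs "stop myself to swap existing" releases or replaces per-connection state while the same worker thread is still decoding the authorizer payload from it, which would explain the wild src pointer in frame #5. Below is a generic sketch of that interleaving; it is illustrative only and not Ceph's AsyncConnection code.

#include <cstdio>
#include <vector>

struct Connection {
  std::vector<char> authorizer_data;       // buffer the verifier decodes from
  bool              replaced = false;      // set by the "swap existing" path
};

// Step 1: the verifier takes a raw view into the connection's buffer.
const char* start_verify(Connection& c) { return c.authorizer_data.data(); }

// Step 2: a reconnect from the same client replaces the connection and frees
// its buffers ("handle_connect_msg stop myself to swap existing").
void swap_existing(Connection& c) {
  c.authorizer_data.clear();
  c.authorizer_data.shrink_to_fit();       // the raw view from step 1 now dangles
  c.replaced = true;
}

int main() {
  Connection c;
  c.authorizer_data.assign(64, 0x02);

  const char* view = start_verify(c);      // step 1
  swap_existing(c);                        // step 2 lands before decoding finishes

  // Step 3: decoding through `view` here would be a use-after-free, the
  // suspected source of the wild pointer in frame #5. We only print it.
  std::printf("dangling view at %p, replaced=%d\n", (void*)view, (int)c.replaced);
  return 0;
}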
#2

Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to CephFS