Bug #47499
Simultaneous MDS and OSD crashes when replying to a client
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We observed 4 MDSes and 2 OSDs segfaulting simultaneously while replying to a single client. All six tracebacks report a connection problem when replying to that client, with the error "bad tag 50".
We have the core dumps and can inspect anything needed to help debug.
From an MDS log:
2020-09-10 13:22:47.693801 7fad2ce4b700 0 -- 188.184.103.149:6800/4048826873 >> 137.138.62.184:0/470651361 conn(0x55c50234b000 :6800 s=STATE_OPEN pgs=670662 cs=1 l=0).process bad tag 50
2020-09-10 13:22:47.727865 7fad2d64c700 0 -- 188.184.103.149:6800/4048826873 >> 137.138.62.184:0/470651361 conn(0x55c515d50000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2020-09-10 13:22:47.748509 7fad2d64c700 -1 *** Caught signal (Segmentation fault) **
in thread 7fad2d64c700 thread_name:msgr-worker-1
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
1: (()+0x5dcb31) [0x55c4d3a11b31]
2: (()+0xf630) [0x7fad2ff20630]
3: (ceph::buffer::ptr::copy_out(unsigned int, unsigned int, char*) const+0x90) [0x55c4d3a14000]
4: (ceph::buffer::list::iterator_impl<false>::copy(unsigned int, char*)+0x4f) [0x55c4d3a1bbcf]
5: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, std::unique_ptr<AuthAuthorizerChallenge, std::default_delete<AuthAuthorizerChallenge> >*, ceph::buffer::list&)+0x115) [0x55c4d3ccb455]
6: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*, std::unique_ptr<AuthAuthorizerChallenge, std::default_delete<AuthAuthorizerChallenge> >*)+0x216) [0x55c4d3cc01a6]
7: (MDSDaemon::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&, std::unique_ptr<AuthAuthorizerChallenge, std::default_delete<AuthAuthorizerChallenge> >*)+0x189) [0x55c4d36fce99]
8: (AsyncConnection::handle_connect_msg(ceph_msg_connect&, ceph::buffer::list&, ceph::buffer::list&)+0x4ba) [0x55c4d3d8c61a]
9: (AsyncConnection::_process_connection()+0x1e97) [0x55c4d3d921b7]
10: (AsyncConnection::process()+0x768) [0x55c4d3d95bb8]
11: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x55c4d3aec289]
12: (()+0x6b9e4e) [0x55c4d3aeee4e]
13: (()+0xb5070) [0x7fad2f890070]
14: (()+0x7ea5) [0x7fad2ff18ea5]
15: (clone()+0x6d) [0x7fad2eff38dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
From the core dump:
#0 0x00007fad2ff204fb in raise () from /lib64/libpthread.so.0
#1 0x000055c4d3a11c06 in reraise_fatal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 maybe_inline_memcpy (inline_len=8, l=1, src=0x229475be4cbf96d9, dest=0x7fad2d64976a) at /usr/src/debug/ceph-12.2.12/src/include/inline_memory.h:40
#5 ceph::buffer::ptr::copy_out (this=<optimized out>, o=<optimized out>, l=<optimized out>, dest=0x7fad2d64976a "\364.\255\177") at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:1039
#6 0x000055c4d3a1bbcf in ceph::buffer::list::iterator_impl<false>::copy (this=0x7fad2d649bb0, len=<optimized out>, dest=0x7fad2d64976a "\364.\255\177")
at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:1248
#7 0x000055c4d3ccb455 in decode_raw<unsigned char> (p=..., t=@0x7fad2d64976a: 244 '\364') at /usr/src/debug/ceph-12.2.12/src/include/encoding.h:74
#8 decode (p=..., v=@0x7fad2d64976a: 244 '\364') at /usr/src/debug/ceph-12.2.12/src/include/encoding.h:81
#9 cephx_verify_authorizer (cct=cct@entry=0x55c4dd89e1c0, keys=keys@entry=0x55c4ddb01130, indata=..., ticket_info=..., challenge=challenge@entry=0x55c50234c5f0, reply_bl=...)
at /usr/src/debug/ceph-12.2.12/src/auth/cephx/CephxProtocol.cc:407
#10 0x000055c4d3cc01a6 in CephxAuthorizeHandler::verify_authorizer (this=<optimized out>, cct=0x55c4dd89e1c0, keys=0x55c4ddb01130, authorizer_data=..., authorizer_reply=..., entity_name=...,
global_id=@0x7fad2d649ec8: 0, caps_info=..., session_key=..., auid=0x0, challenge=0x55c50234c5f0) at /usr/src/debug/ceph-12.2.12/src/auth/cephx/CephxAuthorizeHandler.cc:26
#11 0x000055c4d36fce99 in MDSDaemon::ms_verify_authorizer (this=0x55c4ddb04000, con=0x55c50234b000, peer_type=<optimized out>, protocol=<optimized out>, authorizer_data=..., authorizer_reply=...,
is_valid=@0x7fad2d64a33f: false, session_key=..., challenge=0x55c50234c5f0) at /usr/src/debug/ceph-12.2.12/src/mds/MDSDaemon.cc:1338
#12 0x000055c4d3d8c61a in ms_deliver_verify_authorizer (challenge=0x55c50234c5f0, session_key=..., isvalid=@0x7fad2d64a33f: false, authorizer_reply=..., authorizer=..., protocol=2, peer_type=8,
con=0x55c50234b000, this=0x55c4dda02000) at /usr/src/debug/ceph-12.2.12/src/msg/Messenger.h:814
#13 verify_authorizer (challenge=0x55c50234c5f0, session_key=..., isvalid=@0x7fad2d64a33f: false, auth_reply=..., auth=..., protocol=2, peer_type=8, con=0x55c50234b000, this=0x55c4dda02000)
at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncMessenger.h:390
#14 AsyncConnection::handle_connect_msg (this=this@entry=0x55c50234b000, connect=..., authorizer_bl=..., authorizer_reply=...) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1539
#15 0x000055c4d3d921b7 in AsyncConnection::_process_connection (this=this@entry=0x55c50234b000) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1341
#16 0x000055c4d3d95bb8 in AsyncConnection::process (this=0x55c50234b000) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:844
#17 0x000055c4d3aec289 in EventCenter::process_events (this=this@entry=0x55c4dd896c80, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000,
working_dur=working_dur@entry=0x7fad2d64b0f0) at /usr/src/debug/ceph-12.2.12/src/msg/async/Event.cc:411
#18 0x000055c4d3aeee4e in NetworkStack::__lambda4::operator() (__closure=0x55c4dd8800d0) at /usr/src/debug/ceph-12.2.12/src/msg/async/Stack.cc:51
#19 0x00007fad2f890070 in ?? () from /lib64/libstdc++.so.6
#20 0x00007fad2ff18ea5 in start_thread () from /lib64/libpthread.so.0
#21 0x00007fad2eff38dd in clone () from /lib64/libc.so.6
Investigating with gdb, the crash appears to stem from a bogus source address at:
#5 ceph::buffer::ptr::copy_out (this=<optimized out>, o=<optimized out>, l=<optimized out>, dest=0x7fad2d64976a "\364.\255\177") at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:1039
1039 maybe_inline_memcpy(dest, src, l, 8);
(gdb) p src
$4 = 0x229475be4cbf96d9 <Address 0x229475be4cbf96d9 out of bounds>
Updated by Dan van der Ster over 3 years ago
It just happened again on a different MDS with a different client, and I found something in common: in all the crashes, the last messenger log line in the crashing thread before the segfault is ".handle_connect_msg stop myself to swap existing".
cephflax-mds-31fab05c63.cern.ch:
2020-09-10 13:22:47.584943 7f766c5f7700 -1 *** Caught signal (Segmentation fault) **
-7> 2020-09-10 13:22:47.580041 7f766c5f7700 1 -- 188.184.102.153:6800/1608200344 >> 137.138.62.184:0/378194776 conn(0x557b67d36000 :6800 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
0> 2020-09-10 13:22:47.584943 7f766c5f7700 -1 *** Caught signal (Segmentation fault) **
cephflax-mds-4180953653.cern.ch:
2020-09-10 13:22:47.748509 7fad2d64c700 -1 *** Caught signal (Segmentation fault) **
-8> 2020-09-10 13:22:47.727895 7fad2d64c700 1 -- 188.184.103.149:6800/4048826873 >> 137.138.62.184:0/470651361 conn(0x55c515d50000 :6800 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
0> 2020-09-10 13:22:47.748509 7fad2d64c700 -1 *** Caught signal (Segmentation fault) **
cephflax-mds-ab1cb9e2f4.cern.ch:
2020-09-10 13:23:52.877931 7f54670f9700 -1 *** Caught signal (Segmentation fault) **
-10> 2020-09-10 13:23:52.867246 7f54670f9700 1 -- 137.138.159.168:6800/1504920787 >> 137.138.62.184:0/470651361 conn(0x55f59d8cd000 :6800 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
0> 2020-09-10 13:23:52.877931 7f54670f9700 -1 *** Caught signal (Segmentation fault) **
cephflax-mds-cb1958edb2.cern.ch:
2020-09-18 18:08:55.219653 7fdd41529700 -1 *** Caught signal (Segmentation fault) **
-10> 2020-09-18 18:08:55.210610 7fdd41529700 1 -- 188.184.103.143:6800/2237929101 >> 137.138.123.117:0/1719469613 conn(0x563d9439a800 :6800 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
0> 2020-09-18 18:08:55.219653 7fdd41529700 -1 *** Caught signal (Segmentation fault) **