Bug #37910
segv during crc of incoming message front
Status: Closed
% Done: 0%
Regression: No
Severity: 3 - minor
Description
  -207> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).handle_message_front r=0
  -206> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).handle_message_front got front 659621
  -205> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).read_message_middle
  -204> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).read_message_data_prepare
  -203> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).read_message_data msg_left=0
  -202> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).read_message_footer
  -201> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 msgr2=0x5615360f7600 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=21
  -200> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_FOOTER_AND_DISPATCH pgs=165 cs=23 l=0).handle_message_footer r=0
  -199> 2019-01-14 00:29:22.245 7f6ea2dad700 10 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_FOOTER_AND_DISPATCH pgs=165 cs=23 l=0).handle_message_footer aborted = 0
  -198> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_FOOTER_AND_DISPATCH pgs=165 cs=23 l=0).handle_message_footer got 659621 + 0 + 0 byte message
    -8> 2019-01-14 00:29:22.250 7f6ea2dad700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f6ea2dad700 thread_name:msgr-worker-2

 ceph version 14.0.1-2510-g74ba84a (74ba84a6b0855849aee1d2fd3678e94a657542ce) nautilus (dev)
 1: (()+0xf6d0) [0x7f6ea887d6d0]
 2: (ceph::buffer::list::crc32c(unsigned int) const+0x6b) [0x561528ae4c3b]
 3: (decode_message(CephContext*, int, ceph_msg_header&, ceph_msg_footer&, ceph::buffer::list&, ceph::buffer::list&, ceph::buffer::list&, Connection*)+0x2f2) [0x561528a54a42]
 4: (ProtocolV2::handle_message_footer(char*, int)+0xeb) [0x561528c704cb]
 5: (()+0xf15bdd) [0x561528c68bdd]
 6: (AsyncConnection::process()+0x18c) [0x561528c4b2ec]
 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1585) [0x561528ac44d5]
(gdb) bt
#0  0x00007f6ea887d59b in raise () from /lib64/libpthread.so.0
#1  0x00005615288b4f65 in reraise_fatal (signum=11) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/global/signal_handler.cc:298
#3  <signal handler called>
#4  0x0000561528ae4c3b in test_and_set (__m=std::memory_order_acquire, this=0x50) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/atomic_base.h:176
#5  spin_lock (lock=...) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/include/spinlock.h:50
#6  lock (this=0x50) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/include/spinlock.h:39
#7  lock_guard (__m=..., this=<synthetic pointer>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:162
#8  get_crc (crc=<synthetic pointer>, fromto=<synthetic pointer>, this=0x0) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/include/buffer_raw.h:102
#9  ceph::buffer::list::crc32c (this=this@entry=0x5615360f7848, crc=crc@entry=0) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/common/buffer.cc:1924
#10 0x0000561528a54a42 in decode_message (cct=0x561532bde000, crcflags=3, header=..., footer=..., front=..., middle=..., data=..., conn=0x561533806800) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/Message.cc:297
#11 0x0000561528c704cb in ProtocolV2::handle_message_footer (this=0x5615360f7600, buffer=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/ProtocolV2.cc:915
#12 0x0000561528c68bdd in operator() (r=<optimized out>, buffer=<optimized out>, __closure=0x561533806bb8) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/ProtocolV2.cc:430
#13 std::_Function_handler<void(char*, long int), ProtocolV2::read(CtFun<ProtocolV2, char*, int>*, int, char*)::<lambda(char*, int)> >::_M_invoke(const std::_Any_data &, <unknown type in /usr/lib/debug/usr/bin/ceph-osd.debug, CU 0xc547871, DIE 0xc6208cd>, <unknown type in /usr/lib/debug/usr/bin/ceph-osd.debug, CU 0xc547871, DIE 0xc6208de>) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#14 0x0000561528c4b2ec in operator() (__args#1=<optimized out>, __args#0=<optimized out>, this=0x561533806bb8) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:706
#15 AsyncConnection::process (this=0x561533806800) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/AsyncConnection.cc:429
#16 0x0000561528ac44d5 in EventCenter::process_events (this=this@entry=0x5615336c1080, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f6ea2daab60) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/Event.cc:415
#17 0x0000561528ac9c37 in operator() (__closure=0x5615336ae518) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/Stack.cc:53
#18 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#19 0x0000561528fcf42f in execute_native_thread_routine ()
#20 0x00007f6ea8875e25 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f6ea773ebad in clone () from /lib64/libc.so.6
/a/sage-2019-01-13_22:11:18-rados-wip-sage-testing-2019-01-13-0915-distro-basic-smithi/3458928
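The backtrace shows get_crc called with this=0x0 (frame #8) and the spinlock member accessed at this=0x50 (frames #4-#6). Below is a minimal sketch of that mechanism, using simplified stand-in types rather than the real Ceph declarations: bufferlist::crc32c() asks each node's underlying buffer::raw for a cached CRC, and that cache lookup takes a spinlock living inside raw, so a null _raw pointer turns into a load from a small absolute address (the member's offset, 0x50 in the real layout).

// Sketch only: simplified stand-ins for ceph::buffer types, not the
// real declarations from buffer_raw.h / buffer.cc.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <utility>

struct raw_sketch {                      // stand-in for ceph::buffer::raw
  std::pair<size_t, size_t> last_crc_offset{SIZE_MAX, SIZE_MAX};
  std::pair<uint32_t, uint32_t> last_crc_val{0, 0};
  std::atomic_flag crc_spinlock = ATOMIC_FLAG_INIT;

  bool get_crc(std::pair<size_t, size_t> fromto, uint32_t *crc) {
    // With this == nullptr, this test_and_set touches the member's
    // offset as an absolute address -- the this=0x50 in frame #4.
    while (crc_spinlock.test_and_set(std::memory_order_acquire)) {}
    bool hit = (last_crc_offset == fromto);
    if (hit) *crc = last_crc_val.second;
    crc_spinlock.clear(std::memory_order_release);
    return hit;
  }
};

struct ptr_sketch {                      // stand-in for ceph::buffer::ptr
  raw_sketch *_raw;                      // 0x0 in the crashed bufferlist
  uint32_t _off, _len;
};

uint32_t crc32c_node(const ptr_sketch &p, uint32_t crc) {
  uint32_t cached = 0;
  // The real loop asks the node's raw for a cached CRC first; with
  // p._raw == nullptr this call is the segfault in frame #8.
  if (p._raw->get_crc({p._off, p._off + p._len}, &cached))
    return cached;
  return crc;                            // (real code would compute here)
}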
Updated by Radoslaw Zarzynski over 5 years ago
bt full
...
#8  get_crc (crc=<synthetic pointer>, fromto=<synthetic pointer>, this=0x0) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/include/buffer_raw.h:102
No locals.
#9  ceph::buffer::list::crc32c (this=this@entry=0x5615360f7848, crc=crc@entry=0) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/common/buffer.cc:1924
        ccrc = <optimized out>
        __for_range = @0x5615360f7848: {_root = {next = 0x561547a4d7a8}, _tail = 0x561547a4d7a8, _size = 1}
        cache_misses = <optimized out>
        cache_hits = <optimized out>
        cache_adjusts = <optimized out>
...
(gdb) set print pretty on
(gdb) p *(ceph::bufferlist*)0x5615360f7848
$2 = {
  _buffers = {_root = {next = 0x561547a4d7a8}, _tail = 0x561547a4d7a8, _size = 1},
  _len = 659621,
  _memcopy_count = 0,
  append_buffer = {_raw = 0x0, _off = 0, _len = 0},
  last_p = {<ceph::buffer::list::iterator_impl<false>> = {bl = 0x5615360f7848, ls = 0x5615360f7848, p = {cur = 0x5615360f7848}, off = 0, p_off = 0}, <No data fields>},
  static CLAIM_DEFAULT = 0,
  static CLAIM_ALLOW_NONSHAREABLE = 1
}
...
(gdb) p *(ceph::buffer::ptr_node *) 0x561547a4d7a8
$6 = {
  <ceph::buffer::ptr_hook> = {next = 0x5615360f7848},
  <ceph::buffer::ptr> = {_raw = 0x0, _off = 0, _len = 659621},
  <No data fields>
}
Updated by Radoslaw Zarzynski over 5 years ago
(gdb) p ((ProtocolV2*)0x5615360f7600)->current_header
$15 = {
  seq = {v = 3522},
  tid = {v = 0},
  type = {v = 111},
  priority = {v = 127},
  version = {v = 2},
  front_len = {v = 659621},
  middle_len = {v = 0},
  data_len = {v = 0},
  data_off = {v = 0},
  src = {type = 4 '\004', num = {v = 6}},
  compat_version = {v = 1},
  reserved = {v = 0},
  crc = {v = 1449709365}
}
#define MSG_OSD_EC_READ_REPLY 111
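For orientation, a rough sketch (field names simplified from ceph_msg_header / ceph_msg_footer, not the exact wire structs) of what handle_message_footer was about to do: the header announced front_len = 659621 for a type-111 message, and decode_message() then checks the CRC it computed over the front bufferlist against the footer's front_crc. The header itself decoded fine; the crash happens while walking the front list, before the comparison.

#include <cstdint>

struct header_sketch {
  uint32_t front_len, middle_len, data_len;  // 659621, 0, 0 here
  uint16_t type;                             // 111 == MSG_OSD_EC_READ_REPLY
  uint32_t crc;                              // header crc (1449709365)
};

struct footer_sketch {
  uint32_t front_crc, middle_crc, data_crc;
};

// computed_front_crc is what bufferlist::crc32c() should have produced.
bool front_crc_ok(const header_sketch &h, const footer_sketch &f,
                  uint32_t computed_front_crc) {
  return h.front_len == 0 || computed_front_crc == f.front_crc;
}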
Updated by Radoslaw Zarzynski over 5 years ago
Hmm, interesting. The same thread 0x7f6ea2dad700 is handling two instances of AsyncConnection: 0x5615360ef000
and the failing 0x561533806800. Both of them are processing a read with front blob size = 659621 bytes.
$ grep 7f6ea2dad700 ceph-osd.1.log
...
  -229> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.179:6818/33446,v1:172.21.15.179:6819/33446] conn(0x5615360ef000 msgr2=0x5615360f5200 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).process
  -227> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.179:6818/33446,v1:172.21.15.179:6819/33446] conn(0x5615360ef000 msgr2=0x5615360f5200 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=659621
  -214> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 msgr2=0x5615360f7600 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).process
  -212> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 msgr2=0x5615360f7600 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=659621
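A toy model (a simplification, not Ceph's actual Event.cc) of why one msgr-worker thread appears in both connections' logs: each worker runs an event loop that multiplexes every AsyncConnection assigned to it, so conn 0x5615360ef000 and conn 0x561533806800 are processed back to back on the same stack, interleaved in time but never concurrently with each other.

#include <functional>
#include <vector>

struct ReadyConn {
  std::function<void()> process;   // e.g. AsyncConnection::process()
};

void worker_loop(std::vector<ReadyConn> &ready) {
  // One thread, many connections: each ready connection's callback
  // runs to completion before the next one starts.
  for (auto &c : ready)
    c.process();
}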
Let's take a look at the front bufferlist of the first one (0x5615360ef000), which hasn't segfaulted:
(gdb) p (*(ProtocolV2*)0x5615360f5200).front
$8 = {
  _buffers = {_root = {next = 0x5615464d5568}, _tail = 0x5615464d5568, _size = 1},
  _len = 659621,
  _memcopy_count = 0,
  append_buffer = {_raw = 0x0, _off = 0, _len = 0},
  last_p = {<ceph::buffer::list::iterator_impl<false>> = {bl = 0x5615360f5448, ls = 0x5615360f5448, p = {cur = 0x5615360f5448}, off = 0, p_off = 0}, <No data fields>},
  static CLAIM_DEFAULT = 0,
  static CLAIM_ALLOW_NONSHAREABLE = 1
}
(gdb) p *(ceph::buffer::ptr_node *)0x5615464d5568
$9 = {
  <ceph::buffer::ptr_hook> = {next = 0x5615360f5448},
  <ceph::buffer::ptr> = {_raw = 0x5615464d5560, _off = 0, _len = 659621},
  <No data fields>
}
The address of the buffer::ptr_node is 0x5615464d5568, while its underlying buffer::raw is placed 8 bytes earlier (0x5615464d5560). This means the buffer is hypercombined.
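A simplified model of that layout, inferred from the gdb dumps rather than taken from the exact Ceph declarations: the ptr_node lives in the raw's inline bptr_storage, directly after the vtable pointer, so on LP64 the owning raw sits exactly 8 bytes before the node.

#include <cstddef>
#include <cstdint>

struct node_sketch {                // stand-in for ceph::buffer::ptr_node
  void *next;                       // ptr_hook
  void *_raw;
  uint32_t _off, _len;
};

struct hyper_raw_sketch {           // stand-in for a hypercombined raw
  void *vptr;                       // offset 0x0: vtable pointer
  node_sketch bptr_storage;         // offset 0x8: the embedded ptr_node
  // ... data, len, nref, crc cache, spinlock ...
};

// Recover the owning raw from a ptr_node address, as done by hand below
// (0x5615464d5568 - 8 = 0x5615464d5560).
hyper_raw_sketch *owning_raw(node_sketch *n) {
  return reinterpret_cast<hyper_raw_sketch *>(
      reinterpret_cast<char *>(n) - offsetof(hyper_raw_sketch, bptr_storage));
}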
(gdb) p *(ceph::buffer::raw *)0x5615464d5560
$16 = {
  _vptr.raw = 0x56152951d350 <vtable for ceph::buffer::raw_posix_aligned+16>,
  bptr_storage = {__data = "HT\017\066\025V\000\000`UMF\025V\000\000\000\000\000\000\245\020\n", __align = {<No data fields>}},
  data = 0x561551b80000 "\001\001\022",
  len = 659621,
  nref = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 1}, members of std::atomic<unsigned int>: static is_always_lock_free = true},
  mempool = 10,
  last_crc_offset = {first = 18446744073709551615, second = 18446744073709551615},
  last_crc_val = {first = 0, second = 0},
  crc_spinlock = {af = {<std::__atomic_flag_base> = {_M_i = false}, <No data fields>}}
}
Let's assume hypercombining also happened for the damaged bufferlist and try the same cast (0x561547a4d7a8 - 8 = 0x561547a4d7a0):
(gdb) p *(ceph::buffer::raw *) 0x561547a4d7a0
$17 = {
  _vptr.raw = 0x56152951d350 <vtable for ceph::buffer::raw_posix_aligned+16>,
  bptr_storage = {__data = "Hx\017\066\025V", '\000' <repeats 14 times>, "\245\020\n", __align = {<No data fields>}},
  data = 0x561551270000 "\001\001\022",
  len = 659621,
  nref = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 1}, members of std::atomic<unsigned int>: static is_always_lock_free = true},
  mempool = 10,
  last_crc_offset = {first = 18446744073709551615, second = 18446744073709551615},
  last_crc_val = {first = 0, second = 0},
  crc_spinlock = {af = {<std::__atomic_flag_base> = {_M_i = false}, <No data fields>}}
}
Huh, this looks quite reasonable. Moreover, nref == 1.
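So the raw appears healthy (matching len, live nref) while the node's _raw pointer is gone. Purely as a hypothesis -- the ticket never pins down the cause -- one way a still-linked ptr_node can end up with _raw == 0x0 while its backing raw stays alive with nref == 1 is a move/claim that nulls the source's raw pointer without unlinking the node:

#include <utility>

struct node { node *next; void *_raw; unsigned _off, _len; };

// Hypothetical bug pattern, not a function from the Ceph tree.
void claim_contents(node &src, node &dst) {
  dst._off = src._off;
  dst._len = src._len;
  dst._raw = std::exchange(src._raw, nullptr);  // src now has _raw == 0x0
  // If src stays reachable via the list's _root/_tail, a later
  // bufferlist::crc32c() walk dereferences its null _raw.
}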
Updated by Radoslaw Zarzynski over 5 years ago
Putting this on the shelf for the sake of the msgr V2 work.
Runs on wip-bug-37910 with client failures:
Updated by Sage Weil over 5 years ago
- Related to Bug #38023: segv on FileJournal::prepare_entry in bufferlist added
Updated by Sage Weil about 5 years ago
- Status changed from 12 to Resolved