Bug #37910

segv during crc of incoming message front

Added by Sage Weil over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  -207> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).handle_message_front r=0
  -206> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).handle_message_front got front 659621
  -205> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).read_message_middle
  -204> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).read_message_data_prepare
  -203> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).read_message_data msg_left=0
  -202> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_MESSAGE_FRONT pgs=165 cs=23 l=0).read_message_footer
  -201> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 msgr2=0x5615360f7600 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=21
  -200> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_FOOTER_AND_DISPATCH pgs=165 cs=23 l=0).handle_message_footer r=0
  -199> 2019-01-14 00:29:22.245 7f6ea2dad700 10 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_FOOTER_AND_DISPATCH pgs=165 cs=23 l=0).handle_message_footer aborted = 0
  -198> 2019-01-14 00:29:22.245 7f6ea2dad700 20 --2- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 0x5615360f7600 :-1 s=READ_FOOTER_AND_DISPATCH pgs=165 cs=23 l=0).handle_message_footer got 659621 + 0 + 0 byte message
    -8> 2019-01-14 00:29:22.250 7f6ea2dad700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f6ea2dad700 thread_name:msgr-worker-2

 ceph version 14.0.1-2510-g74ba84a (74ba84a6b0855849aee1d2fd3678e94a657542ce) nautilus (dev)
 1: (()+0xf6d0) [0x7f6ea887d6d0]
 2: (ceph::buffer::list::crc32c(unsigned int) const+0x6b) [0x561528ae4c3b]
 3: (decode_message(CephContext*, int, ceph_msg_header&, ceph_msg_footer&, ceph::buffer::list&, ceph::buffer::list&, ceph::buffer::list&, Connection*)+0x2f2) [0x561528a54a42]
 4: (ProtocolV2::handle_message_footer(char*, int)+0xeb) [0x561528c704cb]
 5: (()+0xf15bdd) [0x561528c68bdd]
 6: (AsyncConnection::process()+0x18c) [0x561528c4b2ec]
 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1585) [0x561528ac44d5]

(gdb) bt
#0  0x00007f6ea887d59b in raise () from /lib64/libpthread.so.0
#1  0x00005615288b4f65 in reraise_fatal (signum=11) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/global/signal_handler.cc:298
#3  <signal handler called>
#4  0x0000561528ae4c3b in test_and_set (__m=std::memory_order_acquire, this=0x50) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/atomic_base.h:176
#5  spin_lock (lock=...) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/include/spinlock.h:50
#6  lock (this=0x50) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/include/spinlock.h:39
#7  lock_guard (__m=..., this=<synthetic pointer>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:162
#8  get_crc (crc=<synthetic pointer>, fromto=<synthetic pointer>, this=0x0) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/include/buffer_raw.h:102
#9  ceph::buffer::list::crc32c (this=this@entry=0x5615360f7848, crc=crc@entry=0) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/common/buffer.cc:1924
#10 0x0000561528a54a42 in decode_message (cct=0x561532bde000, crcflags=3, header=..., footer=..., front=..., middle=..., data=..., conn=0x561533806800) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/Message.cc:297
#11 0x0000561528c704cb in ProtocolV2::handle_message_footer (this=0x5615360f7600, buffer=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/ProtocolV2.cc:915
#12 0x0000561528c68bdd in operator() (r=<optimized out>, buffer=<optimized out>, __closure=0x561533806bb8) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/ProtocolV2.cc:430
#13 std::_Function_handler<void(char*, long int), ProtocolV2::read(CtFun<ProtocolV2, char*, int>*, int, char*)::<lambda(char*, int)> >::_M_invoke(const std::_Any_data &, <unknown type in /usr/lib/debug/usr/bin/ceph-osd.debug, CU 0xc547871, DIE 0xc6208cd>, <unknown type in /usr/lib/debug/usr/bin/ceph-osd.debug, CU 0xc547871, DIE 0xc6208de>) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#14 0x0000561528c4b2ec in operator() (__args#1=<optimized out>, __args#0=<optimized out>, this=0x561533806bb8) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:706
#15 AsyncConnection::process (this=0x561533806800) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/AsyncConnection.cc:429
#16 0x0000561528ac44d5 in EventCenter::process_events (this=this@entry=0x5615336c1080, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f6ea2daab60) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/Event.cc:415
#17 0x0000561528ac9c37 in operator() (__closure=0x5615336ae518) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/msg/async/Stack.cc:53
#18 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#19 0x0000561528fcf42f in execute_native_thread_routine ()
#20 0x00007f6ea8875e25 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f6ea773ebad in clone () from /lib64/libc.so.6

/a/sage-2019-01-13_22:11:18-rados-wip-sage-testing-2019-01-13-0915-distro-basic-smithi/3458928
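
Reading frames #4 through #10 together: decode_message() asks the front bufferlist for its CRC, bufferlist::crc32c() walks the list's ptr nodes, and for each node it consults the underlying buffer::raw's cached CRC via get_crc(), which first takes the raw's crc_spinlock. In this crash the raw pointer is null (this=0x0 in frame #8), so the spinlock's atomic_flag resolves to the bare member offset 0x50 (frames #4 and #6) and test_and_set() faults. Below is a minimal, self-contained model of that mechanism; the types are simplified stand-ins, not Ceph's actual buffer classes, and only the names taken from the backtrace and dumps (get_crc, crc_spinlock, last_crc_offset, _raw, _len) are real.

// Simplified, self-contained model of the faulting path; stand-in types,
// not Ceph's actual buffer classes.
#include <atomic>
#include <cstdint>
#include <utility>

struct raw {                                  // stand-in for ceph::buffer::raw
  char* data = nullptr;
  unsigned len = 0;
  std::pair<uint64_t, uint64_t> last_crc_offset{~0ull, ~0ull};
  std::pair<uint32_t, uint32_t> last_crc_val{0, 0};
  std::atomic_flag crc_spinlock = ATOMIC_FLAG_INIT;

  bool get_crc(const std::pair<uint64_t, uint64_t>& fromto,
               std::pair<uint32_t, uint32_t>* crc) {
    // the cached-CRC lookup takes the raw's spinlock first; if `this` is null,
    // the flag's address collapses to the member offset (0x50 in the real
    // buffer::raw, per frames #4/#6) and test_and_set() touches near-null memory
    while (crc_spinlock.test_and_set(std::memory_order_acquire)) {}
    const bool hit = (fromto == last_crc_offset);
    if (hit) *crc = last_crc_val;
    crc_spinlock.clear(std::memory_order_release);
    return hit;
  }
};

struct ptr_node {                             // stand-in for ceph::buffer::ptr_node
  raw* _raw = nullptr;                        // 0x0 in the dump in comment #1
  unsigned _off = 0;
  unsigned _len = 0;
};

// rough stand-in for the per-node step inside buffer::list::crc32c() (frame #9)
uint32_t node_crc32c(const ptr_node& node, uint32_t crc) {
  std::pair<uint32_t, uint32_t> ccrc;
  std::pair<uint64_t, uint64_t> ofs{node._off, node._off + node._len};
  if (node._raw->get_crc(ofs, &ccrc))         // null _raw => SIGSEGV inside get_crc
    return ccrc.second;
  /* on a cache miss the CRC would be computed over the buffer contents */
  return crc;
}

int main() {
  ptr_node bad;                               // _raw == nullptr, _len as in the log
  bad._len = 659621;
  return node_crc32c(bad, 0);                 // crashes like frame #4 above
}

Comment #1 below confirms the precondition: the front bufferlist's single ptr_node reports _raw = 0x0 while _len = 659621.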

Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #38023: segv on FileJournal::prepare_entry in bufferlist (Closed, Radoslaw Zarzynski, 01/23/2019)

Updated by Radoslaw Zarzynski over 5 years ago (#1)

bt full
...
#8  get_crc (crc=<synthetic pointer>, fromto=<synthetic pointer>, this=0x0) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/include/buffer_raw.h:102
No locals.
#9  ceph::buffer::list::crc32c (this=this@entry=0x5615360f7848, crc=crc@entry=0) at /usr/src/debug/ceph-14.0.1-2510-g74ba84a/src/common/buffer.cc:1924
        ccrc = <optimized out>
        __for_range = @0x5615360f7848: {_root = {next = 0x561547a4d7a8}, _tail = 0x561547a4d7a8, _size = 1}
        cache_misses = <optimized out>
        cache_hits = <optimized out>
        cache_adjusts = <optimized out>
...
(gdb) set print pretty on
(gdb) p *(ceph::bufferlist*)0x5615360f7848
$2 = {
  _buffers = {
    _root = {
      next = 0x561547a4d7a8
    }, 
    _tail = 0x561547a4d7a8, 
    _size = 1
  }, 
  _len = 659621, 
  _memcopy_count = 0, 
  append_buffer = {
    _raw = 0x0, 
    _off = 0, 
    _len = 0
  }, 
  last_p = {
    <ceph::buffer::list::iterator_impl<false>> = {
      bl = 0x5615360f7848, 
      ls = 0x5615360f7848, 
      p = {
        cur = 0x5615360f7848
      }, 
      off = 0, 
      p_off = 0
    }, <No data fields>}, 
  static CLAIM_DEFAULT = 0, 
  static CLAIM_ALLOW_NONSHAREABLE = 1
}
...
(gdb) p *(ceph::buffer::ptr_node *) 0x561547a4d7a8
$6 = {
  <ceph::buffer::ptr_hook> = {
    next = 0x5615360f7848
  }, 
  <ceph::buffer::ptr> = {
    _raw = 0x0, 
    _off = 0, 
    _len = 659621
  }, <No data fields>}

Updated by Radoslaw Zarzynski over 5 years ago (#2)

(gdb) p ((ProtocolV2*)0x5615360f7600)->current_header
$15 = {
  seq = {
    v = 3522
  }, 
  tid = {
    v = 0
  }, 
  type = {
    v = 111
  }, 
  priority = {
    v = 127
  }, 
  version = {
    v = 2
  }, 
  front_len = {
    v = 659621
  }, 
  middle_len = {
    v = 0
  }, 
  data_len = {
    v = 0
  }, 
  data_off = {
    v = 0
  }, 
  src = {
    type = 4 '\004', 
    num = {
      v = 6
    }
  }, 
  compat_version = {
    v = 1
  }, 
  reserved = {
    v = 0
  }, 
  crc = {
    v = 1449709365
  }
}

For reference, message type 111 is:

#define MSG_OSD_EC_READ_REPLY  111

Updated by Radoslaw Zarzynski over 5 years ago (#3)

Hmm, interesting. The same thread 0x7f6ea2dad700 is handling two instances of AsyncConnection: 0x5615360ef000
and the failing 0x561533806800. Both of them are processing a read with front blob size = 659621 bytes.

$ grep 7f6ea2dad700 ceph-osd.1.log
...
  -229> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.179:6818/33446,v1:172.21.15.179:6819/33446] conn(0x5615360ef000 msgr2=0x5615360f5200 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).process
  -227> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.179:6818/33446,v1:172.21.15.179:6819/33446] conn(0x5615360ef000 msgr2=0x5615360f5200 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=659621
  -214> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 msgr2=0x5615360f7600 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).process
  -212> 2019-01-14 00:29:22.245 7f6ea2dad700 20 -- [v2:172.21.15.62:6802/33804,v1:172.21.15.62:6803/33804] >> [v2:172.21.15.26:6828/34392,v1:172.21.15.26:6829/34392] conn(0x561533806800 msgr2=0x5615360f7600 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=659621

Let's take a look at the front bufferlist of the first one (0x5615360ef000), which hasn't segfaulted:

(gdb) p (*(ProtocolV2*)0x5615360f5200).front
$8 = {
  _buffers = {
    _root = {
      next = 0x5615464d5568
    }, 
    _tail = 0x5615464d5568, 
    _size = 1
  }, 
  _len = 659621, 
  _memcopy_count = 0, 
  append_buffer = {
    _raw = 0x0, 
    _off = 0, 
    _len = 0
  }, 
  last_p = {
    <ceph::buffer::list::iterator_impl<false>> = {
      bl = 0x5615360f5448, 
      ls = 0x5615360f5448, 
      p = {
        cur = 0x5615360f5448
      }, 
      off = 0, 
      p_off = 0
    }, <No data fields>}, 
  static CLAIM_DEFAULT = 0, 
  static CLAIM_ALLOW_NONSHAREABLE = 1
}
(gdb) p *(ceph::buffer::ptr_node *)0x5615464d5568
$9 = {
  <ceph::buffer::ptr_hook> = {
    next = 0x5615360f5448
  }, 
  <ceph::buffer::ptr> = {
    _raw = 0x5615464d5560, 
    _off = 0, 
    _len = 659621
  }, <No data fields>}

The address of the buffer::ptr_node is 0x5615464d5568 while its underlying buffer::raw is located 8 bytes earlier (0x5615464d5560). This means the buffer is hypercombined (the ptr_node and its raw share a single allocation).
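
A minimal, self-contained illustration of that layout follows; the types are simplified stand-ins (not Ceph's actual ptr_node/raw definitions), but they show why a combined allocation puts the node exactly sizeof(void*) = 8 bytes after the raw on x86-64: the raw's vtable pointer occupies the first 8 bytes and the node is constructed in the storage that immediately follows it (bptr_storage in the dumps).

// Self-contained sketch of the "hypercombined" layout; simplified stand-in
// types, not Ceph's actual buffer definitions.
#include <cstdio>
#include <new>

struct node {                 // stand-in for ceph::buffer::ptr_node
  void* next;                 // ptr_hook::next
  void* raw;                  // ptr::_raw
  unsigned off;
  unsigned len;
};

struct raw {                  // stand-in for ceph::buffer::raw
  virtual ~raw() = default;   // forces a vtable pointer: 8 bytes on x86-64
  alignas(node) unsigned char bptr_storage[sizeof(node)];
  char* data = nullptr;
  unsigned len = 0;
};

int main() {
  raw* r = new raw;
  // "hypercombined": the ptr_node is placement-new'd inside the raw's storage
  node* n = new (r->bptr_storage) node{nullptr, r, 0, 659621u};
  std::printf("raw=%p node=%p node-raw=%td\n",
              static_cast<void*>(r), static_cast<void*>(n),
              reinterpret_cast<char*>(n) - reinterpret_cast<char*>(r));
  // prints node-raw=8 with the usual x86-64 ABI, matching
  // 0x5615464d5568 (ptr_node) - 0x5615464d5560 (raw) above
  delete r;
  return 0;
}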

(gdb) p *(ceph::buffer::raw *)0x5615464d5560
$16 = {
  _vptr.raw = 0x56152951d350 <vtable for ceph::buffer::raw_posix_aligned+16>, 
  bptr_storage = {
    __data = "HT\017\066\025V\000\000`UMF\025V\000\000\000\000\000\000\245\020\n", 
    __align = {<No data fields>}
  }, 
  data = 0x561551b80000 "\001\001\022", 
  len = 659621, 
  nref = {
    <std::__atomic_base<unsigned int>> = {
      static _S_alignment = 4, 
      _M_i = 1
    }, 
    members of std::atomic<unsigned int>: 
    static is_always_lock_free = true
  }, 
  mempool = 10, 
  last_crc_offset = {
    first = 18446744073709551615, 
    second = 18446744073709551615
  }, 
  last_crc_val = {
    first = 0, 
    second = 0
  }, 
  crc_spinlock = {
    af = {
      <std::__atomic_flag_base> = {
        _M_i = false
      }, <No data fields>}
  }
}

Let's assume hypercombining also happened for the damaged bufferlist and try the same cast (0x561547a4d7a8 - 8 = 0x561547a4d7a0):

(gdb) p *(ceph::buffer::raw *) 0x561547a4d7a0
$17 = {
  _vptr.raw = 0x56152951d350 <vtable for ceph::buffer::raw_posix_aligned+16>, 
  bptr_storage = {
    __data = "Hx\017\066\025V", '\000' <repeats 14 times>, "\245\020\n", 
    __align = {<No data fields>}
  }, 
  data = 0x561551270000 "\001\001\022", 
  len = 659621, 
  nref = {
    <std::__atomic_base<unsigned int>> = {
      static _S_alignment = 4, 
      _M_i = 1
    }, 
    members of std::atomic<unsigned int>: 
    static is_always_lock_free = true
  }, 
  mempool = 10, 
  last_crc_offset = {
    first = 18446744073709551615, 
    second = 18446744073709551615
  }, 
  last_crc_val = {
    first = 0, 
    second = 0
  }, 
  crc_spinlock = {
    af = {
      <std::__atomic_flag_base> = {
        _M_i = false
      }, <No data fields>}
  }
}

Huh, this looks quite reasonable. Moreover, nref == 1. In other words, the underlying raw for the damaged node appears intact (same length, valid-looking data pointer, nref == 1), yet the ptr_node's _raw pointer has been zeroed.

Updated by Sage Weil over 5 years ago (#5)

  • Related to Bug #38023: segv on FileJournal::prepare_entry in bufferlist added

Updated by Sage Weil about 5 years ago (#6)

  • Status changed from 12 to Resolved