Project

General

Profile

Bug #38345

mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup

Added by Sage Weil over 1 year ago. Updated 5 months ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature:

Description

2019-02-15 09:48:45.381 7f30cf9fd700  1 -- v1:172.21.15.148:6789/0 <== mon.1 v1:172.21.15.102:6789/0 2 ==== election(8ec471df-d03c-4d8b-9fa6-53fd073b70cc propose 11) v7 ==== 435+0+0 (539273810 0 0) 0x3cd5e40 con 0x437cd80
...
2019-02-15 09:48:45.389 7f30cf9fd700 20 -- v1:172.21.15.148:6789/0 done calling dispatch on 0x3cd5e40
...
2019-02-15 09:49:57.145 7f30d3204700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f30d3204700 thread_name:OpHistorySvc

 ceph version 14.0.1-3738-gca5ec13 (ca5ec139e4b98becd397a459018a659a332bc291) nautilus (dev)
 1: (()+0x11390) [0x7f30db1ad390]
 2: (RefCountedObject::put() const+0x44) [0x6d6014]
 3: (MonOpRequest::~MonOpRequest()+0x43) [0x6d7663]
 4: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >)+0xa6) [0x7f30dc4421d6]
 5: (OpHistory::cleanup(utime_t)+0x31d) [0x7f30dc43c66d]
 6: (OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>)+0x276) [0x7f30dc43d8e6]
 7: (OpHistoryServiceThread::entry()+0xf9) [0x7f30dc43df49]
 8: (()+0x76ba) [0x7f30db1a36ba]
 9: (clone()+0x6d) [0x7f30da9cc41d]

(gdb) bt
#0  0x00007f30db1ad269 in raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/pt-raise.c:35
#1  0x00000000008b7473 in reraise_fatal (signum=11) at /build/ceph-14.0.1-3738-gca5ec13/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /build/ceph-14.0.1-3738-gca5ec13/src/global/signal_handler.cc:298
#3  <signal handler called>
#4  RefCountedObject::put (this=0x428eb40) at /build/ceph-14.0.1-3738-gca5ec13/src/common/RefCountedObj.h:58
#5  0x00000000006d7663 in intrusive_ptr_release (p=<optimized out>) at /build/ceph-14.0.1-3738-gca5ec13/src/common/RefCountedObj.h:174
#6  boost::intrusive_ptr<RefCountedObject>::~intrusive_ptr (this=0x2e88740, __in_chrg=<optimized out>) at /build/ceph-14.0.1-3738-gca5ec13/obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#7  MonOpRequest::~MonOpRequest (this=0x2e88670, __in_chrg=<optimized out>) at /build/ceph-14.0.1-3738-gca5ec13/src/mon/MonOpRequest.h:127
#8  MonOpRequest::~MonOpRequest (this=0x2e88670, __in_chrg=<optimized out>) at /build/ceph-14.0.1-3738-gca5ec13/src/mon/MonOpRequest.h:129
#9  0x00007f30dc4421d6 in std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >) () from /usr/lib/ceph/libceph-common.so.1
#10 0x00007f30dc43c66d in OpHistory::cleanup(utime_t) () from /usr/lib/ceph/libceph-common.so.1
#11 0x00007f30dc43d8e6 in OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>) () from /usr/lib/ceph/libceph-common.so.1
#12 0x00007f30dc43df49 in OpHistoryServiceThread::entry() () from /usr/lib/ceph/libceph-common.so.1
#13 0x00007f30db1a36ba in start_thread (arg=0x7f30d3204700) at pthread_create.c:333
#14 0x00007f30da9cc41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

/a/sage-2019-02-15_04:19:44-upgrade:mimic-x-wip-mimic-upgrade-distro-basic-smithi/3592782

Related issues

Related to RADOS - Bug #38306: ceph-mon: "*** Caught signal (Segmentation fault) **" in upgrade:luminous-x-mimic New 02/13/2019
Related to RADOS - Bug #40367: "*** Caught signal (Segmentation fault) **" in upgrade:luminous-x-nautilus New
Copied to RADOS - Backport #43879: nautilus: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup Resolved
Copied to RADOS - Backport #43880: luminous: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup In Progress
Copied to RADOS - Backport #43881: mimic: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup Resolved

History

#1 Updated by Sage Weil over 1 year ago

  • Subject changed from mon: segv in to mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup
  • Priority changed from High to Urgent

/a/sage-2019-02-15_20:30:53-upgrade:mimic-x-wip-mimic-upgrade-distro-basic-smithi/3595562

#2 Updated by Greg Farnum over 1 year ago

  • Related to Bug #38306: ceph-mon: "*** Caught signal (Segmentation fault) **" in upgrade:luminous-x-mimic added

#3 Updated by Joao Eduardo Luis over 1 year ago

  • Category set to Correctness/Safety
  • Assignee set to Joao Eduardo Luis
  • Component(RADOS) Monitor added

#4 Updated by Sage Weil over 1 year ago

Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 fuse-libs-2.9.2-10.el7.x86_64 glibc-2.17-222.el7.x86_64 gperftools-libs-2.6.1-1.el7.x86_64 leveldb-1.12.0-11.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-52.el7.x86_64 libcap-2.22-9.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libibverbs-15-7.el7_5.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-15-7.el7_5.x86_64 libstdc++-4.8.5-36.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 lttng-ust-2.4.1-4.el7.x86_64 lz4-1.7.5-2.el7.x86_64 nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-5.el7_5.x86_64 nss-softokn-3.36.0-5.el7_5.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.el7_5.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 snappy-1.1.0-3.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 systemd-libs-219-57.el7.x86_64 userspace-rcu-0.7.16-1.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fa5323ae59b in raise () from /lib64/libpthread.so.0
#1  0x000056198d2e04e5 in reraise_fatal (signum=11) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/global/signal_handler.cc:298
#3  <signal handler called>
#4  0x00007fa5323a8c80 in pthread_mutex_lock () from /lib64/libpthread.so.0
#5  0x00007fa535ade441 in __gthread_mutex_lock (__mutex=0x10000000100038) at /opt/rh/devtoolset-7/root/usr/include/c++/7/x86_64-redhat-linux/bits/gthr-default.h:748
#6  lock (this=0x10000000100038) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:103
#7  lock (this=0x7fa52b007410) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:267
#8  unique_lock (__m=..., this=0x7fa52b007410) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:197
#9  ceph::logging::Log::submit_entry(ceph::logging::Entry&&) (this=0x10000000100000, e=e@entry=<unknown type in /usr/lib/debug/usr/lib64/ceph/libceph-common.so.0.debug, CU 0x4530b47, DIE 0x459b6ad>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/log/Log.cc:177
#10 0x000056198d105fa2 in RefCountedObject::put (this=0x56198f7b7880) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/RefCountedObj.h:60
#11 0x000056198d107399 in intrusive_ptr_release (p=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/RefCountedObj.h:174
#12 ~intrusive_ptr (this=0x561990dc7ec8, __in_chrg=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#13 ~MonOpRequest (this=0x561990dc7e10, __in_chrg=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/mon/MonOpRequest.h:127
#14 MonOpRequest::~MonOpRequest (this=0x561990dc7e10, __in_chrg=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/mon/MonOpRequest.h:129
#15 0x00007fa535841ab6 in put (this=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.h:308
#16 intrusive_ptr_release (o=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.h:394
#17 ~intrusive_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#18 ~pair (this=<optimized out>, __in_chrg=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_pair.h:198
#19 destroy<std::pair<double, boost::intrusive_ptr<TrackedOp> > > (this=<optimized out>, __p=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/ext/new_allocator.h:140
#20 destroy<std::pair<double, boost::intrusive_ptr<TrackedOp> > > (__a=..., __p=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/alloc_traits.h:487
#21 _M_destroy_node (this=0x561990bf3268, __p=0x561990b28ab0) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_tree.h:650
#22 _M_drop_node (this=0x561990bf3268, __p=0x561990b28ab0) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_tree.h:658
#23 std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux (this=this@entry=0x561990bf3268, __position=...) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_tree.h:2477
#24 0x00007fa53583b944 in erase (__position=..., this=0x561990bf3268) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_tree.h:1113
#25 erase (__position=..., this=0x561990bf3268) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_set.h:645
#26 OpHistory::cleanup (this=this@entry=0x561990bf3238, now=...) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.cc:101
#27 0x00007fa53583cd9d in OpHistory::_insert_delayed (this=0x561990bf3238, now=..., op=...) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.cc:83
#28 0x00007fa53583d378 in OpHistoryServiceThread::entry (this=0x561990bf3308) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.cc:54
#29 0x00007fa5323a6e25 in start_thread () from /lib64/libpthread.so.0
#30 0x00007fa53126fbad in clone () from /lib64/libc.so.6

/a/sage-2019-02-03_18:58:17-rados-wip-sage2-testing-2019-02-03-1047-distro-basic-smithi/3545666
core and log in top level dir

#5 Updated by Sage Weil over 1 year ago

(gdb) do
#10 0x000056198d105fa2 in RefCountedObject::put (this=0x56198f7b7880) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/RefCountedObj.h:60
60                                         << dendl;
(gdb) list
55          CephContext *local_cct = cct;
56          int v = --nref;
57          if (local_cct)
58            lsubdout(local_cct, refs, 1) << "RefCountedObject::put " << this << " " 
59                                         << (v + 1) << " -> " << v
60                                         << dendl;
61          if (v == 0) {
62            ANNOTATE_HAPPENS_AFTER(&nref);
63            ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(&nref);
64            delete this;
(gdb) p cct
$16 = (CephContext *) 0x56198fae0000
(gdb) p local_cct
$17 = (CephContext *) 0x5619a1e93540
(gdb) p nref
$18 = {
  <std::__atomic_base<unsigned long>> = {
    static _S_alignment = 8, 
    _M_i = 4
  }, 
  members of std::atomic<unsigned long>: 
  static is_always_lock_free = true
}
(gdb) p cct->_log
$19 = (ceph::logging::Log *) 0x56198f7b8780
(gdb) p local_cct->_log
$20 = (ceph::logging::Log *) 0x10000000100000

something mangled cct before the old object was destroyed?

(this->cct and nref are potentially meaningless since we may have freed the object and had someone else allocate underneath us?)

#6 Updated by Sage Weil over 1 year ago

Forgot mention, the op appears to be an MForward.

#7 Updated by Greg Farnum about 1 year ago

  • Related to Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics added

#8 Updated by Greg Farnum about 1 year ago

We're also seeing Bus Errors instead of segfaults in the OpHistory cleanup at #24664 so these may be related...

#9 Updated by Kefu Chai about 1 year ago

 ceph version 13.2.5-268-g0d5c736 (0d5c736349288b66b3b12c1df49568d1ed29f90d) mimic (stable)
 1: (()+0xf5d0) [0x7f11bd74d5d0]
 2: (intrusive_ptr_release(RefCountedObject const*)+0x19f) [0x7f11be53ec2f]
 3: (MonOpRequest::~MonOpRequest()+0x49) [0x557d810f8ba9]
 4: (OpHistoryServiceThread::entry()+0x227) [0x7f11be41c0d7]
 5: (()+0x7dd5) [0x7f11bd745dd5]
 6: (clone()+0x6d) [0x7f11ba062ead]

/kchai-2019-05-04_11:38:20-rados-wip-ceph-release-distro-basic-smithi/3927368/

#10 Updated by Neha Ojha about 1 year ago

/a/nojha-2019-05-10_00:33:57-upgrade-wip-parial-recovery-2019-05-09-distro-basic-smithi/3943156/

#11 Updated by Greg Farnum 11 months ago

  • Assignee deleted (Joao Eduardo Luis)
  • Priority changed from Urgent to High

#12 Updated by Sage Weil 9 months ago

2019-10-06T07:00:58.981 INFO:tasks.ceph.mon.c.smithi112.stderr:*** Caught signal (Segmentation fault) **
2019-10-06T07:00:58.981 INFO:tasks.ceph.mon.c.smithi112.stderr: in thread 7f5f2325b700 thread_name:OpHistorySvc
2019-10-06T07:00:58.982 INFO:tasks.ceph.mon.c.smithi112.stderr: ceph version 13.2.6-466-g1720691 (17206912981387416f4e0759804d403ab5e4cf52) mimic (stable)
2019-10-06T07:00:58.982 INFO:tasks.ceph.mon.c.smithi112.stderr: 1: (()+0xf630) [0x7f5f2c8fc630]
2019-10-06T07:00:58.983 INFO:tasks.ceph.mon.c.smithi112.stderr: 2: (intrusive_ptr_release(RefCountedObject const*)+0x19f) [0x7f5f2d6f1e6f]
2019-10-06T07:00:58.983 INFO:tasks.ceph.mon.c.smithi112.stderr: 3: (MonOpRequest::~MonOpRequest()+0x49) [0x5644001b93a9]
2019-10-06T07:00:58.983 INFO:tasks.ceph.mon.c.smithi112.stderr: 4: (OpHistoryServiceThread::entry()+0x227) [0x7f5f2d5ce837]
2019-10-06T07:00:58.983 INFO:tasks.ceph.mon.c.smithi112.stderr: 5: (()+0x7ea5) [0x7f5f2c8f4ea5]
2019-10-06T07:00:58.984 INFO:tasks.ceph.mon.c.smithi112.stderr: 6: (clone()+0x6d) [0x7f5f294158cd]
2019-10-06T07:00:58.984 INFO:tasks.ceph.mon.c.smithi112.stderr:2019-10-06 07:00:58.989 7f5f2325b700 -1 *** Caught signal (Segmentation fault) **

/a/sage-2019-10-05_20:13:12-rados:upgrade-wip-sage2-testing-2019-10-04-1006-distro-basic-smithi/4362554
(mimic -> master(octopus))

#13 Updated by Sage Weil 8 months ago

  • Priority changed from High to Urgent
2019-11-08T03:51:49.994 INFO:tasks.ceph.mon.b.smithi151.stderr:*** Caught signal (Segmentation fault) **
2019-11-08T03:51:49.994 INFO:tasks.ceph.mon.b.smithi151.stderr: in thread 7f0e38f9f700 thread_name:OpHistorySvc
2019-11-08T03:51:50.049 INFO:tasks.ceph.mon.b.smithi151.stderr: ceph version 15.0.0-6959-g6efb5ba (6efb5bac3deee53b8b799cbac04ac70be6e20458) octopus (dev)
2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 1: (()+0xf5f0) [0x7f0e40de25f0]
2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 2: (RefCountedObject::put() const+0x3b) [0x7f0e430a389b]
2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 3: (MonOpRequest::~MonOpRequest()+0x49) [0x56319e6c7ac9]
2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 4: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >)+0xac) [0x7f0e430b9c0c]
2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 5: (OpHistory::cleanup(utime_t)+0x315) [0x7f0e430b4305]

/a/sage-2019-11-07_22:38:52-rados-wip-sage-testing-2019-11-07-1412-distro-basic-smithi/4480923

#14 Updated by Brad Hubbard 8 months ago

  • Related to deleted (Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics)

#15 Updated by Brad Hubbard 8 months ago

  • Assignee set to Brad Hubbard

#16 Updated by Neha Ojha 8 months ago

  • Related to Bug #40367: "*** Caught signal (Segmentation fault) **" in upgrade:luminous-x-nautilus added

#17 Updated by Sage Weil 7 months ago

  • Assignee deleted (Brad Hubbard)

/a/sage-2019-11-24_06:32:18-rados-wip-sage-testing-2019-11-23-2031-distro-basic-smithi/4538572

2019-11-24T09:03:07.263 INFO:tasks.ceph.mon.a.smithi097.stderr: ceph version 15.0.0-7623-g3f1e2d9 (3f1e2d974adaf8f1345a5ae935a58a97e43e1a04) octopus (dev)
2019-11-24T09:03:07.263 INFO:tasks.ceph.mon.a.smithi097.stderr: 1: (()+0x12890) [0x7f2d7955d890]
2019-11-24T09:03:07.263 INFO:tasks.ceph.mon.a.smithi097.stderr: 2: (RefCountedObject::put() const+0x44) [0x7f2d7a784964]
2019-11-24T09:03:07.264 INFO:tasks.ceph.mon.a.smithi097.stderr: 3: (MonOpRequest::~MonOpRequest()+0x43) [0x55787f4000d3]
2019-11-24T09:03:07.264 INFO:tasks.ceph.mon.a.smithi097.stderr: 4: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp
2019-11-24T09:03:07.264 INFO:tasks.ceph.mon.a.smithi097.stderr: 5: (OpHistory::cleanup(utime_t)+0x31e) [0x7f2d7a79719e]
2019-11-24T09:03:07.264 INFO:tasks.ceph.mon.a.smithi097.stderr: 6: (OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>)+0x24b) [0x7f2d7a79848b]
2019-11-24T09:03:07.265 INFO:tasks.ceph.mon.a.smithi097.stderr: 7: (OpHistoryServiceThread::entry()+0xf9) [0x7f2d7a798b29]
2019-11-24T09:03:07.265 INFO:tasks.ceph.mon.a.smithi097.stderr: 8: (()+0x76db) [0x7f2d795526db]
2019-11-24T09:03:07.265 INFO:tasks.ceph.mon.a.smithi097.stderr: 9: (clone()+0x3f) [0x7f2d7873888f]

i don't really understand why we see this with upgrade tests but not normally...

#18 Updated by Brad Hubbard 7 months ago

  • Assignee set to Brad Hubbard

#19 Updated by Brad Hubbard 7 months ago

(gdb) bt
#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x000055787f618af0 in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:326
#3  <signal handler called>
#4  RefCountedObject::put (this=0x557882f8eb40) at ./src/common/RefCountedObj.cc:20
#5  0x000055787f4000d3 in intrusive_ptr_release (p=<optimized out>) at ./src/common/RefCountedObj.h:193
#6  boost::intrusive_ptr<RefCountedObject>::~intrusive_ptr (this=0x557883205910, __in_chrg=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#7  MonOpRequest::~MonOpRequest (this=0x557883205840, __in_chrg=<optimized out>) at ./src/mon/MonOpRequest.h:127
#8  MonOpRequest::~MonOpRequest (this=0x557883205840, __in_chrg=<optimized out>) at ./src/mon/MonOpRequest.h:129
#9  0x00007f2d7a79d4f6 in TrackedOp::put (this=<optimized out>) at ./src/common/TrackedOp.h:302
#10 intrusive_ptr_release (o=<optimized out>) at ./src/common/TrackedOp.h:388
#11 boost::intrusive_ptr<TrackedOp>::~intrusive_ptr (this=<optimized out>, __in_chrg=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#12 std::pair<double, boost::intrusive_ptr<TrackedOp> >::~pair (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_pair.h:208
#13 __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::destroy<std::pair<double, boost::intrusive_ptr<TrackedOp> > > (this=<optimized out>, __p=<optimized out>)
    at /usr/include/c++/7/ext/new_allocator.h:140
#14 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<double, boost::intrusive_ptr<TrackedOp> > > > >::destroy<std::pair<double, boost::intrusive_ptr<TrackedOp> > > (__a=..., __p=<optimized out>)
    at /usr/include/c++/7/bits/alloc_traits.h:487
#15 std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_destroy_node (this=0x557881e62bb8, __p=0x557882ebcc00) at /usr/include/c++/7/bits/stl_tree.h:650
#16 std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_drop_node (this=0x557881e62bb8, __p=0x557882ebcc00) at /usr/include/c++/7/bits/stl_tree.h:658
#17 std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux (this=this@entry=0x557881e62bb8, __position=...) at /usr/include/c++/7/bits/stl_tree.h:2477
#18 0x00007f2d7a79719e in std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::erase[abi:cxx11](std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >) (__position=..., 
    this=0x557881e62bb8) at /usr/include/c++/7/bits/stl_tree.h:1113
#19 std::set<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::erase[abi:cxx11](std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >) (__position=..., this=0x557881e62bb8) at /usr/include/c++/7/bits/stl_set.h:645
#20 OpHistory::cleanup (this=this@entry=0x557881e62b88, now=...) at ./src/common/TrackedOp.cc:101
#21 0x00007f2d7a79848b in OpHistory::_insert_delayed (this=0x557881e62b88, now=..., op=...) at ./src/common/TrackedOp.cc:83
#22 0x00007f2d7a798b29 in OpHistoryServiceThread::entry (this=0x557881e62c60) at ./src/common/TrackedOp.cc:54
#23 0x00007f2d795526db in start_thread (arg=0x7f2d71718700) at pthread_create.c:463
#24 0x00007f2d7873888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) f 4
#4  RefCountedObject::put (this=0x557882f8eb40) at ./src/common/RefCountedObj.cc:20
20          lsubdout(local_cct, refs, 1) << "RefCountedObject::put " << this << " " 
(gdb) l
15
16      void RefCountedObject::put() const {
17        CephContext *local_cct = cct;
18        auto v = --nref;
19        if (local_cct) {
20          lsubdout(local_cct, refs, 1) << "RefCountedObject::put " << this << " " 
21                         << (v + 1) << " -> " << v
22                         << dendl;
23        }
24        if (v == 0) {
(gdb) p cct
$17 = (CephContext *) 0x557881e86000
(gdb) x/i $pc
=> 0x7f2d7a784964 <RefCountedObject::put() const+68>:   cmpb   $0x0,0xcd(%r13)
(gdb) inf reg r13
r13            0x6d766e2d656d766e       7887612938640258670

That ^ doesn't look right.

(gdb) !echo 0x6d766e2d656d766e|xxd -r ; echo 
mvn-emvn
(gdb) p local_cct
$1 = (CephContext *) 0x6d766e2d656d766e
(gdb) info locals
should_gather = <error reading variable should_gather (Cannot access memory at address 0x6d766e2d656d773b)>
local_cct = 0x6d766e2d656d766e
v = <optimized out>
(gdb) !echo 0x6d766e2d656d773b|xxd -r;echo
mvn-emw;

Stack corruption?

#20 Updated by Brad Hubbard 7 months ago

Neha sent me another instance of this issue available at http://pulpito.ceph.com/nojha-2019-11-22_18:41:03-rados:upgrade:nautilus-x-singleton-wip-revert-asio-distro-basic-smithi/4534111/

Unlike the previous instance in this case 'cct' does not look valid so I think this is a race where the 'cct' variable which is copied to 'local_cct' is invalid when the copy is done.

(gdb) p local_cct
$3 = (CephContext *) 0x55ee00000000
(gdb) p cct
$4 = (CephContext *) 0x55ee00000000
(gdb) x/i $pc
(gdb) p *this
$5 = {_vptr.RefCountedObject = 0x18c, nref = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 395}, static is_always_lock_free = true}, cct = 0x55ee00000000}
(gdb) f 6
#6  ~intrusive_ptr (this=0x55ee603a56a8, __in_chrg=<optimized out>) at /usr/src/debug/ceph-15.0.0-7547-g54ddf2c/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98                                                                    
98              if( px != 0 ) intrusive_ptr_release( px );
(gdb) p *this
$29 = {
  px = 0x55ee60c4a540
}
(gdb) p *(RefCountedObject*)0x55ee60c4a540
$30 = {
  _vptr.RefCountedObject = 0x18c,
  nref = {
    <std::__atomic_base<unsigned long>> = {
      static _S_alignment = 8,
      _M_i = 395
    },
    members of std::atomic<unsigned long>:
    static is_always_lock_free = true
  },
  cct = 0x55ee00000000
}
(gdb) up
#7  ~MonOpRequest (this=0x55ee603a55f0, __in_chrg=<optimized out>) at /usr/src/debug/ceph-15.0.0-7547-g54ddf2c/src/mon/MonOpRequest.h:127
127       ~MonOpRequest() override {
(gdb) l
122       void _dump_op_descriptor_unlocked(ostream& stream) const override {
123         get_req()->print(stream);
124       }
125
126     public:
127       ~MonOpRequest() override {
128         request->put();
129       }
130
131       MonSession *get_session() const {
(gdb) whatis request
type = Message *

So I think this now looks like a race around the 'Message' pointer contained in 'MonOpRequest::request'

#21 Updated by Brad Hubbard 7 months ago

In both cases so far the Message type appears to be MSG_MON_PAXOS and priority is CEPH_MSG_PRIO_HIGH.

  header = {
    seq = {
      v = 20348
    }, 
    tid = {
      v = 0
    }, 
    type = {
      v = 66
    }, 
    priority = {
      v = 196
    }, 
    version = {
      v = 4
    },

#22 Updated by Brad Hubbard 7 months ago

I'm interested to see more cores in case one sheds more light. I've started a few runs in the hope they will fail.

#23 Updated by Brad Hubbard 7 months ago

I'm wondering if maybe this happens due to the feature change and the session being removed during the upgrade process.

(gdb) f
#7  ~MonOpRequest (this=0x55ba3d1a1110, __in_chrg=<optimized out>) at /usr/src/debug/ceph-15.0.0-7623-g3f1e2d9/src/mon/MonOpRequest.h:127
127       ~MonOpRequest() override {
(gdb) p this->session
$2 = {
  px = 0x55ba3a4b4000
}

Looking at that session in the log.

  -660> 2019-11-25T12:20:22.829+0000 7fcbd1497700 10 mon.a@2(peon) e1 ms_handle_accept con 0x55ba3a42e000 session 0x55ba3a4b4000 already on list
  -659> 2019-11-25T12:20:22.829+0000 7fcbd549f700 10 _calc_signature seq 1 front_crc_ = 1950002789 middle_crc = 0 data_crc = 0 sig = 17759458473788446172
  -658> 2019-11-25T12:20:22.829+0000 7fcbd1497700  1 -- v1:172.21.15.175:6789/0 <== mon.0 v1:172.21.15.45:6789/0 1 ==== mon_probe(probe 49a67658-959c-418d-9c96-1d33bf679dd2 name c mon_release octopus) v7 ==== 59+0+0 (unknown 1950002789 0 0) 0x55ba3aee5b80 con 0x55ba3a42e000
  -657> 2019-11-25T12:20:22.830+0000 7fcbd1497700 10 mon.a@2(peon) e1 _ms_dispatch feature change for mon.0 v1:172.21.15.45:6789/0 (was 4540138292836630523, now 4540138292836696063)
  -656> 2019-11-25T12:20:22.830+0000 7fcbd1497700 10 mon.a@2(peon) e1 remove_session 0x55ba3a4b4000 mon.0 v1:172.21.15.45:6789/0 features 0x3f01cfb8ffacfffb
  -655> 2019-11-25T12:20:22.830+0000 7fcbd1497700 10 mon.a@2(peon) e1 _ms_dispatch new session 0x55ba3efbce00 MonSession(mon.0 v1:172.21.15.45:6789/0 is open , features 0x3f01cfb8ffadffff (luminous)) features 0x3f01cfb8ffadffff

This seems to be the case in all the cores I've looked at.

#24 Updated by Brad Hubbard 7 months ago

(gdb) p this->session
$7 = {
  px = 0x55be577ea000
}
(gdb) p (*(MonSession*)0x55be577ea000)->con
$6 = {
  px = 0x6e20746573207366
}
(gdb) p (*(MonSession*)0x0)->con
Cannot access memory at address 0x18
(gdb) p/x 0x55be577ea000+0x18
$8 = 0x55be577ea018
(gdb) x/s 0x55be577ea018
0x55be577ea018: "fs set name=fs_name,type=CephString name=var,type=CephChoices,strings=max_mds|max_file_size|allow_new_snaps|inline_data|cluster_down|allow_dirfrags|balancer|standby_count_wanted|session_timeout|sessio"...

So it looks like the 'MonSession' object's memory has been freed and reused.

#25 Updated by Brad Hubbard 7 months ago

Theory:

In Monitor::_ms_dispatch() when we detect a feature change we end up with the following sequence.

Monitor::_ms_dispatch() -> Monitor::remove_session() -> SessionMap::remove_session() -> Session::put()

That could invalidate the cct if gets != puts I think. Looking at debugging ref get/puts.

#26 Updated by Brad Hubbard 7 months ago

  • Regression deleted (No)

I think we can dispense with the session put when we call 'remove_session' since we call it when we replace the session with 'set_session'. Testing that possibility.

#27 Updated by Patrick Donnelly 7 months ago

  • Status changed from 12 to New

#28 Updated by Brad Hubbard 7 months ago

  • Status changed from New to In Progress
  • Backport set to nautilus, mimic, luminous

#29 Updated by Brad Hubbard 6 months ago

  • Pull request ID set to 32365

#30 Updated by Sage Weil 6 months ago

  • Status changed from In Progress to Fix Under Review

#31 Updated by Sage Weil 5 months ago

/a/sage-2020-01-24_13:15:58-rados-wip-sage2-testing-2020-01-23-1953-distro-basic-smithi/4700893

#32 Updated by Sage Weil 5 months ago

  • Target version set to v15.0.0

#33 Updated by Sage Weil 5 months ago

  • Status changed from Fix Under Review to Pending Backport

#34 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #43879: nautilus: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup added

#35 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #43880: luminous: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup added

#36 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #43881: mimic: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup added

Also available in: Atom PDF