Bug #38345
closedmon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup
0%
Description
2019-02-15 09:48:45.381 7f30cf9fd700 1 -- v1:172.21.15.148:6789/0 <== mon.1 v1:172.21.15.102:6789/0 2 ==== election(8ec471df-d03c-4d8b-9fa6-53fd073b70cc propose 11) v7 ==== 435+0+0 (539273810 0 0) 0x3cd5e40 con 0x437cd80 ... 2019-02-15 09:48:45.389 7f30cf9fd700 20 -- v1:172.21.15.148:6789/0 done calling dispatch on 0x3cd5e40 ... 2019-02-15 09:49:57.145 7f30d3204700 -1 *** Caught signal (Segmentation fault) ** in thread 7f30d3204700 thread_name:OpHistorySvc ceph version 14.0.1-3738-gca5ec13 (ca5ec139e4b98becd397a459018a659a332bc291) nautilus (dev) 1: (()+0x11390) [0x7f30db1ad390] 2: (RefCountedObject::put() const+0x44) [0x6d6014] 3: (MonOpRequest::~MonOpRequest()+0x43) [0x6d7663] 4: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >)+0xa6) [0x7f30dc4421d6] 5: (OpHistory::cleanup(utime_t)+0x31d) [0x7f30dc43c66d] 6: (OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>)+0x276) [0x7f30dc43d8e6] 7: (OpHistoryServiceThread::entry()+0xf9) [0x7f30dc43df49] 8: (()+0x76ba) [0x7f30db1a36ba] 9: (clone()+0x6d) [0x7f30da9cc41d]
(gdb) bt #0 0x00007f30db1ad269 in raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/pt-raise.c:35 #1 0x00000000008b7473 in reraise_fatal (signum=11) at /build/ceph-14.0.1-3738-gca5ec13/src/global/signal_handler.cc:81 #2 handle_fatal_signal (signum=11) at /build/ceph-14.0.1-3738-gca5ec13/src/global/signal_handler.cc:298 #3 <signal handler called> #4 RefCountedObject::put (this=0x428eb40) at /build/ceph-14.0.1-3738-gca5ec13/src/common/RefCountedObj.h:58 #5 0x00000000006d7663 in intrusive_ptr_release (p=<optimized out>) at /build/ceph-14.0.1-3738-gca5ec13/src/common/RefCountedObj.h:174 #6 boost::intrusive_ptr<RefCountedObject>::~intrusive_ptr (this=0x2e88740, __in_chrg=<optimized out>) at /build/ceph-14.0.1-3738-gca5ec13/obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98 #7 MonOpRequest::~MonOpRequest (this=0x2e88670, __in_chrg=<optimized out>) at /build/ceph-14.0.1-3738-gca5ec13/src/mon/MonOpRequest.h:127 #8 MonOpRequest::~MonOpRequest (this=0x2e88670, __in_chrg=<optimized out>) at /build/ceph-14.0.1-3738-gca5ec13/src/mon/MonOpRequest.h:129 #9 0x00007f30dc4421d6 in std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >) () from /usr/lib/ceph/libceph-common.so.1 #10 0x00007f30dc43c66d in OpHistory::cleanup(utime_t) () from /usr/lib/ceph/libceph-common.so.1 #11 0x00007f30dc43d8e6 in OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>) () from /usr/lib/ceph/libceph-common.so.1 #12 0x00007f30dc43df49 in OpHistoryServiceThread::entry() () from /usr/lib/ceph/libceph-common.so.1 #13 0x00007f30db1a36ba in start_thread (arg=0x7f30d3204700) at pthread_create.c:333 #14 0x00007f30da9cc41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
/a/sage-2019-02-15_04:19:44-upgrade:mimic-x-wip-mimic-upgrade-distro-basic-smithi/3592782
Updated by Sage Weil about 5 years ago
- Subject changed from mon: segv in to mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup
- Priority changed from High to Urgent
/a/sage-2019-02-15_20:30:53-upgrade:mimic-x-wip-mimic-upgrade-distro-basic-smithi/3595562
Updated by Greg Farnum about 5 years ago
- Related to Bug #38306: ceph-mon: "*** Caught signal (Segmentation fault) **" in upgrade:luminous-x-mimic added
Updated by Joao Eduardo Luis about 5 years ago
- Category set to Correctness/Safety
- Assignee set to Joao Eduardo Luis
- Component(RADOS) Monitor added
Updated by Sage Weil about 5 years ago
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 fuse-libs-2.9.2-10.el7.x86_64 glibc-2.17-222.el7.x86_64 gperftools-libs-2.6.1-1.el7.x86_64 leveldb-1.12.0-11.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-52.el7.x86_64 libcap-2.22-9.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libibverbs-15-7.el7_5.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-15-7.el7_5.x86_64 libstdc++-4.8.5-36.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 lttng-ust-2.4.1-4.el7.x86_64 lz4-1.7.5-2.el7.x86_64 nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-5.el7_5.x86_64 nss-softokn-3.36.0-5.el7_5.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.el7_5.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 snappy-1.1.0-3.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 systemd-libs-219-57.el7.x86_64 userspace-rcu-0.7.16-1.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64 (gdb) bt #0 0x00007fa5323ae59b in raise () from /lib64/libpthread.so.0 #1 0x000056198d2e04e5 in reraise_fatal (signum=11) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/global/signal_handler.cc:81 #2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/global/signal_handler.cc:298 #3 <signal handler called> #4 0x00007fa5323a8c80 in pthread_mutex_lock () from /lib64/libpthread.so.0 #5 0x00007fa535ade441 in __gthread_mutex_lock (__mutex=0x10000000100038) at /opt/rh/devtoolset-7/root/usr/include/c++/7/x86_64-redhat-linux/bits/gthr-default.h:748 #6 lock (this=0x10000000100038) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:103 #7 lock (this=0x7fa52b007410) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:267 #8 unique_lock (__m=..., this=0x7fa52b007410) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:197 #9 ceph::logging::Log::submit_entry(ceph::logging::Entry&&) (this=0x10000000100000, e=e@entry=<unknown type in /usr/lib/debug/usr/lib64/ceph/libceph-common.so.0.debug, CU 0x4530b47, DIE 0x459b6ad>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/log/Log.cc:177 #10 0x000056198d105fa2 in RefCountedObject::put (this=0x56198f7b7880) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/RefCountedObj.h:60 #11 0x000056198d107399 in intrusive_ptr_release (p=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/RefCountedObj.h:174 #12 ~intrusive_ptr (this=0x561990dc7ec8, __in_chrg=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98 #13 ~MonOpRequest (this=0x561990dc7e10, __in_chrg=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/mon/MonOpRequest.h:127 #14 MonOpRequest::~MonOpRequest (this=0x561990dc7e10, __in_chrg=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/mon/MonOpRequest.h:129 #15 0x00007fa535841ab6 in put (this=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.h:308 #16 intrusive_ptr_release (o=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.h:394 #17 ~intrusive_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/src/debug/ceph-14.1.0-638-ged9e549/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98 #18 ~pair (this=<optimized out>, __in_chrg=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_pair.h:198 #19 destroy<std::pair<double, boost::intrusive_ptr<TrackedOp> > > (this=<optimized out>, __p=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/ext/new_allocator.h:140 #20 destroy<std::pair<double, boost::intrusive_ptr<TrackedOp> > > (__a=..., __p=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/alloc_traits.h:487 #21 _M_destroy_node (this=0x561990bf3268, __p=0x561990b28ab0) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_tree.h:650 #22 _M_drop_node (this=0x561990bf3268, __p=0x561990b28ab0) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_tree.h:658 #23 std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux (this=this@entry=0x561990bf3268, __position=...) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_tree.h:2477 #24 0x00007fa53583b944 in erase (__position=..., this=0x561990bf3268) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_tree.h:1113 #25 erase (__position=..., this=0x561990bf3268) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_set.h:645 #26 OpHistory::cleanup (this=this@entry=0x561990bf3238, now=...) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.cc:101 #27 0x00007fa53583cd9d in OpHistory::_insert_delayed (this=0x561990bf3238, now=..., op=...) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.cc:83 #28 0x00007fa53583d378 in OpHistoryServiceThread::entry (this=0x561990bf3308) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/TrackedOp.cc:54 #29 0x00007fa5323a6e25 in start_thread () from /lib64/libpthread.so.0 #30 0x00007fa53126fbad in clone () from /lib64/libc.so.6
/a/sage-2019-02-03_18:58:17-rados-wip-sage2-testing-2019-02-03-1047-distro-basic-smithi/3545666
core and log in top level dir
Updated by Sage Weil about 5 years ago
(gdb) do #10 0x000056198d105fa2 in RefCountedObject::put (this=0x56198f7b7880) at /usr/src/debug/ceph-14.1.0-638-ged9e549/src/common/RefCountedObj.h:60 60 << dendl; (gdb) list 55 CephContext *local_cct = cct; 56 int v = --nref; 57 if (local_cct) 58 lsubdout(local_cct, refs, 1) << "RefCountedObject::put " << this << " " 59 << (v + 1) << " -> " << v 60 << dendl; 61 if (v == 0) { 62 ANNOTATE_HAPPENS_AFTER(&nref); 63 ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(&nref); 64 delete this; (gdb) p cct $16 = (CephContext *) 0x56198fae0000 (gdb) p local_cct $17 = (CephContext *) 0x5619a1e93540 (gdb) p nref $18 = { <std::__atomic_base<unsigned long>> = { static _S_alignment = 8, _M_i = 4 }, members of std::atomic<unsigned long>: static is_always_lock_free = true } (gdb) p cct->_log $19 = (ceph::logging::Log *) 0x56198f7b8780 (gdb) p local_cct->_log $20 = (ceph::logging::Log *) 0x10000000100000
something mangled cct before the old object was destroyed?
(this->cct and nref are potentially meaningless since we may have freed the object and had someone else allocate underneath us?)
Updated by Sage Weil about 5 years ago
Forgot mention, the op appears to be an MForward.
Updated by Greg Farnum about 5 years ago
- Related to Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics added
Updated by Greg Farnum about 5 years ago
We're also seeing Bus Errors instead of segfaults in the OpHistory cleanup at #24664 so these may be related...
Updated by Kefu Chai almost 5 years ago
ceph version 13.2.5-268-g0d5c736 (0d5c736349288b66b3b12c1df49568d1ed29f90d) mimic (stable) 1: (()+0xf5d0) [0x7f11bd74d5d0] 2: (intrusive_ptr_release(RefCountedObject const*)+0x19f) [0x7f11be53ec2f] 3: (MonOpRequest::~MonOpRequest()+0x49) [0x557d810f8ba9] 4: (OpHistoryServiceThread::entry()+0x227) [0x7f11be41c0d7] 5: (()+0x7dd5) [0x7f11bd745dd5] 6: (clone()+0x6d) [0x7f11ba062ead]
/kchai-2019-05-04_11:38:20-rados-wip-ceph-release-distro-basic-smithi/3927368/
Updated by Neha Ojha almost 5 years ago
/a/nojha-2019-05-10_00:33:57-upgrade-wip-parial-recovery-2019-05-09-distro-basic-smithi/3943156/
Updated by Greg Farnum over 4 years ago
- Assignee deleted (
Joao Eduardo Luis) - Priority changed from Urgent to High
Updated by Sage Weil over 4 years ago
2019-10-06T07:00:58.981 INFO:tasks.ceph.mon.c.smithi112.stderr:*** Caught signal (Segmentation fault) ** 2019-10-06T07:00:58.981 INFO:tasks.ceph.mon.c.smithi112.stderr: in thread 7f5f2325b700 thread_name:OpHistorySvc 2019-10-06T07:00:58.982 INFO:tasks.ceph.mon.c.smithi112.stderr: ceph version 13.2.6-466-g1720691 (17206912981387416f4e0759804d403ab5e4cf52) mimic (stable) 2019-10-06T07:00:58.982 INFO:tasks.ceph.mon.c.smithi112.stderr: 1: (()+0xf630) [0x7f5f2c8fc630] 2019-10-06T07:00:58.983 INFO:tasks.ceph.mon.c.smithi112.stderr: 2: (intrusive_ptr_release(RefCountedObject const*)+0x19f) [0x7f5f2d6f1e6f] 2019-10-06T07:00:58.983 INFO:tasks.ceph.mon.c.smithi112.stderr: 3: (MonOpRequest::~MonOpRequest()+0x49) [0x5644001b93a9] 2019-10-06T07:00:58.983 INFO:tasks.ceph.mon.c.smithi112.stderr: 4: (OpHistoryServiceThread::entry()+0x227) [0x7f5f2d5ce837] 2019-10-06T07:00:58.983 INFO:tasks.ceph.mon.c.smithi112.stderr: 5: (()+0x7ea5) [0x7f5f2c8f4ea5] 2019-10-06T07:00:58.984 INFO:tasks.ceph.mon.c.smithi112.stderr: 6: (clone()+0x6d) [0x7f5f294158cd] 2019-10-06T07:00:58.984 INFO:tasks.ceph.mon.c.smithi112.stderr:2019-10-06 07:00:58.989 7f5f2325b700 -1 *** Caught signal (Segmentation fault) **
/a/sage-2019-10-05_20:13:12-rados:upgrade-wip-sage2-testing-2019-10-04-1006-distro-basic-smithi/4362554
(mimic -> master(octopus))
Updated by Sage Weil over 4 years ago
- Priority changed from High to Urgent
2019-11-08T03:51:49.994 INFO:tasks.ceph.mon.b.smithi151.stderr:*** Caught signal (Segmentation fault) ** 2019-11-08T03:51:49.994 INFO:tasks.ceph.mon.b.smithi151.stderr: in thread 7f0e38f9f700 thread_name:OpHistorySvc 2019-11-08T03:51:50.049 INFO:tasks.ceph.mon.b.smithi151.stderr: ceph version 15.0.0-6959-g6efb5ba (6efb5bac3deee53b8b799cbac04ac70be6e20458) octopus (dev) 2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 1: (()+0xf5f0) [0x7f0e40de25f0] 2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 2: (RefCountedObject::put() const+0x3b) [0x7f0e430a389b] 2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 3: (MonOpRequest::~MonOpRequest()+0x49) [0x56319e6c7ac9] 2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 4: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >)+0xac) [0x7f0e430b9c0c] 2019-11-08T03:51:50.050 INFO:tasks.ceph.mon.b.smithi151.stderr: 5: (OpHistory::cleanup(utime_t)+0x315) [0x7f0e430b4305]
/a/sage-2019-11-07_22:38:52-rados-wip-sage-testing-2019-11-07-1412-distro-basic-smithi/4480923
Updated by Brad Hubbard over 4 years ago
- Related to deleted (Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics)
Updated by Neha Ojha over 4 years ago
- Related to Bug #40367: "*** Caught signal (Segmentation fault) **" in upgrade:luminous-x-nautilus added
Updated by Sage Weil over 4 years ago
- Assignee deleted (
Brad Hubbard)
/a/sage-2019-11-24_06:32:18-rados-wip-sage-testing-2019-11-23-2031-distro-basic-smithi/4538572
2019-11-24T09:03:07.263 INFO:tasks.ceph.mon.a.smithi097.stderr: ceph version 15.0.0-7623-g3f1e2d9 (3f1e2d974adaf8f1345a5ae935a58a97e43e1a04) octopus (dev) 2019-11-24T09:03:07.263 INFO:tasks.ceph.mon.a.smithi097.stderr: 1: (()+0x12890) [0x7f2d7955d890] 2019-11-24T09:03:07.263 INFO:tasks.ceph.mon.a.smithi097.stderr: 2: (RefCountedObject::put() const+0x44) [0x7f2d7a784964] 2019-11-24T09:03:07.264 INFO:tasks.ceph.mon.a.smithi097.stderr: 3: (MonOpRequest::~MonOpRequest()+0x43) [0x55787f4000d3] 2019-11-24T09:03:07.264 INFO:tasks.ceph.mon.a.smithi097.stderr: 4: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp 2019-11-24T09:03:07.264 INFO:tasks.ceph.mon.a.smithi097.stderr: 5: (OpHistory::cleanup(utime_t)+0x31e) [0x7f2d7a79719e] 2019-11-24T09:03:07.264 INFO:tasks.ceph.mon.a.smithi097.stderr: 6: (OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>)+0x24b) [0x7f2d7a79848b] 2019-11-24T09:03:07.265 INFO:tasks.ceph.mon.a.smithi097.stderr: 7: (OpHistoryServiceThread::entry()+0xf9) [0x7f2d7a798b29] 2019-11-24T09:03:07.265 INFO:tasks.ceph.mon.a.smithi097.stderr: 8: (()+0x76db) [0x7f2d795526db] 2019-11-24T09:03:07.265 INFO:tasks.ceph.mon.a.smithi097.stderr: 9: (clone()+0x3f) [0x7f2d7873888f]
i don't really understand why we see this with upgrade tests but not normally...
Updated by Brad Hubbard over 4 years ago
(gdb) bt #0 raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x000055787f618af0 in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:81 #2 handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:326 #3 <signal handler called> #4 RefCountedObject::put (this=0x557882f8eb40) at ./src/common/RefCountedObj.cc:20 #5 0x000055787f4000d3 in intrusive_ptr_release (p=<optimized out>) at ./src/common/RefCountedObj.h:193 #6 boost::intrusive_ptr<RefCountedObject>::~intrusive_ptr (this=0x557883205910, __in_chrg=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98 #7 MonOpRequest::~MonOpRequest (this=0x557883205840, __in_chrg=<optimized out>) at ./src/mon/MonOpRequest.h:127 #8 MonOpRequest::~MonOpRequest (this=0x557883205840, __in_chrg=<optimized out>) at ./src/mon/MonOpRequest.h:129 #9 0x00007f2d7a79d4f6 in TrackedOp::put (this=<optimized out>) at ./src/common/TrackedOp.h:302 #10 intrusive_ptr_release (o=<optimized out>) at ./src/common/TrackedOp.h:388 #11 boost::intrusive_ptr<TrackedOp>::~intrusive_ptr (this=<optimized out>, __in_chrg=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98 #12 std::pair<double, boost::intrusive_ptr<TrackedOp> >::~pair (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_pair.h:208 #13 __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::destroy<std::pair<double, boost::intrusive_ptr<TrackedOp> > > (this=<optimized out>, __p=<optimized out>) at /usr/include/c++/7/ext/new_allocator.h:140 #14 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<double, boost::intrusive_ptr<TrackedOp> > > > >::destroy<std::pair<double, boost::intrusive_ptr<TrackedOp> > > (__a=..., __p=<optimized out>) at /usr/include/c++/7/bits/alloc_traits.h:487 #15 std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_destroy_node (this=0x557881e62bb8, __p=0x557882ebcc00) at /usr/include/c++/7/bits/stl_tree.h:650 #16 std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_drop_node (this=0x557881e62bb8, __p=0x557882ebcc00) at /usr/include/c++/7/bits/stl_tree.h:658 #17 std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux (this=this@entry=0x557881e62bb8, __position=...) at /usr/include/c++/7/bits/stl_tree.h:2477 #18 0x00007f2d7a79719e in std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::erase[abi:cxx11](std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >) (__position=..., this=0x557881e62bb8) at /usr/include/c++/7/bits/stl_tree.h:1113 #19 std::set<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::erase[abi:cxx11](std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >) (__position=..., this=0x557881e62bb8) at /usr/include/c++/7/bits/stl_set.h:645 #20 OpHistory::cleanup (this=this@entry=0x557881e62b88, now=...) at ./src/common/TrackedOp.cc:101 #21 0x00007f2d7a79848b in OpHistory::_insert_delayed (this=0x557881e62b88, now=..., op=...) at ./src/common/TrackedOp.cc:83 #22 0x00007f2d7a798b29 in OpHistoryServiceThread::entry (this=0x557881e62c60) at ./src/common/TrackedOp.cc:54 #23 0x00007f2d795526db in start_thread (arg=0x7f2d71718700) at pthread_create.c:463 #24 0x00007f2d7873888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (gdb) f 4 #4 RefCountedObject::put (this=0x557882f8eb40) at ./src/common/RefCountedObj.cc:20 20 lsubdout(local_cct, refs, 1) << "RefCountedObject::put " << this << " " (gdb) l 15 16 void RefCountedObject::put() const { 17 CephContext *local_cct = cct; 18 auto v = --nref; 19 if (local_cct) { 20 lsubdout(local_cct, refs, 1) << "RefCountedObject::put " << this << " " 21 << (v + 1) << " -> " << v 22 << dendl; 23 } 24 if (v == 0) { (gdb) p cct $17 = (CephContext *) 0x557881e86000 (gdb) x/i $pc => 0x7f2d7a784964 <RefCountedObject::put() const+68>: cmpb $0x0,0xcd(%r13) (gdb) inf reg r13 r13 0x6d766e2d656d766e 7887612938640258670
That ^ doesn't look right.
(gdb) !echo 0x6d766e2d656d766e|xxd -r ; echo mvn-emvn (gdb) p local_cct $1 = (CephContext *) 0x6d766e2d656d766e (gdb) info locals should_gather = <error reading variable should_gather (Cannot access memory at address 0x6d766e2d656d773b)> local_cct = 0x6d766e2d656d766e v = <optimized out> (gdb) !echo 0x6d766e2d656d773b|xxd -r;echo mvn-emw;
Stack corruption?
Updated by Brad Hubbard over 4 years ago
Neha sent me another instance of this issue available at http://pulpito.ceph.com/nojha-2019-11-22_18:41:03-rados:upgrade:nautilus-x-singleton-wip-revert-asio-distro-basic-smithi/4534111/
Unlike the previous instance in this case 'cct' does not look valid so I think this is a race where the 'cct' variable which is copied to 'local_cct' is invalid when the copy is done.
(gdb) p local_cct $3 = (CephContext *) 0x55ee00000000 (gdb) p cct $4 = (CephContext *) 0x55ee00000000 (gdb) x/i $pc (gdb) p *this $5 = {_vptr.RefCountedObject = 0x18c, nref = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 395}, static is_always_lock_free = true}, cct = 0x55ee00000000} (gdb) f 6 #6 ~intrusive_ptr (this=0x55ee603a56a8, __in_chrg=<optimized out>) at /usr/src/debug/ceph-15.0.0-7547-g54ddf2c/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98 98 if( px != 0 ) intrusive_ptr_release( px ); (gdb) p *this $29 = { px = 0x55ee60c4a540 } (gdb) p *(RefCountedObject*)0x55ee60c4a540 $30 = { _vptr.RefCountedObject = 0x18c, nref = { <std::__atomic_base<unsigned long>> = { static _S_alignment = 8, _M_i = 395 }, members of std::atomic<unsigned long>: static is_always_lock_free = true }, cct = 0x55ee00000000 } (gdb) up #7 ~MonOpRequest (this=0x55ee603a55f0, __in_chrg=<optimized out>) at /usr/src/debug/ceph-15.0.0-7547-g54ddf2c/src/mon/MonOpRequest.h:127 127 ~MonOpRequest() override { (gdb) l 122 void _dump_op_descriptor_unlocked(ostream& stream) const override { 123 get_req()->print(stream); 124 } 125 126 public: 127 ~MonOpRequest() override { 128 request->put(); 129 } 130 131 MonSession *get_session() const { (gdb) whatis request type = Message *
So I think this now looks like a race around the 'Message' pointer contained in 'MonOpRequest::request'
Updated by Brad Hubbard over 4 years ago
In both cases so far the Message type appears to be MSG_MON_PAXOS and priority is CEPH_MSG_PRIO_HIGH.
header = { seq = { v = 20348 }, tid = { v = 0 }, type = { v = 66 }, priority = { v = 196 }, version = { v = 4 },
Updated by Brad Hubbard over 4 years ago
I'm interested to see more cores in case one sheds more light. I've started a few runs in the hope they will fail.
Updated by Brad Hubbard over 4 years ago
I'm wondering if maybe this happens due to the feature change and the session being removed during the upgrade process.
(gdb) f #7 ~MonOpRequest (this=0x55ba3d1a1110, __in_chrg=<optimized out>) at /usr/src/debug/ceph-15.0.0-7623-g3f1e2d9/src/mon/MonOpRequest.h:127 127 ~MonOpRequest() override { (gdb) p this->session $2 = { px = 0x55ba3a4b4000 }
Looking at that session in the log.
-660> 2019-11-25T12:20:22.829+0000 7fcbd1497700 10 mon.a@2(peon) e1 ms_handle_accept con 0x55ba3a42e000 session 0x55ba3a4b4000 already on list -659> 2019-11-25T12:20:22.829+0000 7fcbd549f700 10 _calc_signature seq 1 front_crc_ = 1950002789 middle_crc = 0 data_crc = 0 sig = 17759458473788446172 -658> 2019-11-25T12:20:22.829+0000 7fcbd1497700 1 -- v1:172.21.15.175:6789/0 <== mon.0 v1:172.21.15.45:6789/0 1 ==== mon_probe(probe 49a67658-959c-418d-9c96-1d33bf679dd2 name c mon_release octopus) v7 ==== 59+0+0 (unknown 1950002789 0 0) 0x55ba3aee5b80 con 0x55ba3a42e000 -657> 2019-11-25T12:20:22.830+0000 7fcbd1497700 10 mon.a@2(peon) e1 _ms_dispatch feature change for mon.0 v1:172.21.15.45:6789/0 (was 4540138292836630523, now 4540138292836696063) -656> 2019-11-25T12:20:22.830+0000 7fcbd1497700 10 mon.a@2(peon) e1 remove_session 0x55ba3a4b4000 mon.0 v1:172.21.15.45:6789/0 features 0x3f01cfb8ffacfffb -655> 2019-11-25T12:20:22.830+0000 7fcbd1497700 10 mon.a@2(peon) e1 _ms_dispatch new session 0x55ba3efbce00 MonSession(mon.0 v1:172.21.15.45:6789/0 is open , features 0x3f01cfb8ffadffff (luminous)) features 0x3f01cfb8ffadffff
This seems to be the case in all the cores I've looked at.
Updated by Brad Hubbard over 4 years ago
(gdb) p this->session $7 = { px = 0x55be577ea000 } (gdb) p (*(MonSession*)0x55be577ea000)->con $6 = { px = 0x6e20746573207366 } (gdb) p (*(MonSession*)0x0)->con Cannot access memory at address 0x18 (gdb) p/x 0x55be577ea000+0x18 $8 = 0x55be577ea018 (gdb) x/s 0x55be577ea018 0x55be577ea018: "fs set name=fs_name,type=CephString name=var,type=CephChoices,strings=max_mds|max_file_size|allow_new_snaps|inline_data|cluster_down|allow_dirfrags|balancer|standby_count_wanted|session_timeout|sessio"...
So it looks like the 'MonSession' object's memory has been freed and reused.
Updated by Brad Hubbard over 4 years ago
Theory:
In Monitor::_ms_dispatch() when we detect a feature change we end up with the following sequence.
Monitor::_ms_dispatch() -> Monitor::remove_session() -> SessionMap::remove_session() -> Session::put()
That could invalidate the cct if gets != puts I think. Looking at debugging ref get/puts.
Updated by Brad Hubbard over 4 years ago
- Regression deleted (
No)
I think we can dispense with the session put when we call 'remove_session' since we call it when we replace the session with 'set_session'. Testing that possibility.
Updated by Brad Hubbard over 4 years ago
- Status changed from New to In Progress
- Backport set to nautilus, mimic, luminous
Updated by Sage Weil over 4 years ago
- Status changed from In Progress to Fix Under Review
Updated by Sage Weil over 4 years ago
/a/sage-2020-01-24_13:15:58-rados-wip-sage2-testing-2020-01-23-1953-distro-basic-smithi/4700893
Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43879: nautilus: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43880: luminous: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43881: mimic: mon: segv in MonOpRequest::~MonOpRequest OpHistory::cleanup added
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".