Bug #49072
Segmentation fault in thread_name:tp_osd_tp apparently in libpthread
Description
I suspect there is memory corruption involved and that this is a badly corrupted stack.
0> 2021-01-07T02:05:50.997+0000 7fddbd1c4700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fddbd1c4700 thread_name:tp_osd_tp

 ceph version 16.0.0-8664-g62bac298 (62bac2989dc869fcd4b06fc286a42a87216fbbb8) pacific (dev)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7fdde591c980]
 2: [0x55ac8fadf1c0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
/a/teuthology-2021-01-05_07:01:02-rados-master-distro-basic-smithi/5755585/ amongst others.
NOTE: This tracker was formerly https://tracker.ceph.com/issues/48777 but I accidentally deleted it.
History
#1 Updated by Brad Hubbard about 3 years ago
Looks like this might be it.
Thread 746 "tp_osd_tp" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd11f3700 (LWP 147217)]
0x000055556ff325a0 in ?? ()
(gdb) bt
#0 0x000055556ff325a0 in ?? ()
#1 0x0000555555fc6ad6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555567ed1140) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#2 0x00005555560a9239 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:684
#3 std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:1123
#4 std::shared_ptr<OSDMap const>::~shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:93
#5 PG::gen_prefix (this=0x55556ff34000, out=...) at ./src/osd/PG.cc:273
#6 0x000055555635a024 in _prefix<PG> (_dout=<optimized out>, t=0x55556ff34000) at ./src/osd/pg_scrubber.cc:34
#7 0x0000555556370a58 in PgScrubber::~PgScrubber (this=0x55556847e800, __in_chrg=<optimized out>) at ./src/osd/pg_scrubber.cc:1795
#8 0x000055555638bfd1 in PrimaryLogScrub::~PrimaryLogScrub (this=0x55556847e800, __in_chrg=<optimized out>) at ./src/osd/PrimaryLogScrub.h:30
#9 PrimaryLogScrub::~PrimaryLogScrub (this=0x55556847e800, __in_chrg=<optimized out>) at ./src/osd/PrimaryLogScrub.h:30
#10 0x00005555561d2409 in PrimaryLogPG::~PrimaryLogPG (this=0x55556ff34000, __in_chrg=<optimized out>) at ./src/osd/PrimaryLogPG.h:1491
#11 0x00005555560ae3e0 in PG::put (this=0x55556ff34000, tag=tag@entry=0x555556f8359a "intptr") at ./src/osd/PG.cc:132
#12 0x00005555560d30a7 in intrusive_ptr_release (pg=<optimized out>) at ./src/osd/PG.h:672
#13 boost::intrusive_ptr<PG>::~intrusive_ptr (this=0x55556a56a5a8, __in_chrg=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#14 ContainerContext<boost::intrusive_ptr<PG> >::~ContainerContext (this=0x55556a56a5a0, __in_chrg=<optimized out>) at ./src/include/Context.h:129
#15 ContainerContext<boost::intrusive_ptr<PG> >::~ContainerContext (this=0x55556a56a5a0, __in_chrg=<optimized out>) at ./src/include/Context.h:129
#16 0x00005555560291c4 in OSD::ShardedOpWQ::handle_oncommits (this=0x555561084ef8, oncommits=std::__cxx11::list = {...}) at ./src/osd/OSD.h:1680
#17 OSD::ShardedOpWQ::_process (this=0x555561084ef8, thread_index=<optimized out>, hb=<optimized out>) at ./src/osd/OSD.cc:10644
#18 0x000055555669b98c in ShardedThreadPool::shardedthreadpool_worker (this=0x555561084a28, thread_index=2) at ./src/common/WorkQueue.cc:311
#19 0x000055555669ec40 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at ./src/common/WorkQueue.h:637
#20 0x00007ffff61396db in start_thread (arg=0x7fffd11f3700) at pthread_create.c:463
#21 0x00007ffff4ed971f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) f 11
#11 0x00005555560ae3e0 in PG::put (this=0x55556ff34000, tag=tag@entry=0x555556f8359a "intptr") at ./src/osd/PG.cc:132
132 delete this;
So the crash seems to happen while the PG is deleting itself. A good start might be to look at recent commits that may have changed this behaviour.
(gdb) f
#5 PG::gen_prefix (this=0x55556ff34000, out=...) at ./src/osd/PG.cc:273
273 OSDMapRef mapref = recovery_state.get_osdmap();
(gdb) p recovery_state.osdmap_ref
$3 = std::shared_ptr<const OSDMap> (expired, weak count 0) = {get() = 0x555570b4b900}
This may have something to do with the lifetime of recovery_state.osdmap_ref: gdb reports the shared_ptr as expired, yet gen_prefix() is still reading it.
#2 Updated by Brad Hubbard about 3 years ago
/a/jafaj-2021-01-05_16:20:30-rados-wip-jan-testing-2021-01-05-1401-distro-basic-smithi/5756811 has logs, and its coredump shows the same crash.
(gdb) bt
#0 0x00007f015bbc09bf in raise () from /lib64/libpthread.so.0
#1 0x000056154df19363 in reraise_fatal (signum=11) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/global/signal_handler.cc:332
#2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/global/signal_handler.cc:332
#3 <signal handler called>
#4 0x0000000000000000 in ?? ()
#5 0x000056154d92f147 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x561558ff50b0) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#6 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x561558ff50b0) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#7 0x000056154d97dd49 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#8 std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#9 std::shared_ptr<OSDMap const>::~shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#10 PG::gen_prefix (this=<optimized out>, out=...) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PG.cc:273
#11 0x000056154dc34104 in _prefix<PG> (_dout=<optimized out>, t=0x5615587d2000) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/pg_scrubber.cc:32
#12 0x000056154dc4b069 in PgScrubber::~PgScrubber (this=0x5615587c2800, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/pg_scrubber.cc:1795
#13 0x000056154dc62d85 in PrimaryLogScrub::~PrimaryLogScrub (this=0x5615587c2800, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PrimaryLogScrub.h:30
#14 PrimaryLogScrub::~PrimaryLogScrub (this=0x5615587c2800, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PrimaryLogScrub.h:30
#15 0x000056154daa52cd in PrimaryLogPG::~PrimaryLogPG (this=0x5615587d2000, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PrimaryLogPG.h:1491
#16 0x000056154d982c29 in PG::put (this=0x5615587d2000, tag=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PG.cc:132
#17 0x000056154d9a7d5b in intrusive_ptr_release (pg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PG.h:672
#18 boost::intrusive_ptr<PG>::~intrusive_ptr (this=0x561558d98f38, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#19 ContainerContext<boost::intrusive_ptr<PG> >::~ContainerContext (this=0x561558d98f30, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/include/Context.h:129
#20 ContainerContext<boost::intrusive_ptr<PG> >::~ContainerContext (this=0x561558d98f30, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/include/Context.h:129
#21 0x000056154d901ac4 in OSD::ShardedOpWQ::handle_oncommits (this=<optimized out>, oncommits=std::__cxx11::list = {...}) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/OSD.h:1680
#22 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=1, hb=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/OSD.cc:10644
#23 0x000056154df62cd4 in ShardedThreadPool::shardedthreadpool_worker (this=0x561557952a28, thread_index=1) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/common/WorkQueue.cc:311
#24 0x000056154df65974 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/common/WorkQueue.h:637
#25 0x00007f015bbb614a in start_thread () from /lib64/libpthread.so.0
#26 0x00007f015a8edf23 in clone () from /lib64/libc.so.6
#3 Updated by Brad Hubbard about 3 years ago
- Status changed from New to Resolved
- Pull request ID set to 38860
/a/kchai-2021-01-11_11:52:22-rados-wip-kefu2-testing-2021-01-10-1949-distro-basic-smithi/5777646
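PG::recovery_state is defined after PG::m_scrubber, while PG::gen_prefix() retrieves the osdmap from recovery_state. Because C++ destroys non-static data members in reverse order of declaration, recovery_state is destroyed before m_scrubber. In PgScrubber::~PgScrubber(), we have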
dout(10) << __func__ << dendl;
which in turn uses
#define dout_prefix _prefix(_dout, this->m_pg)
template <class T>
static ostream& _prefix(std::ostream* _dout, T* t)
{
  return t->gen_prefix(*_dout) << " scrubber pg(" << t->pg_id << ") ";
}
for printing the prefix in logging messages. So PgScrubber's destructor references recovery_state, which has already been destroyed by that point.
Created https://github.com/ceph/ceph/pull/38860 to address the test failures until we have a real fix.
#4 Updated by Brad Hubbard about 3 years ago
Note that Kefu did the heavy lifting in comment 3.