Bug #49072

Segmentation fault in thread_name:tp_osd_tp apparently in libpthread

Added by Brad Hubbard about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I suspect there is memory corruption involved and that this is a badly corrupted stack.

0> 2021-01-07T02:05:50.997+0000 7fddbd1c4700 -1 *** Caught signal (Segmentation fault) **
in thread 7fddbd1c4700 thread_name:tp_osd_tp
ceph version 16.0.0-8664-g62bac298 (62bac2989dc869fcd4b06fc286a42a87216fbbb8) pacific (dev)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7fdde591c980]
2: [0x55ac8fadf1c0]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/teuthology-2021-01-05_07:01:02-rados-master-distro-basic-smithi/5755585/ amongst others.

NOTE: This tracker was formerly https://tracker.ceph.com/issues/48777 but I accidentally deleted it.

History

#1 Updated by Brad Hubbard about 3 years ago

Looks like this might be it.

Thread 746 "tp_osd_tp" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd11f3700 (LWP 147217)]
0x000055556ff325a0 in ?? ()
(gdb) bt
#0  0x000055556ff325a0 in ?? ()
#1  0x0000555555fc6ad6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555567ed1140) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#2  0x00005555560a9239 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:684
#3  std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:1123
#4  std::shared_ptr<OSDMap const>::~shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:93
#5  PG::gen_prefix (this=0x55556ff34000, out=...) at ./src/osd/PG.cc:273
#6  0x000055555635a024 in _prefix<PG> (_dout=<optimized out>, t=0x55556ff34000) at ./src/osd/pg_scrubber.cc:34
#7  0x0000555556370a58 in PgScrubber::~PgScrubber (this=0x55556847e800, __in_chrg=<optimized out>) at ./src/osd/pg_scrubber.cc:1795
#8  0x000055555638bfd1 in PrimaryLogScrub::~PrimaryLogScrub (this=0x55556847e800, __in_chrg=<optimized out>) at ./src/osd/PrimaryLogScrub.h:30
#9  PrimaryLogScrub::~PrimaryLogScrub (this=0x55556847e800, __in_chrg=<optimized out>) at ./src/osd/PrimaryLogScrub.h:30
#10 0x00005555561d2409 in PrimaryLogPG::~PrimaryLogPG (this=0x55556ff34000, __in_chrg=<optimized out>) at ./src/osd/PrimaryLogPG.h:1491
#11 0x00005555560ae3e0 in PG::put (this=0x55556ff34000, tag=tag@entry=0x555556f8359a "intptr") at ./src/osd/PG.cc:132
#12 0x00005555560d30a7 in intrusive_ptr_release (pg=<optimized out>) at ./src/osd/PG.h:672
#13 boost::intrusive_ptr<PG>::~intrusive_ptr (this=0x55556a56a5a8, __in_chrg=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#14 ContainerContext<boost::intrusive_ptr<PG> >::~ContainerContext (this=0x55556a56a5a0, __in_chrg=<optimized out>) at ./src/include/Context.h:129
#15 ContainerContext<boost::intrusive_ptr<PG> >::~ContainerContext (this=0x55556a56a5a0, __in_chrg=<optimized out>) at ./src/include/Context.h:129
#16 0x00005555560291c4 in OSD::ShardedOpWQ::handle_oncommits (this=0x555561084ef8, oncommits=std::__cxx11::list = {...}) at ./src/osd/OSD.h:1680
#17 OSD::ShardedOpWQ::_process (this=0x555561084ef8, thread_index=<optimized out>, hb=<optimized out>) at ./src/osd/OSD.cc:10644
#18 0x000055555669b98c in ShardedThreadPool::shardedthreadpool_worker (this=0x555561084a28, thread_index=2) at ./src/common/WorkQueue.cc:311
#19 0x000055555669ec40 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at ./src/common/WorkQueue.h:637
#20 0x00007ffff61396db in start_thread (arg=0x7fffd11f3700) at pthread_create.c:463
#21 0x00007ffff4ed971f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) f 11
#11 0x00005555560ae3e0 in PG::put (this=0x55556ff34000, tag=tag@entry=0x555556f8359a "intptr") at ./src/osd/PG.cc:132
132         delete this;

So this seems to be happening when the PG is deleting itself. I guess a good start might be to look at recent commits that may have changed this behaviour.

(gdb) f 5
#5  PG::gen_prefix (this=0x55556ff34000, out=...) at ./src/osd/PG.cc:273
273       OSDMapRef mapref = recovery_state.get_osdmap();
(gdb) p recovery_state.osdmap_ref 
$3 = std::shared_ptr<const OSDMap> (expired, weak count 0) = {get() = 0x555570b4b900}

This may have something to do with the lifespan of recovery_state.osdmap_ref.

#2 Updated by Brad Hubbard about 3 years ago

/a/jafaj-2021-01-05_16:20:30-rados-wip-jan-testing-2021-01-05-1401-distro-basic-smithi/5756811 has logs; the coredump shows the same backtrace.

(gdb) bt
#0  0x00007f015bbc09bf in raise () from /lib64/libpthread.so.0
#1  0x000056154df19363 in reraise_fatal (signum=11) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/global/signal_handler.cc:332
#2  handle_fatal_signal (signum=11) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/global/signal_handler.cc:332
#3  <signal handler called>
#4  0x0000000000000000 in ?? ()
#5  0x000056154d92f147 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x561558ff50b0) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#6  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x561558ff50b0) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#7  0x000056154d97dd49 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#8  std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#9  std::shared_ptr<OSDMap const>::~shared_ptr (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#10 PG::gen_prefix (this=<optimized out>, out=...) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PG.cc:273
#11 0x000056154dc34104 in _prefix<PG> (_dout=<optimized out>, t=0x5615587d2000) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/pg_scrubber.cc:32
#12 0x000056154dc4b069 in PgScrubber::~PgScrubber (this=0x5615587c2800, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/pg_scrubber.cc:1795
#13 0x000056154dc62d85 in PrimaryLogScrub::~PrimaryLogScrub (this=0x5615587c2800, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PrimaryLogScrub.h:30
#14 PrimaryLogScrub::~PrimaryLogScrub (this=0x5615587c2800, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PrimaryLogScrub.h:30
#15 0x000056154daa52cd in PrimaryLogPG::~PrimaryLogPG (this=0x5615587d2000, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PrimaryLogPG.h:1491
#16 0x000056154d982c29 in PG::put (this=0x5615587d2000, tag=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PG.cc:132
#17 0x000056154d9a7d5b in intrusive_ptr_release (pg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/PG.h:672
#18 boost::intrusive_ptr<PG>::~intrusive_ptr (this=0x561558d98f38, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#19 ContainerContext<boost::intrusive_ptr<PG> >::~ContainerContext (this=0x561558d98f30, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/include/Context.h:129
#20 ContainerContext<boost::intrusive_ptr<PG> >::~ContainerContext (this=0x561558d98f30, __in_chrg=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/include/Context.h:129
#21 0x000056154d901ac4 in OSD::ShardedOpWQ::handle_oncommits (this=<optimized out>, oncommits=std::__cxx11::list = {...}) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/OSD.h:1680
#22 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=1, hb=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/osd/OSD.cc:10644
#23 0x000056154df62cd4 in ShardedThreadPool::shardedthreadpool_worker (this=0x561557952a28, thread_index=1) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/common/WorkQueue.cc:311
#24 0x000056154df65974 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-16.0.0-8690.gb6596802.el8.x86_64/src/common/WorkQueue.h:637
#25 0x00007f015bbb614a in start_thread () from /lib64/libpthread.so.0
#26 0x00007f015a8edf23 in clone () from /lib64/libc.so.6

#3 Updated by Brad Hubbard about 3 years ago

  • Status changed from New to Resolved
  • Pull request ID set to 38860

/a/kchai-2021-01-11_11:52:22-rados-wip-kefu2-testing-2021-01-10-1949-distro-basic-smithi/5777646

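PG::recovery_state is declared after PG::m_scrubber, and C++ destroys members in reverse order of declaration, so recovery_state is destroyed before m_scrubber. PG::gen_prefix() retrieves the osdmap from recovery_state. In PgScrubber::~PgScrubber(), we have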

dout(10) << __func__ << dendl;

which in turn uses

#define dout_prefix _prefix(_dout, this->m_pg)

template <class T> static ostream& _prefix(std::ostream* _dout, T* t)
{
  return t->gen_prefix(*_dout) << " scrubber pg(" << t->pg_id << ") ";
}

for printing the prefix in logging messages. So PgScrubber's destructor references "recovery_state", which has already been destroyed.
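The ordering problem can be reproduced in isolation. The following is a minimal sketch, not the Ceph code: RecoveryState, Scrubber, PgScrubberFirst, PgStateFirst and destruction_log are hypothetical stand-ins, and a global flag replaces the real read of recovery_state so that the example never touches a destroyed object. It relies only on the C++ rule that non-static members are destroyed in reverse order of declaration. The second layout is shown purely for contrast and is not a claim about what the pull request changes.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the real classes; only the member
// declaration order matters here.
static std::vector<std::string> g_events;  // destruction log
static bool g_state_alive = false;         // tracks RecoveryState lifetime

struct RecoveryState {
  RecoveryState() { g_state_alive = true; }
  ~RecoveryState() {
    g_state_alive = false;
    g_events.push_back("~RecoveryState");
  }
};

struct Scrubber {
  ~Scrubber() {
    // The real destructor logs through gen_prefix(), which reads
    // recovery_state. Here we only record whether that read would be safe.
    g_events.push_back(g_state_alive ? "~Scrubber: state alive"
                                     : "~Scrubber: state already destroyed");
  }
};

// Layout described in this ticket: the scrubber is declared first,
// so it is destroyed last, after recovery_state is gone.
struct PgScrubberFirst {
  Scrubber m_scrubber;
  RecoveryState recovery_state;
};

// Contrasting layout: recovery_state is declared first,
// so it outlives the scrubber.
struct PgStateFirst {
  RecoveryState recovery_state;
  Scrubber m_scrubber;
};

// Constructs and immediately destroys a Pg, returning the destructor order.
template <class Pg>
std::vector<std::string> destruction_log() {
  g_events.clear();
  { Pg pg; }
  return g_events;
}
```

Calling destruction_log<PgScrubberFirst>() records ~RecoveryState before the scrubber's destructor runs, which is the situation in the backtraces above; destruction_log<PgStateFirst>() records the scrubber's destructor first, while the state is still alive.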

Created https://github.com/ceph/ceph/pull/38860 to address the test failures until we have a real fix.

#4 Updated by Brad Hubbard about 3 years ago

Note that Kefu did the heavy lifting in comment 3.
