Bug #43903

osd segv in ceph::buffer::v14_2_0::ptr::release (PGTempMap::decode)

Added by Sage Weil over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

#14 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc:167
#15 <signal handler called>
#16 0x0000561ccd8acc03 in ceph::buffer::v14_2_0::ptr::release (this=this@entry=0x561ce07b4008) at /usr/include/c++/8/bits/atomic_base.h:303
#17 0x0000561ccd999c62 in ceph::buffer::v14_2_0::ptr::~ptr (this=0x561ce07b4008, __in_chrg=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/buffer.h:398
#18 ceph::buffer::v14_2_0::ptr_node::~ptr_node (this=0x561ce07b4000, __in_chrg=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/buffer.h:398
#19 ceph::buffer::v14_2_0::ptr_node::disposer::operator() (this=<optimized out>, delete_this=0x561ce07b4000) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/buffer.h:393
#20 ceph::buffer::v14_2_0::list::buffers_t::clear_and_dispose (this=0x561cdc15cc50) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/buffer.h:638
#21 ceph::buffer::v14_2_0::list::clear (this=0x561cdc15cc50) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/buffer.h:1057
#22 PGTempMap::decode (this=0x561cdc15cc50, p=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSDMap.h:128
#23 0x0000561ccd971f7f in decode (p=..., c=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSDMap.h:346
#24 OSDMap::decode (this=0x561cd9b49400, bl=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSDMap.cc:3219
#25 0x0000561ccd974e65 in OSDMap::decode (this=this@entry=0x561cd9b49400, bl=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSDMap.cc:3044
#26 0x0000561ccd0a9013 in OSDService::try_get_map (this=0x561cd79b5350, epoch=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:1615
#27 0x0000561ccd0fa310 in OSD::advance_pg (this=0x561cd79b4000, osd_epoch=<optimized out>, pg=0x561ce65c6000, handle=..., rctx=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:8445
#28 0x0000561ccd0fc5c4 in OSD::dequeue_peering_evt (this=0x561cd79b4000, sdata=0x561cd7803d40, pg=0x561ce65c6000, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSDMap.h:670
#29 0x0000561ccd32ddb6 in ceph::osd::scheduler::PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at /usr/include/c++/8/ext/atomicity.h:96
#30 0x0000561ccd0ef62f in ceph::osd::scheduler::OpSchedulerItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f2e3b31f3f0) at /usr/include/c++/8/bits/unique_ptr.h:342
#31 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:10677
#32 0x0000561ccd71d094 in ShardedThreadPool::shardedthreadpool_worker (this=0x561cd79b4a28, thread_index=2) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.cc:311
#33 0x0000561ccd71fcf4 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.h:706
#34 0x00007f2e5bb282de in start_thread () from /lib64/libpthread.so.0

Several other threads are in interesting places:

Thread 33 (Thread 0x7f2e3bb23700 (LWP 55977)):
#0  0x00007f2e5a9308f7 in __memcmp_avx2_movbe () from /lib64/libc.so.6
#1  0x0000561ccd77a588 in std::char_traits<char>::compare (__n=<optimized out>, __s2=<optimized out>, __s1=<optimized out>) at /usr/include/c++/8/bits/char_traits.h:312
#2  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::compare (__str="benchmark_data_smithi200_80267_object41981", this=0x561ceb56ec50) at /usr/include/c++/8/bits/basic_string.h:2849
#3  std::operator< <char, std::char_traits<char>, std::allocator<char> > (__rhs="benchmark_data_smithi200_80267_object41981", __lhs="benchmark_data_smithi200_80267_object41981") at /usr/include/c++/8/bits/basic_string.h:6136
#4  operator< (r=..., l=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/object.h:72
#5  cmp (r=..., l=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/hobject.cc:347
#6  cmp (l=..., r=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/hobject.cc:321
#7  0x0000561ccd191790 in operator< (r=..., l=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/hobject.h:308
#8  std::less<hobject_t>::operator() (this=<optimized out>, __y=..., __x=...) at /usr/include/c++/8/bits/stl_function.h:386
#9  std::_Rb_tree<hobject_t, std::pair<hobject_t const, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > >, std::_Select1st<std::pair<hobject_t const, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > > >, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > > > >::_M_lower_bound (this=<optimized out>, __k=..., __y=0x561ceb56ef70, __x=0x561ceb56ec30) at /usr/include/c++/8/bits/stl_tree.h:1907
#10 std::_Rb_tree<hobject_t, std::pair<hobject_t const, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > >, std::_Select1st<std::pair<hobject_t const, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > > >, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > > > >::find (this=this@entry=0x561cd9b929a8, __k=...) at /usr/include/c++/8/bits/stl_tree.h:2555
#11 0x0000561ccd166030 in std::map<hobject_t, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> >, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > > > >::find (__x=..., this=0x561cd9b929a8)
    at /usr/include/c++/8/bits/stl_map.h:1193
#12 MissingLoc::num_unfound (this=0x561cd9b92978) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/MissingLoc.h:169
#13 PeeringState::get_num_unfound (this=0x561cd9b91440) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/PeeringState.h:2238
#14 operator<< (out=..., pg=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/PG.cc:3411
#15 0x0000561ccd166590 in PG::gen_prefix (this=0x561cd9b90000, out=...) at /usr/include/c++/8/ostream:556
#16 0x0000561ccd362884 in PeeringState::Active::react (this=0x561cead80000, advmap=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/PeeringState.cc:5553
#17 0x0000561ccd395b65 in boost::statechart::custom_reaction<PeeringState::AdvMap>::react<PeeringState::Active, boost::statechart::event_base, void const*> (eventType=<synthetic pointer>: <optimized out>, evt=..., stt=...)
    at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/build/boost/include/boost/statechart/result.hpp:110
#18 boost::statechart::simple_state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list18<boost::statechart::custom_reaction<PeeringState::AdvMap>, boost::statechart::custom_reaction<MInfoRec>, boost::statechart::custom_reaction<MNotifyRec>, boost::statechart::custom_reaction<MLogRec>, boost::statechart::custom_reaction<MTrim>, boost::statechart::custom_reaction<PeeringState::Backfilled>, boost::statechart::custom_reaction<PeeringState::ActivateCommitted>, boost::statechart::custom_reaction<PeeringState::AllReplicasActivated>, boost::statechart::custom_reaction<DeferRecovery>, boost::statechart::custom_reaction<DeferBackfill>, boost::statechart::custom_reaction<PeeringState::UnfoundRecovery>, boost::statechart::custom_reaction<PeeringState::UnfoundBackfill>, boost::statechart::custom_reaction<RemoteReservationRevokedTooFull>, boost::statechart::custom_reaction<RemoteReservationRevoked>, boost::statechart::custom_reaction<PeeringState::DoRecovery>, boost::statechart::custom_reaction<RenewLease>, boost::statechart::custom_reaction<MLeaseAck>, boost::statechart::custom_reaction<PeeringState::CheckReadable> >, boost::statechart::simple_state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0> > (eventType=0x561cce363e18 <boost::statechart::detail::id_holder<PeeringState::AdvMap>::idProvider_>, evt=..., stt=...)
    at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/build/boost/include/boost/statechart/simple_state.hpp:814
...

and
Thread 7 (Thread 0x7f2e36b19700 (LWP 55987)):
#0  0x00007f2e5a930a08 in __memcmp_avx2_movbe () from /lib64/libc.so.6
#1  0x0000561ccd121f1b in std::char_traits<char>::compare (__n=<optimized out>, __s2=<optimized out>, __s1=<optimized out>) at /usr/include/c++/8/bits/char_traits.h:312
#2  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::compare (__str="0000000368.", '0' <repeats 16 times>, "3073", this=0x561cdcae2660) at /usr/include/c++/8/bits/basic_string.h:2849
#3  std::operator< <char, std::char_traits<char>, std::allocator<char> > (__rhs="0000000368.", '0' <repeats 16 times>, "3073", __lhs="0000000369.", '0' <repeats 16 times>, "3200") at /usr/include/c++/8/bits/basic_string.h:6136
#4  std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::operator() (this=<optimized out>, __y="0000000368.", '0' <repeats 16 times>, "3073", __x="0000000369.", '0' <repeats 16 times>, "3200") at /usr/include/c++/8/bits/stl_function.h:386
#5  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Identity<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_lower_bound (this=<optimized out>, __k="0000000368.", '0' <repeats 16 times>, "3073", __y=0x561cdcb12640, __x=0x561cdcae2640)
    at /usr/include/c++/8/bits/stl_tree.h:1907
#6  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Identity<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::find (this=this@entry=0x561ce07b9a20, __k="0000000368.", '0' <repeats 16 times>, "3073") at /usr/include/c++/8/bits/stl_tree.h:2555
#7  0x0000561ccd1b33ad in std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::count
    (__x="0000000368.", '0' <repeats 16 times>, "3073", this=0x561ce07b9a20) at /usr/include/c++/8/bits/stl_tree.h:991
#8  PGLog::check (this=0x561ce07b9718) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/PGLog.cc:612
#9  0x0000561ccd1b4019 in PGLog::undirty (this=0x561ce07b9718) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/PGLog.h:676
#10 PGLog::write_log_and_missing (this=this@entry=0x561ce07b9718, t=..., km=km@entry=0x7f2e36b15310, coll=..., log_oid=..., require_rollback=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/PGLog.cc:649
#11 0x0000561ccd1681f6 in PG::prepare_write (this=0x561ce07b7400, info=..., last_written_info=..., past_intervals=..., pglog=..., dirty_info=<optimized out>, dirty_big_info=<optimized out>, need_write_epoch=<optimized out>, t=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/osd_types.h:1589
#12 0x0000561ccd33c690 in PeeringState::write_if_dirty (this=this@entry=0x561ce07b8840, t=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSDMap.h:670
#13 0x0000561ccd35eb17 in PeeringState::recover_got (this=this@entry=0x561ce07b8840, oid=..., v=..., is_delete=is_delete@entry=false, t=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/PeeringState.cc:3896
#14 0x0000561ccd1c8902 in PrimaryLogPG::on_local_recover (this=<optimized out>, hoid=..., _recovery_info=..., obc=std::shared_ptr<ObjectContext> (empty) = {...}, is_delete=<optimized out>, t=0x7f2e36b15d10) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/PrimaryLogPG.cc:405
...

and maybe
Thread 1 (Thread 0x7f2e47b3b700 (LWP 55940)):
#0  boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> >::insert_unique_check<unsigned int, boost::intrusive::detail::key_nodeptr_comp<MapKey<WeightedPriorityQueue<ceph::osd::scheduler::OpSchedulerItem, unsigned long>::SubQueue, unsigned int>, boost::intrusive::bhtraits<WeightedPriorityQueue<ceph::osd::scheduler::OpSchedulerItem, unsigned long>::SubQueue, boost::intrusive::rbtree_node_traits<void*, false>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3u>, boost::move_detail::identity<WeightedPriorityQueue<ceph::osd::scheduler::OpSchedulerItem, unsigned long>::SubQueue> > > (pdepth=0x0, commit_data=<synthetic pointer>..., comp=..., key=<synthetic pointer>: 255, header=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/build/boost/include/boost/intrusive/detail/tree_value_compare.hpp:178
#1  boost::intrusive::bstbase2<boost::intrusive::bhtraits<WeightedPriorityQueue<ceph::osd::scheduler::OpSchedulerItem, unsigned long>::SubQueue, boost::intrusive::rbtree_node_traits<void*, false>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3u>, void, void, (boost::intrusive::algo_types)5, void>::insert_unique_check<unsigned int, MapKey<WeightedPriorityQueue<ceph::osd::scheduler::OpSchedulerItem, unsigned long>::SubQueue, unsigned int> > (commit_data=<synthetic pointer>..., commit_data=<synthetic pointer>..., key=<synthetic pointer>: 255, this=0x561cd78d3520, comp=...)
    at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/build/boost/include/boost/intrusive/bstree.hpp:500
#2  WeightedPriorityQueue<ceph::osd::scheduler::OpSchedulerItem, unsigned long>::Queue::insert (front=false, item=..., cost=0, cl=0, p=255, this=0x561cd78d3518) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WeightedPriorityQueue.h:217
#3  WeightedPriorityQueue<ceph::osd::scheduler::OpSchedulerItem, unsigned long>::enqueue_strict (item=..., p=255, cl=0, this=0x561cd78d3510) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WeightedPriorityQueue.h:318
#4  ceph::osd::scheduler::ClassedOpQueueScheduler<WeightedPriorityQueue<ceph::osd::scheduler::OpSchedulerItem, unsigned long> >::enqueue (this=0x561cd78d3500, item=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/scheduler/OpScheduler.h:96
#5  0x0000561ccd0f2264 in OSD::ShardedOpWQ::_enqueue (this=0x561cd79b4ec8, item=...) at /usr/include/c++/8/bits/unique_ptr.h:342
#6  0x0000561ccd0f2c68 in ShardedThreadPool::ShardedWQ<ceph::osd::scheduler::OpSchedulerItem>::queue (item=..., this=0x561cd79b4ec8) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.h:684
#7  OSD::enqueue_peering_evt (this=0x561cd79b4000, pgid=..., evt=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9607
#8  0x0000561ccd0fd553 in OSD::consume_map (this=0x561cd79b4000) at /usr/include/c++/8/ext/new_allocator.h:86
#9  0x0000561ccd102a3c in OSD::_committed_osd_maps (this=0x561cd79b4000, first=<optimized out>, last=<optimized out>, m=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:8273
#10 0x0000561ccd1562cb in C_OnMapCommit::finish (this=0x561cdb289e60, r=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:7678
#11 0x0000561ccd10b06d in Context::complete (this=0x561cdb289e60, r=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/Context.h:77
#12 0x0000561ccd6e8f15 in Finisher::finisher_thread_entry (this=0x561cd84f0448) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/Finisher.cc:66
#13 0x00007f2e5bb282de in start_thread () from /lib64/libpthread.so.0

/a/sage-2020-01-29_20:14:58-rados-wip-sage-testing-2020-01-29-1034-distro-basic-smithi/4718264


Related issues

Related to RADOS - Bug #46443: ceph_osd crash in _committed_osd_maps when failed to encode first inc map (Resolved)
Copied to RADOS - Backport #44206: nautilus: osd segv in ceph::buffer::v14_2_0::ptr::release (PGTempMap::decode) (Resolved)

History

#1 Updated by Sage Weil over 1 year ago

If I start the OSD manually, I can reproduce the same crash:

[root@smithi200 ~]# ceph-osd -i 2 -f
2020-01-30T14:49:12.918+0000 7fba2e58dec0 -1 Falling back to public interface
2020-01-30T14:49:16.255+0000 7fba2e58dec0 -1 osd.2 517 log_to_monitors {default=true}
*** Caught signal (Segmentation fault) **
 in thread 7fba14551700 thread_name:cfin
 ceph version 15.0.0-10071-g5b5a3a3 (5b5a3a3c2128a66664612cd5b7590c5438ac250e) octopus (dev)
 1: (()+0x12dd0) [0x7fba2c550dd0]
 2: (ceph::buffer::v14_2_0::ptr::c_str()+0xf) [0x562825734e2f]
 3: (ceph::buffer::v14_2_0::list::rebuild(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>)+0x48) [0x562825736ce8]
 4: (ceph::buffer::v14_2_0::list::rebuild()+0xfb) [0x56282573954b]
 5: (OSDService::_add_map_bl(unsigned int, ceph::buffer::v14_2_0::list&)+0x51) [0x562824f2d4b1]
 6: (OSDService::_get_map_bl(unsigned int, ceph::buffer::v14_2_0::list&)+0x348) [0x562824f2dd38]
 7: (OSDService::try_get_map(unsigned int)+0x66b) [0x562824f30d6b]
 8: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x4ae) [0x562824f8a3be]
 9: (C_OnMapCommit::finish(int)+0x1b) [0x562824fde2cb]
 10: (Context::complete(int)+0xd) [0x562824f9306d]
 11: (Finisher::finisher_thread_entry()+0x1a5) [0x562825570f15]
 12: (()+0x82de) [0x7fba2c5462de]
 13: (clone()+0x43) [0x7fba2b2f04b3]
2020-01-30T14:49:17.894+0000 7fba14551700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fba14551700 thread_name:cfin

#2 Updated by Sage Weil over 1 year ago

The second time, the failure looks different:

[root@smithi200 ~]# ceph-osd -i 2 -f
2020-01-30T14:49:22.898+0000 7fdd5841bec0 -1 Falling back to public interface
2020-01-30T14:49:24.567+0000 7fdd5841bec0 -1 osd.2 1317 log_to_monitors {default=true}
src/tcmalloc.cc:332] Attempt to free invalid pointer 0x7fdd423e4330 
*** Caught signal (Segmentation fault) **
 in thread 7fdd423e7700 thread_name:cfin
*** Caught signal (Aborted) **
 in thread 7fdd33bca700 thread_name:tp_osd_tp
remove lseek failed (9) Bad file descriptor
 ceph version 15.0.0-10071-g5b5a3a3 (5b5a3a3c2128a66664612cd5b7590c5438ac250e) octopus (dev)
 1: (()+0x12dd0) [0x7fdd563dedd0]
 2: (gsignal()+0x10f) [0x7fdd550b999f]
 3: (abort()+0x127) [0x7fdd550a3cf5]
 4: (()+0x18339) [0x7fdd56efd339]
 5: (()+0x19c79) [0x7fdd56efec79]
 6: (md_config_t::_get_val(ConfigValues const&, std::basic_string_view<char, std::char_traits<char> >, boost::container::small_vector<std::pair<Option const*, boost::variant<boost::blank, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<long, std::ratio<1l, 1l> >, Option::size_t, uuid_d> const*>, 4ul, void, void>*, std::ostream*) const+0xcf) [0x5647e652c77f]
 7: (md_config_t::get_val_generic[abi:cxx11](ConfigValues const&, std::basic_string_view<char, std::char_traits<char> >) const+0x5c) [0x5647e652ca3c]
 8: (double const md_config_t::get_val<double>(ConfigValues const&, std::basic_string_view<char, std::char_traits<char> >) const+0x35) [0x5647e5ef93d5]
 9: (OSD::get_osd_delete_sleep()+0x63) [0x5647e5e6b9b3]
 10: (PG::do_delete_work(ceph::os::Transaction&)+0x104) [0x5647e5f5cca4]
 11: (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e) [0x5647e610d61e]
 12: (boost::statechart::simple_state<PeeringState::Deleting, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x125) [0x5647e6172995]
 13: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x5647e5f6709b]
 14: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x5647e5f59611]
 15: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x5647e5ed07bc]
 16: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x5647e5ed0998]
 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x5647e5ec362f]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5647e64f1094]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5647e64f3cf4]
 20: (()+0x82de) [0x7fdd563d42de]
 21: (clone()+0x43) [0x7fdd5517e4b3]
2020-01-30T14:49:24.907+0000 7fdd33bca700 -1 *** Caught signal (Aborted) **

#3 Updated by Radoslaw Zarzynski over 1 year ago

It looks like the entire `PGTempMap::data` has been corrupted:

[root@cde9ee38ceed ~]# gdb /usr/bin/ceph-osd 1580358022.55757.core
...
(gdb) frame
#21 ceph::buffer::v14_2_0::list::clear (this=0x561cdc15cc50) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/buffer.h:1057
1057    in /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/include/buffer.h
(gdb) print *this
$11 = {
  _buffers = {
    _root = {
      next = 0x561ce07b4000
    }, 
    _tail = 0x561cd9b90000, 
    _size = 94682123899904
  }, 
  _carriage = 0x561cd657b900 <ceph::buffer::v14_2_0::list::always_empty_bptr>, 
  _len = 3766187008, 
  _memcopy_count = 22044, 
  last_p = {
    <ceph::buffer::v14_2_0::list::iterator_impl<false>> = {
      bl = 0x561cd9b96800, 
      ls = 0x561ce07ba800, 
      p = {
        cur = 0x561ce064c000
      }, 
      off = 3650586624, 
      p_off = 22044
    }, <No data fields>}, 
  static always_empty_bptr = {
    _raw = 0x0, 
    _off = 0, 
    _len = 0
  }, 
  static CLAIM_DEFAULT = 0, 
  static CLAIM_ALLOW_NONSHAREABLE = 1
}
(gdb) x/9x this
0x561cdc15cc50:    0xe07b4000    0x0000561c    0xd9b90000    0x0000561c
0x561cdc15cc60:    0xe65c9400    0x0000561c    0xd657b900    0x0000561c
0x561cdc15cc70:    0xe07b7400

#4 Updated by Radoslaw Zarzynski over 1 year ago

It looks like a freshly heap-allocated `OSDMap` instance got corrupted:

#26 0x0000561ccd0a9013 in OSDService::try_get_map (this=0x561cd79b5350, epoch=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:1615
1615    in /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc
(gdb) print map
$17 = (OSDMap *) 0x561cd9b49400
(gdb) print map->pg_temp
$18 = std::shared_ptr<PGTempMap> (use count -430127104, weak count 22043) = {get() = 0x561cdc15cc50}
(gdb) print sizeof(map->pg_temp) / 4
$19 = 4
(gdb) x/4x &map->pg_temp
0x561cd9b49560:    0xdc15cc50    0x0000561c    0xdc15cc40    0x0000561c
(gdb) print/d 0x0000561c
$20 = 22044
(gdb) print 0xdc15cc50 - 0xdc15cc40
$21 = 16

The code in question:

OSDMapRef OSDService::try_get_map(epoch_t epoch)
{
  // try from cache first – potentially we could disable
  // the caching mechanism to improve reproduction rates
  // ...
  OSDMap *map = new OSDMap;
  if (epoch > 0) {
    dout(20) << "get_map " << epoch << " - loading and decoding " << map << dendl;
    bufferlist bl;
    if (!_get_map_bl(epoch, bl) || bl.length() == 0) {
      derr << "failed to load OSD map for epoch " << epoch << ", got " << bl.length() << " bytes" << dendl;
      delete map;
      return OSDMapRef();
    }
    map->decode(bl);
  }
  // ...
}

Please note the pretty specific `22044` (0x561c) pattern. It is exactly the same as in the previously dissected `bufferlist` instance (e.g. the `_memcopy_count` field). Is somebody smashing the heap with an address?

#5 Updated by Radoslaw Zarzynski over 1 year ago

`Thread 63 (Thread 0x7f2e36318700 (LWP 55988))` is poisoned as well.

#6  0x0000561ccd0fa310 in OSD::advance_pg (this=0x561cd79b4000, osd_epoch=<optimized out>, pg=0x561ce07ba800, handle=..., rctx=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:8445
        nextmap = warning: RTTI symbol not found for class 'StackStringBuf<4096ul>'
warning: RTTI symbol not found for class 'StackStringBuf<4096ul>'
std::shared_ptr<const OSDMap> (expired, weak count 0) = {
          get() = 0x7f2e36314e00
        }
        newup = std::vector of length -11288607287682, capacity -11288607287846 = {0, 0, 0, 0, 0, 0, 42, 0, 0, 22044, 0, 0, 909201640, 32558, 909201640, 32558, 0, 0, 1, 0, 0, 0, 0, 0, 909201688, 32558, 909201688, 32558, 0, 0, 0, 0, 909201736, 32558, 909201736, 32558, 0, 0, -698894080, 22044, 0, 0, 909201736, 32558, 909201736, 32558, 909201736, 32558, 0, 0, 909201808, 32558, 909201808, 32558, 0, 0, -698894080, 22044, 0, 0, 909201808, 32558, 909201808, 32558, 909201808, 32558, 0, 0, 909201880, 32558, 909201880, 32558, 0, 0, 909201904, 32558, 909201904, 32558, 0, 0, 909201928, 32558, 909201928, 32558, 0, 0, 0, 0, 1528756157, 32558, -528766976, 22044, -846481487, 22044, -615114648, 22044, -615118752, 22044, -615118840, 22044, 198849280, 1210485214, -615118848, 22044, 0, 0, 909202624, 32558, -665742112, 22044, -835336032, 22044, 909202256, 32558, 909202464, 32558, -852304458, 22044, -382408576, 22044, -382408592, 22044, 909202624, 32558, 198849280, 1210485214, 909202552, 32558, 0, 0, 909202624, 32558, -854657489, 22044, -679100416, 22044, -642113056, 22044, 909202552, 32558, 0, 0, 0, 0, 909202512, 32558, 0, 0, -677687608, 22044, -687665280, 22044, -528766976, 22044, 0, 0, 0, 0, -539189760, 22044, -619801296, 22044, 909202256, 32558, 909202256, 32558, 0, 0, 909202432, 32558, 4, 0, 12, 22044, -524632321, 22044, 1550191505, 32558, 0, 0, 10, 255, 0, 0, 0, 0, 511, 32558, -687665184, 22044, -835479792, 22044, -679100416, 22044, -642113056, 22044, 15, 0, 150, 0...}
        up_primary = -855064082
        oldpool = <optimized out>
        new_pg_num = <optimized out>
        newacting = std::vector of length 941552325, capacity 941552021 = {<error reading variable newacting (Cannot access memory at address 0x561c00000001)>
        acting_primary = 22044
        newpool = <optimized out>
        next_epoch = 511

What is `StackStringBuf`? Sounds like a thing worth a look.

Similar issue in `Thread 46 (Thread 0x7f2e35316700 (LWP 55990))`

#6  0x0000561ccd0fa310 in OSD::advance_pg (this=0x561cd79b4000, osd_epoch=<optimized out>, pg=0x561cd9972000, handle=..., rctx=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:8445
        nextmap = std::shared_ptr<const OSDMap> (empty) = {
          get() = 0x561cd657b900 <ceph::buffer::v14_2_0::list::always_empty_bptr>
        }
        newup = std::vector of length 0, capacity 34959109409696
        up_primary = -698894080
        oldpool = <optimized out>
        new_pg_num = <optimized out>
        newacting = std::vector of length 0, capacity 0
        acting_primary = 22044

The corruption is really widespread:

Thread 34 (Thread 0x7f2e555ce700 (LWP 55816)):
#0  0x00007f2e5a8c7591 in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000561ccd6ccffb in poll (__timeout=-1, __nfds=4, __fds=0x7f2e555cb4b0) at /usr/include/bits/poll2.h:41
No locals.
#2  SignalHandler::entry (this=0x561cd77ea840) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc:488
        fds = {{
            fd = 27, 
            events = 9, 
            revents = 0
          }, {
            fd = 29, 
            events = 9, 
            revents = 0
          }, {
            fd = 57, 
            events = 9, 
            revents = 0
          }, {
            fd = 60, 
            events = 9, 
            revents = 0
          }, {
            fd = -679100152, 
            events = 22044, 
            revents = 0
          }, {
            fd = -846961022, 
            events = 22044, 
            revents = 0
          }, {
            fd = -679506656, 
            events = 22044, 
            revents = 0
          }, {
            fd = -848246111, 
            events = 22044, 
            revents = 0

#6 Updated by Radoslaw Zarzynski over 1 year ago

The problem is not only about heap corruption. Stacks are affected as well. Moreover, there is an interesting corruption area common to all crashes:
Actually, the familiar-looking pattern on the stack of `SignalHandler` is just a consequence of the selective initialization of the `fds` array. Entries are filled in solely for those signals that have a handler set:

struct SignalHandler : public Thread {
  // ...

  // thread entry point
  void *entry() override {
    while (!stop) {
      // build fd list
      struct pollfd fds[33];
      // ...
      for (unsigned i=0; i<32; i++) {
        if (handlers[i]) {
          fds[num_fds].fd = handlers[i]->pipefd[0];
          fds[num_fds].events = POLLIN | POLLERR;
          fds[num_fds].revents = 0;
          ++num_fds;
        }
      }
[root@cde9ee38ceed ~]# gdb /usr/bin/ceph-osd 1580395764.801045.core
...
(gdb) thread 15
[Switching to thread 15 (Thread 0x7fdd4fe7a700 (LWP 801060))]
#0  0x00007fdd55173591 in poll () from /lib64/libc.so.6
(gdb) bt    
#0  0x00007fdd55173591 in poll () from /lib64/libc.so.6
#1  0x00005647e64a0ffb in poll (__timeout=-1, __nfds=4, __fds=0x7fdd4fe774b0) at /usr/include/bits/poll2.h:41
#2  SignalHandler::entry (this=0x5647f176e840) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc:488
#3  0x00007fdd563d42de in start_thread () from /lib64/libpthread.so.0
#4  0x00007fdd5517e4b3 in clone () from /lib64/libc.so.6
(gdb) frame 2
#2  SignalHandler::entry (this=0x5647f176e840) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc:488
488    /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc: No such file or directory.
(gdb) print fds
$1 = {{
    fd = 27, 
    events = 9, 
    revents = 0
  }, {
    fd = 29, 
    events = 9, 
    revents = 0
  }, {
    fd = 58, 
    events = 9, 
    revents = 0
  }, {
    fd = 60, 
    events = 9, 
    revents = 0
  }, {
    fd = -243408632, 
    events = 22087, 
    revents = 0
  }, {
    fd = -429807998, 
    events = 22087, 
    revents = 0
  }, {
    fd = -243806944, 
    events = 22087, 
    revents = 0
  }, {
    fd = -431093087, 
    events = 22087, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = -771534336, 
    events = 21290, 
    revents = 19556
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = -243408896, 
    events = 22087, 
    revents = 0
  }, {
    fd = 1340569048, 
    events = 32733, 
    revents = 0
  }, {
    fd = 1340569048, 
    events = 32733, 
    revents = 0
  }, {
    fd = -243408632, 
    events = 22087, 
    revents = 0
  }, {
    fd = -243408632, 
    events = 22087, 
    revents = 0
  }, {
    fd = -243686656, 
    events = 22087, 
    revents = 0
  }, {
    fd = -429218781, 
    events = 22087, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 1340569040, 
    events = 32733, 
    revents = 0
  }, {
    fd = -243394968, 
    events = 22087, 
    revents = 0
  }, {
    fd = 1340569088, 
    events = 32733, 
    revents = 0
  }, {
    fd = -243408888, 
    events = 22087, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }}
(gdb) print sizeof(fds) / 4
$2 = 66
(gdb) x/66x fds
0x7fdd4fe774b0:    0x0000001b    0x00000009    0x0000001d    0x00000009
0x7fdd4fe774c0:    0x0000003a    0x00000009    0x0000003c    0x00000009
0x7fdd4fe774d0:    0xf17de108    0x00005647    0xe661a682    0x00005647
0x7fdd4fe774e0:    0xf177cd20    0x00005647    0xe64e0aa1    0x00005647
0x7fdd4fe774f0:    0x00000000    0x00000000    0x00000000    0x00000000
0x7fdd4fe77500:    0x00000000    0x00000000    0x00000000    0x00000000
0x7fdd4fe77510:    0x00000000    0x00000000    0x00000000    0x00000000
0x7fdd4fe77520:    0x00000000    0x00000000    0xd2035200    0x4c64532a
0x7fdd4fe77530:    0x00000000    0x00000000    0xf17de000    0x00005647
0x7fdd4fe77540:    0x4fe775d8    0x00007fdd    0x4fe775d8    0x00007fdd
0x7fdd4fe77550:    0xf17de108    0x00005647    0xf17de108    0x00005647
0x7fdd4fe77560:    0xf179a300    0x00005647    0xe66aa423    0x00005647
0x7fdd4fe77570:    0x00000000    0x00000000    0x4fe775d0    0x00007fdd
0x7fdd4fe77580:    0xf17e1668    0x00005647    0x4fe77600    0x00007fdd
0x7fdd4fe77590:    0xf17de008    0x00005647    0x00000000    0x00000000
0x7fdd4fe775a0:    0x00000000    0x00000000    0x00000000    0x00000000
0x7fdd4fe775b0:    0x00000000    0x00000000
[root@cde9ee38ceed ~]# gdb /usr/bin/ceph-osd 1580395757.800772.core
...
(gdb) thread 19
[Switching to thread 19 (Thread 0x7fba25fec700 (LWP 800787))]
#0  0x00007fba2b2e5591 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fba2b2e5591 in poll () from /lib64/libc.so.6
#1  0x0000562825554ffb in poll (__timeout=-1, __nfds=4, __fds=0x7fba25fe94b0) at /usr/include/bits/poll2.h:41
#2  SignalHandler::entry (this=0x56282f6d4840) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc:488
#3  0x00007fba2c5462de in start_thread () from /lib64/libpthread.so.0
#4  0x00007fba2b2f04b3 in clone () from /lib64/libc.so.6
(gdb) frame 2
#2  SignalHandler::entry (this=0x56282f6d4840) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc:488
488    /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc: No such file or directory.
(gdb) set print pretty on
(gdb) print fds
$1 = {{
    fd = 27, 
    events = 9, 
    revents = 0
  }, {
    fd = 29, 
    events = 9, 
    revents = 0
  }, {
    fd = 58, 
    events = 9, 
    revents = 0
  }, {
    fd = 60, 
    events = 9, 
    revents = 0
  }, {
    fd = 796147976, 
    events = 22056, 
    revents = 0
  }, {
    fd = 627893890, 
    events = 22056, 
    revents = 0
  }, {
    fd = 795749664, 
    events = 22056, 
    revents = 0
  }, {
    fd = 626608801, 
    events = 22056, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 425972480, 
    events = 5767, 
    revents = -22502
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 796147712, 
    events = 22056, 
    revents = 0
  }, {
    fd = 637441496, 
    events = 32698, 
    revents = 0
  }, {
    fd = 637441496, 
    events = 32698, 
    revents = 0
  }, {
    fd = 796147976, 
    events = 22056, 
    revents = 0
  }, {
    fd = 796147976, 
    events = 22056, 
    revents = 0
  }, {
    fd = 795869792, 
    events = 22056, 
    revents = 0
  }, {
    fd = 628483107, 
    events = 22056, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 637441488, 
    events = 32698, 
    revents = 0
  }, {
    fd = 796161640, 
    events = 22056, 
    revents = 0
  }, {
    fd = 637441536, 
    events = 32698, 
    revents = 0
  }, {
    fd = 796147720, 
    events = 22056, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }}
(gdb) x/66x fds
0x7fba25fe94b0:    0x0000001b    0x00000009    0x0000001d    0x00000009
0x7fba25fe94c0:    0x0000003a    0x00000009    0x0000003c    0x00000009
0x7fba25fe94d0:    0x2f744108    0x00005628    0x256ce682    0x00005628
0x7fba25fe94e0:    0x2f6e2d20    0x00005628    0x25594aa1    0x00005628
0x7fba25fe94f0:    0x00000000    0x00000000    0x00000000    0x00000000
0x7fba25fe9500:    0x00000000    0x00000000    0x00000000    0x00000000
0x7fba25fe9510:    0x00000000    0x00000000    0x00000000    0x00000000
0x7fba25fe9520:    0x00000000    0x00000000    0x1963d300    0xa81a1687
0x7fba25fe9530:    0x00000000    0x00000000    0x2f744000    0x00005628
0x7fba25fe9540:    0x25fe95d8    0x00007fba    0x25fe95d8    0x00007fba
0x7fba25fe9550:    0x2f744108    0x00005628    0x2f744108    0x00005628
0x7fba25fe9560:    0x2f700260    0x00005628    0x2575e423    0x00005628
0x7fba25fe9570:    0x00000000    0x00000000    0x25fe95d0    0x00007fba
0x7fba25fe9580:    0x2f747668    0x00005628    0x25fe9600    0x00007fba
0x7fba25fe9590:    0x2f744008    0x00005628    0x00000000    0x00000000
0x7fba25fe95a0:    0x00000000    0x00000000    0x00000000    0x00000000
0x7fba25fe95b0:    0x00000000    0x00000000
[root@cde9ee38ceed ~]# gdb /usr/bin/ceph-osd 1580358022.55757.core
...
(gdb) thread 34
[Switching to thread 34 (Thread 0x7f2e555ce700 (LWP 55816))]
#0  0x00007f2e5a8c7591 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f2e5a8c7591 in poll () from /lib64/libc.so.6
#1  0x0000561ccd6ccffb in poll (__timeout=-1, __nfds=4, __fds=0x7f2e555cb4b0) at /usr/include/bits/poll2.h:41
#2  SignalHandler::entry (this=0x561cd77ea840) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc:488
#3  0x00007f2e5bb282de in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2e5a8d24b3 in clone () from /lib64/libc.so.6
(gdb) set print pretty on
(gdb) print fds
No symbol "fds" in current context.
(gdb) frame 2
#2  SignalHandler::entry (this=0x561cd77ea840) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc:488
488    /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/global/signal_handler.cc: No such file or directory.
(gdb) print fds
$1 = {{
    fd = 27, 
    events = 9, 
    revents = 0
  }, {
    fd = 29, 
    events = 9, 
    revents = 0
  }, {
    fd = 57, 
    events = 9, 
    revents = 0
  }, {
    fd = 60, 
    events = 9, 
    revents = 0
  }, {
    fd = -679100152, 
    events = 22044, 
    revents = 0
  }, {
    fd = -846961022, 
    events = 22044, 
    revents = 0
  }, {
    fd = -679506656, 
    events = 22044, 
    revents = 0
  }, {
    fd = -848246111, 
    events = 22044, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 198849280, 
    events = -30242, 
    revents = 18470
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = -679100416, 
    events = 22044, 
    revents = 0
  }, {
    fd = 1432139224, 
    events = 32558, 
    revents = 0
  }, {
    fd = 1432139224, 
    events = 32558, 
    revents = 0
  }, {
    fd = -679100152, 
    events = 22044, 
    revents = 0
  }, {
    fd = -679100152, 
    events = 22044, 
    revents = 0
  }, {
    fd = -679378336, 
    events = 22044, 
    revents = 0
  }, {
    fd = -846371805, 
    events = 22044, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 1432139216, 
    events = 32558, 
    revents = 0
  }, {
    fd = -679086488, 
    events = 22044, 
    revents = 0
  }, {
    fd = 1432139264, 
    events = 32558, 
    revents = 0
  }, {
    fd = -679100408, 
    events = 22044, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }, {
    fd = 0, 
    events = 0, 
    revents = 0
  }}
(gdb) x/66x fds
0x7f2e555cb4b0:    0x0000001b    0x00000009    0x0000001d    0x00000009
0x7f2e555cb4c0:    0x00000039    0x00000009    0x0000003c    0x00000009
0x7f2e555cb4d0:    0xd785c108    0x0000561c    0xcd846682    0x0000561c
0x7f2e555cb4e0:    0xd77f8d20    0x0000561c    0xcd70caa1    0x0000561c
0x7f2e555cb4f0:    0x00000000    0x00000000    0x00000000    0x00000000
0x7f2e555cb500:    0x00000000    0x00000000    0x00000000    0x00000000
0x7f2e555cb510:    0x00000000    0x00000000    0x00000000    0x00000000
0x7f2e555cb520:    0x00000000    0x00000000    0x0bda3300    0x482689de
0x7f2e555cb530:    0x00000000    0x00000000    0xd785c000    0x0000561c
0x7f2e555cb540:    0x555cb5d8    0x00007f2e    0x555cb5d8    0x00007f2e
0x7f2e555cb550:    0xd785c108    0x0000561c    0xd785c108    0x0000561c
0x7f2e555cb560:    0xd7818260    0x0000561c    0xcd8d6423    0x0000561c
0x7f2e555cb570:    0x00000000    0x00000000    0x555cb5d0    0x00007f2e
0x7f2e555cb580:    0xd785f668    0x0000561c    0x555cb600    0x00007f2e
0x7f2e555cb590:    0xd785c008    0x0000561c    0x00000000    0x00000000
0x7f2e555cb5a0:    0x00000000    0x00000000    0x00000000    0x00000000
0x7f2e555cb5b0:    0x00000000    0x00000000

This trait might be quite useful for watchpoint-augmented debugging, as the corruption seems to somehow follow the `fds` location.
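To make the `print fds` and `x/66x fds` views easier to compare, here is a minimal sketch of how two consecutive 32-bit words from the dump map onto one `pollfd`-style entry. It assumes the Linux x86_64 layout (4-byte `fd`, 2-byte `events`, 2-byte `revents`, little-endian). `PollFdView`, `decode_entry` and `as_u64` are hypothetical helper names used only for illustration; they are not part of Ceph.

```cpp
#include <cstdint>

// Illustrative stand-in for struct pollfd as laid out in these cores
// (assumption: Linux x86_64, little-endian, 8 bytes per entry).
struct PollFdView {
  std::int32_t fd;
  std::int16_t events;
  std::int16_t revents;
};

// Interpret two consecutive words printed by `x/66x fds` as one entry:
// the first word is fd, the low half of the second word is events,
// and the high half is revents.
inline PollFdView decode_entry(std::uint32_t w0, std::uint32_t w1) {
  PollFdView v;
  v.fd = static_cast<std::int32_t>(w0);
  v.events = static_cast<std::int16_t>(w1 & 0xffffu);
  v.revents = static_cast<std::int16_t>(w1 >> 16);
  return v;
}

// The same 8 bytes read as a single 64-bit value, which is how a
// pointer-sized stack slot would appear.
inline std::uint64_t as_u64(std::uint32_t w0, std::uint32_t w1) {
  return (static_cast<std::uint64_t>(w1) << 32) | w0;
}
```

With the words from the first core, `decode_entry(0x0000001b, 0x00000009)` gives `fd = 27, events = 9`, matching the first of the four entries actually passed to `poll` (`__nfds=4`). `decode_entry(0xf17de108, 0x00005647)` gives `fd = -243408632, events = 22087`, while `as_u64` on the same pair gives `0x00005647f17de108`. That value sits in the same `0x5647...` range as the `this` pointer in the backtrace, so the entries beyond the first four may simply be pointer-sized stack contents viewed through the `pollfd` type. This is an interpretation of the dump, not something the traces confirm.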

#7 Updated by Sage Weil over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Radoslaw Zarzynski

#9 Updated by Sage Weil about 1 year ago

  • Status changed from In Progress to Pending Backport
  • Backport set to nautilus
  • Pull request ID set to 33336

#10 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #44206: nautilus: osd segv in ceph::buffer::v14_2_0::ptr::release (PGTempMap::decode) added

#11 Updated by Nathan Cutler about 1 year ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#12 Updated by Nathan Cutler 10 months ago

  • Related to Bug #46443: ceph_osd crash in _committed_osd_maps when failed to encode first inc map added
