Bug #53002

crash BlueStore::Onode::put from BlueStore::TransContext::~TransContext

Added by Dan van der Ster 11 months ago. Updated about 1 month ago.

Status: In Progress
Priority: Normal
Backport: pacific, octopus
Regression: No
Severity: 3 - minor

Description

We've just seen this crash in the wild running 15.2.14. Maybe a dup of #50788?

   -14> 2021-10-21T09:42:31.079+0200 7f88e1b2c700  5 prioritycache tune_memory target: 3221225472 mapped: 3201368064 unmapped: 466845696 heap: 3668213760 old mem: 1932735267 new mem: 1932735267
   -13> 2021-10-21T09:42:31.924+0200 7f88dde53700 10 monclient: tick
   -12> 2021-10-21T09:42:31.924+0200 7f88dde53700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-21T09:42:01.925680+0200)
   -11> 2021-10-21T09:42:32.062+0200 7f88dd323700  5 bluestore(/var/lib/ceph/osd/ceph-138) _kv_sync_thread utilization: idle 9.861052482s of 10.001157459s, submitted: 477
   -10> 2021-10-21T09:42:32.080+0200 7f88e1b2c700  5 prioritycache tune_memory target: 3221225472 mapped: 3201417216 unmapped: 466796544 heap: 3668213760 old mem: 1932735267 new mem: 1932735267
    -9> 2021-10-21T09:42:32.080+0200 7f88e1b2c700  5 bluestore.MempoolThread(0x55a9f3e04a08) _resize_shards cache_size: 1932735267 kv_alloc: 889192448 kv_used: 586783984 meta_alloc: 813694976 meta_used: 511074366 data_alloc: 218103808 data_used: 0
    -8> 2021-10-21T09:42:32.115+0200 7f88cd509700  0 <cls> /builddir/build/BUILD/ceph-15.2.14/src/cls/lock/cls_lock.cc:290: Could not read list of current lockers off disk: (2) No such file or directory
    -7> 2021-10-21T09:42:32.925+0200 7f88dde53700 10 monclient: tick
    -6> 2021-10-21T09:42:32.925+0200 7f88dde53700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-21T09:42:02.925792+0200)
    -5> 2021-10-21T09:42:33.082+0200 7f88e1b2c700  5 prioritycache tune_memory target: 3221225472 mapped: 3201490944 unmapped: 466722816 heap: 3668213760 old mem: 1932735267 new mem: 1932735267
    -4> 2021-10-21T09:42:33.111+0200 7f88c9501700  0 <cls> /builddir/build/BUILD/ceph-15.2.14/src/cls/lock/cls_lock.cc:290: Could not read list of current lockers off disk: (2) No such file or directory
    -3> 2021-10-21T09:42:33.206+0200 7f88c8d00700  5 osd.138 360301 heartbeat osd_stat(store_statfs(0xa9ee097000/0x193950000/0xdf90000000, data 0x3408e34bb0/0x340e617000, compress 0x0/0x0/0x0, omap 0x2dc5721e, meta 0x165cf8de2), peers [1,2,3,12,16,21,23,24,27,29,34,35,41,42,45,49,52,55,63,68,70,71,72,77,79,82,83,85,105,108,113,119,124,131,133,137,139,149,150,152,156,161,167,170,175,180,206,211,212,213,217,236,240,245,247,250,252,259,265,269,272,273,274,275,277,280,287] op hist [])
    -2> 2021-10-21T09:42:33.367+0200 7f88cd509700  0 <cls> /builddir/build/BUILD/ceph-15.2.14/src/cls/lock/cls_lock.cc:290: Could not read list of current lockers off disk: (2) No such file or directory
    -1> 2021-10-21T09:42:33.440+0200 7f88cc507700  0 <cls> /builddir/build/BUILD/ceph-15.2.14/src/cls/lock/cls_lock.cc:290: Could not read list of current lockers off disk: (2) No such file or directory
     0> 2021-10-21T09:42:33.457+0200 7f88e232d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f88e232d700 thread_name:bstore_kv_final

 ceph version 15.2.14-7 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)
 1: (()+0xf630) [0x7f88f0f8f630]
 2: (BlueStore::Onode::put()+0x2eb) [0x55a9e87de1fb]
 3: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::Onode>, boost::intrusive_ptr<BlueStore::Onode>, std::_Identity<boost::intrusive_ptr<BlueStore::Onode> >, std::less<boost::intrusive_ptr<BlueStore::Onode> >, std::allocator<boost::intrusive_ptr<BlueStore::Onode> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::Onode> >*)+0x2d) [0x55a9e888297d]
 4: (BlueStore::TransContext::~TransContext()+0x107) [0x55a9e8882aa7]
 5: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x231) [0x55a9e8854041]
 6: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1fc) [0x55a9e8854b7c]
 7: (BlueStore::_kv_finalize_thread()+0x552) [0x55a9e8857a52]
 8: (BlueStore::KVFinalizeThread::entry()+0xd) [0x55a9e8887edd]
 9: (()+0x7ea5) [0x7f88f0f87ea5]
 10: (clone()+0x6d) [0x7f88efe4a9fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The fsck is clean:

# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-138/
fsck success

We have the coredump and can inspect anything that would help.

(gdb) bt
#0  0x00007f88f0f8f4fb in raise () from /lib64/libpthread.so.0
#1  0x000055a9e89501b2 in reraise_fatal (signum=11)
    at /usr/src/debug/ceph-15.2.14/src/global/signal_handler.cc:326
#2  handle_fatal_signal(int) () at /usr/src/debug/ceph-15.2.14/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x000055a9e87de1fb in lock (this=<optimized out>)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/mutex:110
#5  BlueStore::Onode::put (this=0x55aa7ea2b440)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.cc:3588
#6  0x000055a9e888297d in intrusive_ptr_release (o=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.h:3370
#7  ~intrusive_ptr (this=0x55aa49c74c20, __in_chrg=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
#8  destroy<boost::intrusive_ptr<BlueStore::Onode> > (this=0x55aa8e89d578, __p=0x55aa49c74c20)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:140
#9  destroy<boost::intrusive_ptr<BlueStore::Onode> > (__a=..., __p=0x55aa49c74c20)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/alloc_traits.h:487
#10 _M_destroy_node (this=0x55aa8e89d578, __p=0x55aa49c74c00)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:661
#11 _M_drop_node (this=0x55aa8e89d578, __p=0x55aa49c74c00)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:669
#12 std::_Rb_tree<boost::intrusive_ptr<BlueStore::Onode>, boost::intrusive_ptr<BlueStore::Onode>, std::_Identity<boost::intrusive_ptr<BlueStore::Onode> >, std::less<boost::intrusive_ptr<BlueStore::Onode> >, std::allocator<boost::intrusive_ptr<BlueStore::Onode> > >::_M_erase (
    this=this@entry=0x55aa8e89d578, __x=0x55aa49c74c00)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:1874
#13 0x000055a9e8882aa7 in ~_Rb_tree (this=0x55aa8e89d578, __in_chrg=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.h:1595
#14 ~set (this=0x55aa8e89d578, __in_chrg=<optimized out>)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_set.h:281
#15 BlueStore::TransContext::~TransContext (this=0x55aa8e89d500, __in_chrg=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.h:1594
#16 0x000055a9e8854041 in ~TransContext (this=0x55aa8e89d500, __in_chrg=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.cc:11993
#17 BlueStore::_txc_finish(BlueStore::TransContext*) ()
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.cc:11993
#18 0x000055a9e8854b7c in BlueStore::_txc_state_proc(BlueStore::TransContext*) ()
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.cc:11709
#19 0x000055a9e8857a52 in BlueStore::_kv_finalize_thread() ()
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.cc:12556
#20 0x000055a9e8887edd in BlueStore::KVFinalizeThread::entry (this=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.h:1912
#21 0x00007f88f0f87ea5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f88efe4a9fd in clone () from /lib64/libc.so.6
(gdb) 

(gdb) up
#1  0x000055a9e89501b2 in reraise_fatal (signum=11)
    at /usr/src/debug/ceph-15.2.14/src/global/signal_handler.cc:326
326        reraise_fatal(signum);
(gdb) up
#2  handle_fatal_signal(int) ()
    at /usr/src/debug/ceph-15.2.14/src/global/signal_handler.cc:326
326        reraise_fatal(signum);
(gdb) up
#3  <signal handler called>
(gdb) up
#4  0x000055a9e87de1fb in lock (this=<optimized out>)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/mutex:110
110    /opt/rh/devtoolset-8/root/usr/include/c++/8/mutex: No such file or directory.
(gdb) up
#5  BlueStore::Onode::put (this=0x55aa7ea2b440)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.cc:3588
3588          ocs->lock.lock();
(gdb) list
3583        ocs->lock.lock();
3584        // It is possible that during waiting split_cache moved us to different OnodeCacheShard.
3585        while (ocs != c->get_onode_cache()) {
3586          ocs->lock.unlock();
3587          ocs = c->get_onode_cache();
3588          ocs->lock.lock();
3589        }
3590        bool need_unpin = pinned;
3591        pinned = pinned && nref > 2; // intentionally use > not >= as we have
3592                                     // +1 due to pinned state
(gdb) up
#6  0x000055a9e888297d in intrusive_ptr_release (o=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.h:3370
3370      o->put();
(gdb) list
3365    
3366    static inline void intrusive_ptr_add_ref(BlueStore::Onode *o) {
3367      o->get();
3368    }
3369    static inline void intrusive_ptr_release(BlueStore::Onode *o) {
3370      o->put();
3371    }
3372    
3373    static inline void intrusive_ptr_add_ref(BlueStore::OpSequencer *o) {
3374      o->get();
(gdb) up
#7  ~intrusive_ptr (this=0x55aa49c74c20, __in_chrg=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
98            if( px != 0 ) intrusive_ptr_release( px );
(gdb) list
93            if( px != 0 ) intrusive_ptr_add_ref( px );
94        }
95    
96        ~intrusive_ptr()
97        {
98            if( px != 0 ) intrusive_ptr_release( px );
99        }
100    
101    #if !defined(BOOST_NO_MEMBER_TEMPLATES) || defined(BOOST_MSVC6_MEMBER_TEMPLATES)
102    
(gdb) up
#8  destroy<boost::intrusive_ptr<BlueStore::Onode> > (this=0x55aa8e89d578, __p=0x55aa49c74c20)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:140
140    /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h: No such file or directory.
(gdb) up
#9  destroy<boost::intrusive_ptr<BlueStore::Onode> > (__a=..., __p=0x55aa49c74c20)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/alloc_traits.h:487
487    /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/alloc_traits.h: No such file or directory.
(gdb) up
#10 _M_destroy_node (this=0x55aa8e89d578, __p=0x55aa49c74c00)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:661
661    /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h: No such file or directory.
(gdb) up
#11 _M_drop_node (this=0x55aa8e89d578, __p=0x55aa49c74c00)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:669
669    in /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h
(gdb) up
#12 std::_Rb_tree<boost::intrusive_ptr<BlueStore::Onode>, boost::intrusive_ptr<BlueStore::Onode>, std::_Identity<boost::intrusive_ptr<BlueStore::Onode> >, std::less<boost::intrusive_ptr<BlueStore::Onode> >, std::allocator<boost::intrusive_ptr<BlueStore::Onode> > >::_M_erase (
    this=this@entry=0x55aa8e89d578, __x=0x55aa49c74c00)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:1874
1874    in /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h
(gdb) up
#13 0x000055a9e8882aa7 in ~_Rb_tree (this=0x55aa8e89d578, __in_chrg=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.h:1595
1595          delete deferred_txn;
(gdb) list
1590          if (on_commits) {
1591        oncommits.swap(*on_commits);
1592          }
1593        }
1594        ~TransContext() {
1595          delete deferred_txn;
1596        }
1597    
1598        void write_onode(OnodeRef &o) {
1599          onodes.insert(o);
(gdb) 
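For reference, the locking pattern at the crash site (BlueStore.cc:3583-3589 above) can be sketched in isolation. This is a simplified stand-in with hypothetical names (CacheShard, Collection, lock_current_shard), not the real BlueStore types: put() locks the collection's current onode cache shard and then re-validates it, because a concurrent split_cache can migrate the onode to a different shard while put() is blocked waiting for the lock. It is the lock() call in this loop that faults in the backtrace above.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for BlueStore::OnodeCacheShard: just the lock matters here.
struct CacheShard {
    std::mutex lock;
};

// Hypothetical stand-in for BlueStore::Collection; a (hypothetical) split_cache
// would swap `shard` while other threads are running.
struct Collection {
    std::atomic<CacheShard*> shard;
    CacheShard* get_onode_cache() { return shard.load(); }
};

// Lock whichever shard currently owns the collection's onodes and return it,
// mirroring the double-checked loop at BlueStore.cc:3583-3589.
inline CacheShard* lock_current_shard(Collection* c) {
    CacheShard* ocs = c->get_onode_cache();
    ocs->lock.lock();
    // While we were waiting, split_cache may have moved us to a different
    // shard; loop until the shard we hold is still the current one.
    while (ocs != c->get_onode_cache()) {
        ocs->lock.unlock();
        ocs = c->get_onode_cache();
        ocs->lock.lock();
    }
    return ocs;
}
```

The re-check only stays safe while the collection pointer itself remains valid; if `c` is garbage, `get_onode_cache()` dereferences freed memory before any lock can help.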

Related issues

Related to bluestore - Bug #50788: crash in BlueStore::Onode::put() Duplicate
Related to bluestore - Bug #47740: OSD crash when increase pg_num Duplicate
Duplicates bluestore - Bug #56174: rook-ceph-osd crash randomly Duplicate
Duplicates bluestore - Bug #54727: crash: __pthread_mutex_lock() Duplicate
Duplicates bluestore - Bug #56200: crash: ceph::buffer::ptr::release() Duplicate
Duplicates bluestore - Bug #54650: crash: BlueStore::Onode::put() Duplicate
Copied to bluestore - Backport #53608: pacific: crash BlueStore::Onode::put from BlueStore::TransContext::~TransContext Resolved
Copied to bluestore - Backport #53609: octopus: crash BlueStore::Onode::put from BlueStore::TransContext::~TransContext Resolved

History

#1 Updated by Igor Fedotov 11 months ago

  • Related to Bug #50788: crash in BlueStore::Onode::put() added

#2 Updated by Igor Fedotov 11 months ago

Dan van der Ster wrote:

We've just seen this crash in the wild running 15.2.14. Maybe a dup of #50788?

I'm pretty sure it is...

Aren't there any indications of a recent PG split?

#3 Updated by Dan van der Ster 11 months ago

Igor Fedotov wrote:

Dan van der Ster wrote:

We've just seen this crash in the wild running 15.2.14. Maybe a dup of #50788?

I'm pretty sure it is...

Aren't there any indications of a recent PG split?

Not recently AFAIK... we have nopgchange set on all the pools.

#4 Updated by Dan van der Ster 11 months ago

More context: the cluster was upgraded from 14.2.20 to 15.2.14 two weeks ago. We had never seen this before today; it has happened only once, on this one OSD, so far.

#5 Updated by Dan van der Ster 11 months ago

In frame 7 I can print the Onode. Some of the values look quite strange (though I don't know if that's normal):

(gdb) f
#7  ~intrusive_ptr (this=0x55aa49c74c20, __in_chrg=<optimized out>)
    at /usr/src/debug/ceph-15.2.14/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:98
98            if( px != 0 ) intrusive_ptr_release( px );
(gdb) list
93            if( px != 0 ) intrusive_ptr_add_ref( px );
94        }
95    
96        ~intrusive_ptr()
97        {
98            if( px != 0 ) intrusive_ptr_release( px );
99        }
100    
101    #if !defined(BOOST_NO_MEMBER_TEMPLATES) || defined(BOOST_MSVC6_MEMBER_TEMPLATES)
102    
(gdb) p px
$11 = (BlueStore::Onode *) 0x55aa7ea2b440
(gdb) p *px
$12 = {nref = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 1024138560}, 
    static is_always_lock_free = true}, c = 0x200, oid = {hobj = {static POOL_META = -1, 
      static POOL_TEMP_START = -2, oid = {
        name = <error reading variable: Cannot access memory at address 0x55aaffffffe7>}, 
      snap = {val = 8295752894954156584}, hash = 543712117, max = 102, 
      nibblewise_key_cache = 544370464, hash_reverse_bits = 1701996900, pool = 521610949731, 
      nspace = "cta-cristina", key = ""}, generation = 18446744073709551615, shard_id = {
      id = -1 '\377', static NO_SHARD = {id = -1 '\377', 
        static NO_SHARD = <same as static member of an already seen type>}}, max = false, 
    static NO_GEN = 18446744073709551615}, key = "", 
  lru_item = {<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>> = {<boost::intrusive::list_node<void*>> = {next_ = 0x0, 
        prev_ = 0x0}, <boost::intrusive::hook_tags_definer<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>, 0>> = {<No data fields>}, <No data fields>}, <No data fields>}, onode = {nid = 0, size = 0, 
    attrs = std::map with 0 elements, 
    extent_map_shards = std::vector of length 0, capacity 0, expected_object_size = 0, 
    expected_write_size = 0, alloc_hint_flags = 0, flags = 0 '\000'}, exists = false, 
  cached = false, pinned = {_M_base = {static _S_alignment = 1, _M_i = false}, 
    static is_always_lock_free = true}, extent_map = {onode = 0x55aa7ea2b440, 
    extent_map = {<boost::intrusive::set_impl<boost::intrusive::bhtraits<BlueStore::Extent, boost::intrusive::rbtree_node_traits<void*, true>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3>, void, void, unsigned long, true, void>> = {<boost::intrusive::bstree_impl<boost::intrusive::bhtraits<BlueStore::Extent, boost::intrusive::rbtree_node_traits<void*, true>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3>, void, void, unsigned long, true, (boost::intrusive::algo_types)5, void>> = {<boost::intrusive::bstbase<boost::intrusive::bhtraits<BlueStore::Extent, boost::intrusive::rbtree_node_traits<void*, true>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3>, void, void, true, unsigned long, (boost::intrusive::algo_types)5, void>> = {<boost::intrusive::bstbase_hack<boost::intrusive::bhtraits<BlueStore::Extent, boost::intrusive::rbtree_node_traits<void*, true>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3>, void, void, true, unsigned long, (boost::intrusive::algo_types)5, void>> = {<boost::intrusive::detail::size_holder<true, unsigned long, void>> = {
                static constant_time_size = <optimized out>, 
                size_ = 0}, <boost::intrusive::bstbase2<boost::intrusive::bhtraits<BlueStore::Extent, boost::intrusive::rbtree_node_traits<void*, true>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3>, void, void, (boost::intrusive::algo_types)5, void>> = {<boost::intrusive::detail::ebo_functor_holder<boost::intrusive::tree_value_compare<BlueStore::Extent*, std::less<BlueStore::Extent>, boost::move_detail::identity<BlueStore::Extent>, bool, true>, void, false>> = {<boost::intrusive::tree_value_compare<BlueStore::Extent*, std::less<BlueStore::Extent>, boost::move_detail::identity<BlueStore::Extent>, bool, true>> = {<boost::intrusive::detail::ebo_functor_holder<std::less<BlueStore::Extent>, void, false>> = {<std::less<BlueStore::Extent>> = {<std::binary_function<BlueStore::Extent, BlueStore::Extent, bool>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <boost::intrusive::bstbase3<boost::intrusive::bhtraits<BlueStore::Extent, boost::intrusive::rbtree_node_traits<void*, true>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3>, (boost::intrus---Type <return> to continue, or q <return> to quit---
ive::algo_types)5, void>> = {static safemode_or_autounlink = <optimized out>, 
                  static stateful_value_traits = <optimized out>, 
                  static has_container_from_iterator = <optimized out>, 
                  holder = {<boost::intrusive::bhtraits<BlueStore::Extent, boost::intrusive::rbtree_node_traits<void*, true>, (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, 3>> = {<boost::intrusive::bhtraits_base<BlueStore::Extent, boost::intrusive::compact_rbtree_node<void*>*, boost::intrusive::dft_tag, 3>> = {<No data fields>}, 
                      static link_mode = boost::intrusive::safe_link}, 
                    root = {<boost::intrusive::compact_rbtree_node<void*>> = {parent_ = 0x0, 
                        left_ = 0x55aa7ea2b540, 
                        right_ = 0x55aa7ea2b540}, <No data fields>}}}, <No data fields>}, <No data fields>}, <No data fields>}, static constant_time_size = true, 
          static stateful_value_traits = <optimized out>, 
          static safemode_or_autounlink = true}, 
        static constant_time_size = true}, <No data fields>}, 
    spanning_blob_map = std::map with 0 elements, 
    shards = std::vector of length 0, capacity 0, inline_bl = {_buffers = {_root = {
          next = 0x55aa7ea2b5c0}, _tail = 0x55aa7ea2b5c0}, 
      _carriage = 0x55a9f17a8d90 <ceph::buffer::v15_2_0::list::always_empty_bptr>, _len = 0, 
      _num = 0, static always_empty_bptr = {_raw = 0x0, _off = 0, _len = 0}}, 
    needs_reshard_begin = 0, needs_reshard_end = 0}, 
  flushing_count = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 0}, 
    static is_always_lock_free = true}, waiting_count = {<std::__atomic_base<int>> = {
      static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}, 
  flush_lock = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, 
          __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {
            __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
        __align = 0}}, <No data fields>}, flush_cond = {_M_cond = {__data = {__lock = 1, 
        __futex = 0, __total_seq = 18446744073709551615, __wakeup_seq = 0, __woken_seq = 0, 
        __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, 
      __size = "\001\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377", '\000' <repeats 31 times>, __align = 1}}}
(gdb) 

E.g. down in frame 5, `c` has address 0x200?!

(gdb) f
#5  BlueStore::Onode::put (this=0x55aa7ea2b440)
    at /usr/src/debug/ceph-15.2.14/src/os/bluestore/BlueStore.cc:3588
3588          ocs->lock.lock();
(gdb) list
3583        ocs->lock.lock();
3584        // It is possible that during waiting split_cache moved us to different OnodeCacheShard.
3585        while (ocs != c->get_onode_cache()) {
3586          ocs->lock.unlock();
3587          ocs = c->get_onode_cache();
3588          ocs->lock.lock();
3589        }
3590        bool need_unpin = pinned;
3591        pinned = pinned && nref > 2; // intentionally use > not >= as we have
3592                                     // +1 due to pinned state
(gdb) p c
$16 = (BlueStore::Collection *) 0x200
(gdb) p *c
Cannot access memory at address 0x200
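Those values are consistent with a use-after-free: the Onode's memory appears to have been freed and reused before this final put(). A minimal sketch of the intrusive refcount contract behind the intrusive_ptr_add_ref/intrusive_ptr_release pair shown above; Node here is a hypothetical stand-in, not the real Onode:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical minimal intrusive-refcounted node. Once nref hits zero the
// object deletes itself, so any extra put() (a double release, or a release
// racing with destruction) runs on freed memory -- which heap reuse then
// fills with garbage like the nref = 1024138560 / c = 0x200 seen in the dump.
struct Node {
    std::atomic<int> nref{0};
    bool* destroyed;                 // lets a caller observe destruction
    explicit Node(bool* d) : destroyed(d) {}
    ~Node() { *destroyed = true; }
    void get() { nref.fetch_add(1); }
    void put() {
        if (nref.fetch_sub(1) == 1)  // we dropped the last reference
            delete this;
    }
};
```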

#6 Updated by Neha Ojha 11 months ago

  • Assignee set to Igor Fedotov

#7 Updated by Igor Fedotov 11 months ago

  • Status changed from New to In Progress
  • Pull request ID set to 43770

#8 Updated by Igor Fedotov 11 months ago

  • Backport set to pacific, octopus

#9 Updated by Igor Fedotov 11 months ago

  • Status changed from In Progress to Pending Backport

#10 Updated by Igor Fedotov 11 months ago

  • Status changed from Pending Backport to Fix Under Review

#11 Updated by Igor Fedotov 10 months ago

  • Status changed from Fix Under Review to Pending Backport

#12 Updated by Backport Bot 10 months ago

  • Copied to Backport #53608: pacific: crash BlueStore::Onode::put from BlueStore::TransContext::~TransContext added

#13 Updated by Backport Bot 10 months ago

  • Copied to Backport #53609: octopus: crash BlueStore::Onode::put from BlueStore::TransContext::~TransContext added

#14 Updated by Igor Fedotov 7 months ago

  • Status changed from Pending Backport to Resolved

#15 Updated by Igor Fedotov 3 months ago

  • Duplicates Bug #56174: rook-ceph-osd crash randomly added

#16 Updated by Igor Fedotov about 2 months ago

  • Duplicates Bug #54727: crash: __pthread_mutex_lock() added

#17 Updated by Igor Fedotov about 2 months ago

  • Duplicates Bug #56200: crash: ceph::buffer::ptr::release() added

#18 Updated by Igor Fedotov about 2 months ago

  • Duplicates Bug #54650: crash: BlueStore::Onode::put() added

#19 Updated by Igor Fedotov about 2 months ago

  • Related to Bug #47740: OSD crash when increase pg_num added

#20 Updated by Sven Kieske about 2 months ago

According to https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/PPWIFPEI3EVBU3GQYYO6ABGF23WR5SGZ/ this is not resolved yet. Could this be reopened, please?

#21 Updated by Igor Fedotov about 2 months ago

  • Status changed from Resolved to New

Looks like this hasn't been completely fixed yet.
We've got a bunch of new tickets from the Telemetry bot indicating the same or similar symptoms (Onode::put is primarily involved) on Ceph releases that already include PR #43770 (and its backports).

Some cases from the field that I observed personally:
1) 15.2.16
Aug 05 23:34:51 ceph-osd2861: *** Caught signal (Segmentation fault) **
Aug 05 23:34:51 ceph-osd2861: in thread 7f08cf3a0700 thread_name:tp_osd_tp
Aug 05 23:34:51 ceph-osd2861: ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
Aug 05 23:34:51 ceph-osd2861: 1: (()+0x12730) [0x7f08ec91e730]
Aug 05 23:34:51 ceph-osd2861: 2: (ceph::buffer::v15_2_0::ptr::release()+0x26) [0x5650f3904d26]
Aug 05 23:34:51 ceph-osd2861: 3: (BlueStore::Onode::put()+0x1a9) [0x5650f35b6a79]
Aug 05 23:34:51 ceph-osd2861: 4: (std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x64) [0x5650f3662ca4]
Aug 05 23:34:51 ceph-osd2861: 5: (BlueStore::OnodeSpace::_remove(ghobject_t const&)+0x290) [0x5650f35b68a0]
Aug 05 23:34:51 ceph-osd2861: 6: (LruOnodeCacheShard::_trim_to(unsigned long)+0xdb) [0x5650f36631db]
Aug 05 23:34:51 ceph-osd2861: 7: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>&)+0x48d) [0x5650f35b74cd]
Aug 05 23:34:51 ceph-osd2861: 8: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x453) [0x5650f35fdac3]
Aug 05 23:34:51 ceph-osd2861: 9: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1dc3) [0x5650f3633353]
Aug 05 23:34:51 ceph-osd2861: 10: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x408) [0x5650f3634778]
Aug 05 23:34:51 ceph-osd2861: 11: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x5650f32e7c14]
Aug 05 23:34:51 ceph-osd2861: 12: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xdf4) [0x5650f347b804]
Aug 05 23:34:51 ceph-osd2861: 13: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x5650f348ad57]
Aug 05 23:34:51 ceph-osd2861: 14: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x57) [0x5650f331d917]
Aug 05 23:34:51 ceph-osd2861: 15: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x62f) [0x5650f32c14df]
Aug 05 23:34:51 ceph-osd2861: 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x325) [0x5650f3159d35]
Aug 05 23:34:51 ceph-osd2861: 17: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x64) [0x5650f339dea4]
Aug 05 23:34:51 ceph-osd2861: 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12fa) [0x5650f317678a]
Aug 05 23:34:51 ceph-osd2861: 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x5650f37801f4]
Aug 05 23:34:51 ceph-osd2861: 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5650f3782c70]
Aug 05 23:34:51 ceph-osd2861: 21: (()+0x7fa3) [0x7f08ec913fa3]
Aug 05 23:34:51 ceph-osd2861: 22: (clone()+0x3f) [0x7f08ec4beeff]

or

Aug 05 00:33:29 ceph-osd2863: *** Caught signal (Segmentation fault) **
Aug 05 00:33:29 ceph-osd2863: in thread 7f4613a22700 thread_name:bstore_kv_final
Aug 05 00:33:29 ceph-osd2863: ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
Aug 05 00:33:29 ceph-osd2863: 1: (()+0x12730) [0x7f461ff7e730]
Aug 05 00:33:29 ceph-osd2863: 2: (BlueStore::Onode::put()+0x193) [0x564c15db8a63]
Aug 05 00:33:29 ceph-osd2863: 3: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::Onode>, boost::intrusive_ptr<BlueStore::Onode>, std::_Identity<boost::intrusive_ptr<BlueStore::Onode> >, std::less<boost::intrusive_ptr<BlueStore::Onode> >, std::allocator<boost::intrusive_ptr<BlueStore::Onode> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::Onode> >*)+0x2d) [0x564c15e6460d]
Aug 05 00:33:29 ceph-osd2863: 4: (BlueStore::TransContext::~TransContext()+0x117) [0x564c15e64747]
Aug 05 00:33:29 ceph-osd2863: 5: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x24b) [0x564c15e0bb8b]
Aug 05 00:33:29 ceph-osd2863: 6: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x234) [0x564c15e23744]
Aug 05 00:33:29 ceph-osd2863: 7: (BlueStore::_kv_finalize_thread()+0x552) [0x564c15e2e3e2]
Aug 05 00:33:29 ceph-osd2863: 8: (BlueStore::KVFinalizeThread::entry()+0xd) [0x564c15e69b8d]
Aug 05 00:33:29 ceph-osd2863: 9: (()+0x7fa3) [0x7f461ff73fa3]
Aug 05 00:33:29 ceph-osd2863: 10: (clone()+0x3f) [0x7f461fb1eeff]

2) different cluster at 15.2.16
backtrace:
0: (()+0x12730) [0x7fe8875d1730]
1: (gsignal()+0x10b) [0x7fe8870b07bb]
2: (abort()+0x121) [0x7fe88709b535]
3: (()+0x2240f) [0x7fe88709b40f]
4: (()+0x30102) [0x7fe8870a9102]
5: (()+0xeb47ca) [0x55e2237177ca]
6: (BlueStore::Onode::put()+0x2b1) [0x55e22372ab81]
7: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::Onode>, boost::intrusive_ptr<BlueStore::Onode>, std::_Identity<boost::intrusive_ptr<BlueStore::Onode> >, std::less<boost::intrusive_ptr<BlueStore::Onode> >, std::allocator<boost::intrusive_ptr<BlueStore::Onode> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::Onode> >*)+0x2d) [0x55e2237d660d]
8: (BlueStore::TransContext::~TransContext()+0x124) [0x55e2237d6754]
9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x24b) [0x55e22377db8b]
10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x234) [0x55e223795744]
11: (BlueStore::_kv_finalize_thread()+0x552) [0x55e2237a03e2]
12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x55e2237dbb8d]
13: (()+0x7fa3) [0x7fe8875c6fa3]
14: (clone()+0x3f) [0x7fe887171eff]

3) 16.2.9
*** Caught signal (Segmentation fault) **
2022-08-02 00:33:00 Ceph04 osd.21 in thread 7f2853f74700 thread_name:tp_osd_tp
2022-08-02 00:33:00 Ceph04 osd.21 ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
2022-08-02 00:33:00 Ceph04 osd.21 1: /lib64/libpthread.so.0(+0x168c0) [0x7f287a1e98c0]
2022-08-02 00:33:00 Ceph04 osd.21 2: (ceph::buffer::v15_2_0::ptr::release()+0xf) [0x55670639336f]
2022-08-02 00:33:00 Ceph04 osd.21 3: (BlueStore::Onode::put()+0x1bc) [0x55670601feac]
2022-08-02 00:33:00 Ceph04 osd.21 4: (std::__detail::_Hashtable_alloc<mempool::pool_allocator<(mempool::pool_index_t)4, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x35) [0x5567060d2365]
2022-08-02 00:33:00 Ceph04 osd.21 5: (std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x53) [0x5567060d27a3]
2022-08-02 00:33:00 Ceph04 osd.21 6: (BlueStore::OnodeSpace::_remove(ghobject_t const&)+0x12c) [0x55670601fb5c]
2022-08-02 00:33:00 Ceph04 osd.21 7: (LruOnodeCacheShard::_trim_to(unsigned long)+0xce) [0x5567060d350e]
2022-08-02 00:33:00 Ceph04 osd.21 8: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>&)+0x152) [0x5567060206a2]
2022-08-02 00:33:00 Ceph04 osd.21 9: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x299) [0x55670607fc39]
2022-08-02 00:33:00 Ceph04 osd.21 10: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1d32) [0x55670608b722]
2022-08-02 00:33:00 Ceph04 osd.21 11: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2fa) [0x5567060a555a]
2022-08-02 00:33:00 Ceph04 osd.21 12: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x556705ce5cf4]
2022-08-02 00:33:00 Ceph04 osd.21 13: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&)+0xa4d) [0x556705eff87d]
2022-08-02 00:33:00 Ceph04 osd.21 14: (ECBackend::try_reads_to_commit()+0x2509) [0x556705f10759]
2022-08-02 00:33:00 Ceph04 osd.21 15: (ECBackend::check_ops()+0x1c) [0x556705f1202c]
2022-08-02 00:33:00 Ceph04 osd.21 16: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply const&, ZTracer::Trace const&)+0xde) [0x556705f1217e]
2022-08-02 00:33:00 Ceph04 osd.21 17: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1cf) [0x556705f17cef]
2022-08-02 00:33:00 Ceph04 osd.21 18: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x87) [0x556705d34117]
2022-08-02 00:33:00 Ceph04 osd.21 19: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x684) [0x556705cd5264]
2022-08-02 00:33:00 Ceph04 osd.21 20: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x159) [0x556705b5ee39]
2022-08-02 00:33:00 Ceph04 osd.21 21: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x67) [0x556705dbaef7]
2022-08-02 00:33:00 Ceph04 osd.21 22: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xcf5) [0x556705b7c625]
2022-08-02 00:33:00 Ceph04 osd.21 23: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x5567061e02ec]
2022-08-02 00:33:00 Ceph04 osd.21 24: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5567061e37b0]
2022-08-02 00:33:00 Ceph04 osd.21 25: /lib64/libpthread.so.0(+0xa6ea) [0x7f287a1dd6ea]
2022-08-02 00:33:00 Ceph04 osd.21 26: clone()

#22 Updated by Igor Fedotov about 2 months ago

4) Quincy case from Telemetry: https://tracker.ceph.com/issues/56382

#23 Updated by Igor Fedotov about 1 month ago

  • Status changed from New to In Progress
