Bug #58215 (closed)
active mgr crashes with segfault when running 'ceph osd purge'
Description
I know that Nautilus is already out of support, so this may just end up here for posterity, but we may have found something worth investigating.
Anyway: as we're upgrading some old clusters, we're currently working through Nautilus. However, we do have a reliable reproducer on a clean installation: our automated test suite creates a fresh cluster on every run and crashes repeatedly.
The crash occurs at https://github.com/flyingcircusio/fc-nixos/blob/PL-131024-ceph-nautilus-packaging/tests/ceph-nautilus.nix#L308
We extracted a traceback:
(gdb) bt
#0 0x00007f98a3b335ba in raise () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#1 0x00005570019a589b in reraise_fatal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:81
#2 handle_fatal_signal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:326
#3 <signal handler called>
#4 0x00007f98a3995bd3 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /nix/store/paqfl70z4zxip8lvpsijbspi0y2wzg4i-gcc-10.3.0-lib/lib/libstdc++.so.6
#5 0x0000557001741a40 in std::_Rb_tree_iterator<std::pair<std::pair<long, int> const, store_statfs_t> >::operator++ (this=<synthetic pointer>) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#6 PGMap::apply_incremental (this=this@entry=0x7f988c04fd20, cct=0x5570029490d0, inc=...) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#7 0x00005570017c8f42 in ClusterState::notify_osdmap (this=this@entry=0x7f988c04f988, osd_map=...) at /build/ceph-14.2.22/src/mgr/ClusterState.cc:185
#8 0x00005570018267cd in operator() (pg_map=..., osd_map=..., __closure=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:473
#9 Objecter::with_osdmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)>, const PGMap&> (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/osdc/Objecter.h:2057
#10 ClusterState::with_osdmap_and_pgmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)> > (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/mgr/ClusterState.h:134
#11 Mgr::handle_osd_map (this=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:435
#12 0x0000557001826f18 in Mgr::ms_dispatch (this=this@entry=0x7f988c04f620, m=m@entry=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/Mgr.cc:542
#13 0x0000557001837eab in MgrStandby::ms_dispatch (this=0x7ffca961aba0, m=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/MgrStandby.cc:449
#14 0x00005570018134d7 in Dispatcher::ms_dispatch2 (this=0x7ffca961aba0, m=...) at /build/ceph-14.2.22/src/msg/Dispatcher.h:126
#15 0x00007f98a4bf1e88 in Messenger::ms_deliver_dispatch (this=0x557002a15060, m=...) at /build/ceph-14.2.22/src/msg/Messenger.h:692
#16 0x00007f98a4bed7b7 in DispatchQueue::entry (this=0x557002a153b8) at /build/ceph-14.2.22/src/msg/DispatchQueue.cc:197
#17 0x00007f98a4cc29cd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-14.2.22/src/msg/DispatchQueue.h:102
#18 0x00007f98a4a43c90 in Thread::entry_wrapper (this=0x557002a15550) at /build/ceph-14.2.22/src/common/Thread.cc:84
#19 0x00007f98a3b28e9e in start_thread () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#20 0x00007f98a36af4af in clone () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libc.so.6
I'm adding this ticket because our research hasn't turned up any indication of a known crash or bug in this area. I'm going to read through the PGMap code. To me this sounds like it could be a threading issue, although I'd expect a threading issue to be much harder to reproduce ...
I'll try to see whether I can configure the mgr to run with fewer threads and see whether that helps.
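For reference while reading the PGMap code: frames #4 to #6 fault inside operator++ on a std::map iterator. Besides a data race, a common single-threaded cause of a crash in std::_Rb_tree_increment is incrementing an iterator whose element was erased in the same loop, which would also fit a crash that reproduces every time. Whether PGMap.cc:1208 actually does this is not confirmed here. The sketch below only illustrates the safe erase-while-iterating pattern; StatfsStub, PoolStatfsMap, erase_osd_entries and purge_demo are hypothetical names, and reading the std::pair<long, int> key as (pool id, osd id) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical stand-in for store_statfs_t; only the map shape matters here.
struct StatfsStub {
  std::int64_t allocated = 0;
};

// Mirrors the std::pair<long, int> key from frame #5.
// Assumption: the key is (pool id, osd id).
using PoolStatfsMap = std::map<std::pair<std::int64_t, int>, StatfsStub>;

// Remove every entry that belongs to `osd` and return how many were removed.
// std::map::erase(iterator) returns the iterator following the erased element,
// so the loop never increments an iterator that points at a freed node.
std::size_t erase_osd_entries(PoolStatfsMap& m, int osd) {
  std::size_t removed = 0;
  for (auto it = m.begin(); it != m.end();) {
    if (it->first.second == osd) {
      it = m.erase(it);  // advance via the return value, not ++it
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}

// Small demonstration: two pools spread over OSDs 0 and 1, then OSD 1 is purged.
// Returns the number of entries left in the map.
std::size_t purge_demo() {
  PoolStatfsMap m;
  m[{1, 0}].allocated = 10;
  m[{1, 1}].allocated = 20;
  m[{2, 0}].allocated = 30;
  m[{2, 1}].allocated = 40;
  std::size_t removed = erase_osd_entries(m, 1);
  assert(removed == 2);
  return m.size();
}
```

The unsafe variant would call m.erase(it) inside a for loop that still runs ++it in its header; that is undefined behaviour and typically shows up as exactly this kind of fault in _Rb_tree_increment.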