Bug #58215
active mgr crashes with segfault when running 'ceph osd purge'
Status: Closed
Description
I know that Nautilus is out of support already, so maybe this just ends up for posterity but maybe we found something worthwhile to investigate.
Anyway: as we're upgrading some old clusters, we're currently working through Nautilus. We do have a reliable reproducer in our automated test suite, which creates a fresh cluster from a clean installation every time and then repeatedly runs 'ceph osd purge'.
This crashes at https://github.com/flyingcircusio/fc-nixos/blob/PL-131024-ceph-nautilus-packaging/tests/ceph-nautilus.nix#L308
We extracted a traceback:
(gdb) bt
#0  0x00007f98a3b335ba in raise () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#1  0x00005570019a589b in reraise_fatal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x00007f98a3995bd3 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /nix/store/paqfl70z4zxip8lvpsijbspi0y2wzg4i-gcc-10.3.0-lib/lib/libstdc++.so.6
#5  0x0000557001741a40 in std::_Rb_tree_iterator<std::pair<std::pair<long, int> const, store_statfs_t> >::operator++ (this=<synthetic pointer>) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#6  PGMap::apply_incremental (this=this@entry=0x7f988c04fd20, cct=0x5570029490d0, inc=...) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#7  0x00005570017c8f42 in ClusterState::notify_osdmap (this=this@entry=0x7f988c04f988, osd_map=...) at /build/ceph-14.2.22/src/mgr/ClusterState.cc:185
#8  0x00005570018267cd in operator() (pg_map=..., osd_map=..., __closure=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:473
#9  Objecter::with_osdmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)>, const PGMap&> (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/osdc/Objecter.h:2057
#10 ClusterState::with_osdmap_and_pgmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)> > (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/mgr/ClusterState.h:134
#11 Mgr::handle_osd_map (this=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:435
#12 0x0000557001826f18 in Mgr::ms_dispatch (this=this@entry=0x7f988c04f620, m=m@entry=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/Mgr.cc:542
#13 0x0000557001837eab in MgrStandby::ms_dispatch (this=0x7ffca961aba0, m=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/MgrStandby.cc:449
#14 0x00005570018134d7 in Dispatcher::ms_dispatch2 (this=0x7ffca961aba0, m=...) at /build/ceph-14.2.22/src/msg/Dispatcher.h:126
#15 0x00007f98a4bf1e88 in Messenger::ms_deliver_dispatch (this=0x557002a15060, m=...) at /build/ceph-14.2.22/src/msg/Messenger.h:692
#16 0x00007f98a4bed7b7 in DispatchQueue::entry (this=0x557002a153b8) at /build/ceph-14.2.22/src/msg/DispatchQueue.cc:197
#17 0x00007f98a4cc29cd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-14.2.22/src/msg/DispatchQueue.h:102
#18 0x00007f98a4a43c90 in Thread::entry_wrapper (this=0x557002a15550) at /build/ceph-14.2.22/src/common/Thread.cc:84
#19 0x00007f98a3b28e9e in start_thread () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#20 0x00007f98a36af4af in clone () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libc.so.6
I'm adding this ticket because research hasn't turned up any indication of a known crash/bug in this area. I'm going to read through the PGMap code. My first thought was that this sounds like a threading issue, but then I'd expect it to be much harder to reproduce ...
I'll try to see whether I can configure the mgr to run with fewer threads and see whether that helps.
Updated by Christian Theune over 1 year ago
Here's some more logging output. I can increase logging as needed.
Updated by Christian Theune over 1 year ago
So, I went through the affected code. I'm no C/C++ expert, but I had a hunch based on the fact that there's an erase call during iteration, and in Python that's always a warning sign.
I think the iterator needs to be re-seated correctly after the erase, as discussed in https://stackoverflow.com/questions/596162/can-you-remove-elements-from-a-stdlist-while-iterating-through-it
I'm currently playing around with a potential fix in https://github.com/flyingcircusio/ceph/commits/58215-fix-mon-delete-iterator
This might be completely wrong on my side; it looks like a Schrödinger bug and I'm unsure why this isn't broken all the time, so ...
Updated by Christian Theune over 1 year ago
Quick update: I'm not seeing the crash any longer, but there's other stuff going on in our environment and I'm not yet sure whether those are related or not.
Updated by Christian Theune over 1 year ago
As far as I can tell, the fix works properly and the other output I saw is spurious stuff regarding mon v1/v2 compatibility which doesn't seem to be related.
Updated by Christian Theune over 1 year ago
It's also worth noting explicitly that this seems to still be a valid concern even on current versions and main.
Updated by Christian Theune over 1 year ago
Since I added a similar change to the branch mentioned above, here is the specific fix for this specific crash: https://github.com/flyingcircusio/ceph/commit/f8f06e591cd448f9e0be5479462d58345eb4b952
Updated by Radoslaw Zarzynski 10 months ago
- Is duplicate of Bug #58303: active mgr crashes with segfault when running 'ceph osd purge' added