Bug #58215


active mgr crashes with segfault when running 'ceph osd purge'

Added by Christian Theune over 1 year ago. Updated 10 months ago.

Source: Community (user)
Severity: 2 - major


I know that Nautilus is already out of support, so maybe this just ends up here for posterity, but maybe we found something worthwhile to investigate.

Anyway: as we're upgrading some old clusters, we're currently working through Nautilus. We have a reliable reproducer on a clean installation within our automated test suite, which creates a fresh cluster every time and repeatedly runs 'ceph osd purge'.

This crashes reliably. We extracted a traceback:

(gdb) bt
#0  0x00007f98a3b335ba in raise () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/
#1  0x00005570019a589b in reraise_fatal (signum=11) at /build/ceph-14.2.22/src/global/
#2  handle_fatal_signal (signum=11) at /build/ceph-14.2.22/src/global/
#3  <signal handler called>
#4  0x00007f98a3995bd3 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /nix/store/paqfl70z4zxip8lvpsijbspi0y2wzg4i-gcc-10.3.0-lib/lib/
#5  0x0000557001741a40 in std::_Rb_tree_iterator<std::pair<std::pair<long, int> const, store_statfs_t> >::operator++ (this=<synthetic pointer>) at /build/ceph-14.2.22/src/mon/
#6  PGMap::apply_incremental (this=this@entry=0x7f988c04fd20, cct=0x5570029490d0, inc=...) at /build/ceph-14.2.22/src/mon/
#7  0x00005570017c8f42 in ClusterState::notify_osdmap (this=this@entry=0x7f988c04f988, osd_map=...) at /build/ceph-14.2.22/src/mgr/
#8  0x00005570018267cd in operator() (pg_map=..., osd_map=..., __closure=<optimized out>) at /build/ceph-14.2.22/src/mgr/
#9  Objecter::with_osdmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)>, const PGMap&> (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/osdc/Objecter.h:2057
#10 ClusterState::with_osdmap_and_pgmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)> > (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/mgr/ClusterState.h:134
#11 Mgr::handle_osd_map (this=<optimized out>) at /build/ceph-14.2.22/src/mgr/
#12 0x0000557001826f18 in Mgr::ms_dispatch (this=this@entry=0x7f988c04f620, m=m@entry=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/
#13 0x0000557001837eab in MgrStandby::ms_dispatch (this=0x7ffca961aba0, m=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/
#14 0x00005570018134d7 in Dispatcher::ms_dispatch2 (this=0x7ffca961aba0, m=...) at /build/ceph-14.2.22/src/msg/Dispatcher.h:126
#15 0x00007f98a4bf1e88 in Messenger::ms_deliver_dispatch (this=0x557002a15060, m=...) at /build/ceph-14.2.22/src/msg/Messenger.h:692
#16 0x00007f98a4bed7b7 in DispatchQueue::entry (this=0x557002a153b8) at /build/ceph-14.2.22/src/msg/
#17 0x00007f98a4cc29cd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-14.2.22/src/msg/DispatchQueue.h:102
#18 0x00007f98a4a43c90 in Thread::entry_wrapper (this=0x557002a15550) at /build/ceph-14.2.22/src/common/
#19 0x00007f98a3b28e9e in start_thread () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/
#20 0x00007f98a36af4af in clone () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/

I'm adding this ticket because research hasn't turned up any indication of a known crash/bug in this area. I'm going to read through the PGMap code. My first instinct was that this sounds like a threading issue, but then I'd expect it to be much harder to reproduce ...

I'll try to see whether I can configure the mgr to run with fewer threads and see whether that helps.


crash.log (33.6 KB) - Christian Theune, 12/08/2022 07:40 AM

Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #58303: active mgr crashes with segfault when running 'ceph osd purge' (Resolved, Christian Theune)

Actions #1

Updated by Christian Theune over 1 year ago

Here's some more logging output. I can increase logging as needed.

Actions #2

Updated by Christian Theune over 1 year ago

So, I went through the affected code. I'm no C/C++ expert, but I had a hunch based on the fact that there's an erase() call while iterating over the map, and in Python that's always a warning sign.

I think the iterator needs to be updated correctly, as discussed in

I'm currently playing around with a potential fix in

This might be completely wrong on my side, as it looks like a Schrödinger bug and I'm unsure why this isn't broken all the time, so ...

Actions #3

Updated by Christian Theune over 1 year ago

Quick update: I'm not seeing the crash any longer, but there's other stuff going on in our environment and I'm not yet sure whether it's related.

Actions #4

Updated by Christian Theune over 1 year ago

As far as I can tell, the fix works properly and the other output I saw is spurious stuff regarding mon v1/v2 compatibility which doesn't seem to be related.

Actions #5

Updated by Christian Theune over 1 year ago

It's also worth noting explicitly that this still seems to be a valid concern even on current versions and main.

Actions #6

Updated by Christian Theune over 1 year ago

Since I added a similar change to the branch mentioned above, here is the specific fix for this specific crash:

Actions #7

Updated by Radoslaw Zarzynski 10 months ago

  • Is duplicate of Bug #58303: active mgr crashes with segfault when running 'ceph osd purge' added

Actions #8

Updated by Matan Breizman 10 months ago

  • Status changed from New to Duplicate
