Bug #58215
active mgr crashes with segfault when running 'ceph osd purge'
Status: Closed
Description
I know that Nautilus is out of support already, so maybe this just ends up for posterity but maybe we found something worthwhile to investigate.
Anyway: as we're upgrading some old clusters, we're currently working through Nautilus. We do have a reliable reproducer in our automated test suite, which creates a fresh cluster from a clean installation every time and then repeatedly runs 'ceph osd purge'.
This crashes at https://github.com/flyingcircusio/fc-nixos/blob/PL-131024-ceph-nautilus-packaging/tests/ceph-nautilus.nix#L308
We extracted a traceback:
(gdb) bt
#0  0x00007f98a3b335ba in raise () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#1  0x00005570019a589b in reraise_fatal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x00007f98a3995bd3 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /nix/store/paqfl70z4zxip8lvpsijbspi0y2wzg4i-gcc-10.3.0-lib/lib/libstdc++.so.6
#5  0x0000557001741a40 in std::_Rb_tree_iterator<std::pair<std::pair<long, int> const, store_statfs_t> >::operator++ (this=<synthetic pointer>) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#6  PGMap::apply_incremental (this=this@entry=0x7f988c04fd20, cct=0x5570029490d0, inc=...) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#7  0x00005570017c8f42 in ClusterState::notify_osdmap (this=this@entry=0x7f988c04f988, osd_map=...) at /build/ceph-14.2.22/src/mgr/ClusterState.cc:185
#8  0x00005570018267cd in operator() (pg_map=..., osd_map=..., __closure=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:473
#9  Objecter::with_osdmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)>, const PGMap&> (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/osdc/Objecter.h:2057
#10 ClusterState::with_osdmap_and_pgmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)> > (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/mgr/ClusterState.h:134
#11 Mgr::handle_osd_map (this=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:435
#12 0x0000557001826f18 in Mgr::ms_dispatch (this=this@entry=0x7f988c04f620, m=m@entry=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/Mgr.cc:542
#13 0x0000557001837eab in MgrStandby::ms_dispatch (this=0x7ffca961aba0, m=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/MgrStandby.cc:449
#14 0x00005570018134d7 in Dispatcher::ms_dispatch2 (this=0x7ffca961aba0, m=...) at /build/ceph-14.2.22/src/msg/Dispatcher.h:126
#15 0x00007f98a4bf1e88 in Messenger::ms_deliver_dispatch (this=0x557002a15060, m=...) at /build/ceph-14.2.22/src/msg/Messenger.h:692
#16 0x00007f98a4bed7b7 in DispatchQueue::entry (this=0x557002a153b8) at /build/ceph-14.2.22/src/msg/DispatchQueue.cc:197
#17 0x00007f98a4cc29cd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-14.2.22/src/msg/DispatchQueue.h:102
#18 0x00007f98a4a43c90 in Thread::entry_wrapper (this=0x557002a15550) at /build/ceph-14.2.22/src/common/Thread.cc:84
#19 0x00007f98a3b28e9e in start_thread () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#20 0x00007f98a36af4af in clone () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libc.so.6
I'm adding this ticket because research hasn't turned up any indication of a known crash/bug in this area. I'm going to read through the PGMap code. My first thought was that this sounds like a threading issue, but then I'd expect it to be much harder to reproduce ...
I'll try to see whether I can configure the mgr to run with fewer threads and see whether that helps.
Updated by Christian Theune over 1 year ago
Here's some more logging output. I can increase logging as needed.
Updated by Christian Theune over 1 year ago
So, I went through the affected code. I'm no C/C++ expert, but I had a hunch based on the fact that there's an erase call during iteration, and in Python that's always a warning sign.
I think the iterator needs to be re-seated correctly after the erase, as discussed in https://stackoverflow.com/questions/596162/can-you-remove-elements-from-a-stdlist-while-iterating-through-it
I'm currently playing around with a potential fix in https://github.com/flyingcircusio/ceph/commits/58215-fix-mon-delete-iterator
This might be completely wrong on my side; it looks like a Schrödinger bug and I'm unsure why this isn't broken all the time, so ...
Updated by Christian Theune over 1 year ago
Quick update: I'm not seeing the crash any longer, but there's other stuff going on in our environment and I'm not yet sure whether those are related or not.
Updated by Christian Theune over 1 year ago
As far as I can tell, the fix works properly and the other output I saw is spurious stuff regarding mon v1/v2 compatibility which doesn't seem to be related.
Updated by Christian Theune over 1 year ago
It's also worth noting explicitly that this seems to still be a valid concern even on current versions and main.
Updated by Christian Theune over 1 year ago
Since I added a similar change to the branch mentioned above, here is the specific fix for this specific crash: https://github.com/flyingcircusio/ceph/commit/f8f06e591cd448f9e0be5479462d58345eb4b952
Updated by Radoslaw Zarzynski 10 months ago
- Is duplicate of Bug #58303: active mgr crashes with segfault when running 'ceph osd purge' added