Bug #58215 (closed): active mgr crashes with segfault when running 'ceph osd purge'

Added by Christian Theune over 1 year ago. Updated 10 months ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I know that Nautilus is already out of support, so maybe this just ends up here for posterity, but we may have found something worthwhile to investigate.

Anyway: as we're upgrading some old clusters, we're currently working through Nautilus. We have a reliable reproducer within our automated test suite, which creates a fresh cluster on a clean installation every time and hits the crash repeatedly:

https://github.com/flyingcircusio/fc-nixos/blob/PL-131024-ceph-nautilus-packaging/tests/ceph-nautilus.nix

This crashes at https://github.com/flyingcircusio/fc-nixos/blob/PL-131024-ceph-nautilus-packaging/tests/ceph-nautilus.nix#L308

We extracted a traceback:

(gdb) bt
#0  0x00007f98a3b335ba in raise () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#1  0x00005570019a589b in reraise_fatal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x00007f98a3995bd3 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /nix/store/paqfl70z4zxip8lvpsijbspi0y2wzg4i-gcc-10.3.0-lib/lib/libstdc++.so.6
#5  0x0000557001741a40 in std::_Rb_tree_iterator<std::pair<std::pair<long, int> const, store_statfs_t> >::operator++ (this=<synthetic pointer>) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#6  PGMap::apply_incremental (this=this@entry=0x7f988c04fd20, cct=0x5570029490d0, inc=...) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#7  0x00005570017c8f42 in ClusterState::notify_osdmap (this=this@entry=0x7f988c04f988, osd_map=...) at /build/ceph-14.2.22/src/mgr/ClusterState.cc:185
#8  0x00005570018267cd in operator() (pg_map=..., osd_map=..., __closure=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:473
#9  Objecter::with_osdmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)>, const PGMap&> (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/osdc/Objecter.h:2057
#10 ClusterState::with_osdmap_and_pgmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)> > (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/mgr/ClusterState.h:134
#11 Mgr::handle_osd_map (this=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:435
#12 0x0000557001826f18 in Mgr::ms_dispatch (this=this@entry=0x7f988c04f620, m=m@entry=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/Mgr.cc:542
#13 0x0000557001837eab in MgrStandby::ms_dispatch (this=0x7ffca961aba0, m=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/MgrStandby.cc:449
#14 0x00005570018134d7 in Dispatcher::ms_dispatch2 (this=0x7ffca961aba0, m=...) at /build/ceph-14.2.22/src/msg/Dispatcher.h:126
#15 0x00007f98a4bf1e88 in Messenger::ms_deliver_dispatch (this=0x557002a15060, m=...) at /build/ceph-14.2.22/src/msg/Messenger.h:692
#16 0x00007f98a4bed7b7 in DispatchQueue::entry (this=0x557002a153b8) at /build/ceph-14.2.22/src/msg/DispatchQueue.cc:197
#17 0x00007f98a4cc29cd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-14.2.22/src/msg/DispatchQueue.h:102
#18 0x00007f98a4a43c90 in Thread::entry_wrapper (this=0x557002a15550) at /build/ceph-14.2.22/src/common/Thread.cc:84
#19 0x00007f98a3b28e9e in start_thread () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#20 0x00007f98a36af4af in clone () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libc.so.6

I'm filing this ticket because research hasn't turned up any indication of a known crash/bug in this area. I'm going to read through the PGMap code. My first guess was a threading issue, but in that case I'd expect it to be much harder to reproduce ...
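
For reference, the std::_Rb_tree_increment frame is what one typically sees when a std::map/std::set iterator is advanced after it has been invalidated, whether by concurrent modification from another thread or by erasing the element it points to. A minimal, self-contained sketch (not Ceph code, just an illustration of that failure class):

// Minimal illustration (not Ceph code): advancing a std::map iterator after
// erasing the element it points to is undefined behavior and often crashes
// inside std::_Rb_tree_increment, matching the backtrace above.
#include <cstdio>
#include <map>

int main() {
  std::map<int, int> m{{1, 10}, {2, 20}, {3, 30}};
  for (auto it = m.begin(); it != m.end(); ++it) {  // ++it may run on an erased iterator: UB
    if (it->first == 2)
      m.erase(it);  // 'it' is invalidated here
  }
  std::printf("size=%zu\n", m.size());  // may crash, loop forever, or print garbage
  return 0;
}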

I'll try to see whether I can configure the mgr to run with fewer threads and whether that helps.


Files

crash.log (33.6 KB), uploaded by Christian Theune, 12/08/2022 07:40 AM

Related issues (1 total: 0 open, 1 closed)

Is duplicate of Ceph - Bug #58303: active mgr crashes with segfault when running 'ceph osd purge' (Resolved, Christian Theune)

Actions #1

Updated by Christian Theune over 1 year ago

Here's some more logging output. I can increase logging as needed.

Actions #2

Updated by Christian Theune over 1 year ago

So, I went through the affected code. I'm no C/C++ expert, but I had a hunch based on the fact that there's an erase() call during iteration, and in Python that's always a warning sign.

I think the iterator needs to be updated correctly, as discussed in https://stackoverflow.com/questions/596162/can-you-remove-elements-from-a-stdlist-while-iterating-through-it (see the sketch below).
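
As a generic sketch of the idiom from that discussion (not the actual PGMap code): since C++11, erase() on associative containers returns an iterator to the next element, so the loop should advance through that return value instead of incrementing an erased iterator:

// Generic sketch of the erase-while-iterating idiom (not the actual PGMap code).
#include <map>

void prune_zero_values(std::map<int, int>& m) {
  for (auto it = m.begin(); it != m.end(); /* no ++it here */) {
    if (it->second == 0)
      it = m.erase(it);  // erase() returns the next valid iterator (C++11)
    else
      ++it;
  }
}

int main() {
  std::map<int, int> m{{1, 0}, {2, 5}, {3, 0}};
  prune_zero_values(m);
  return static_cast<int>(m.size());  // 1
}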

I'm currently playing around with a potential fix in https://github.com/flyingcircusio/ceph/commits/58215-fix-mon-delete-iterator

This might be completely wrong on my part; it looks like a Schrödinger bug and I'm unsure why it isn't broken all the time, so ...

Actions #3

Updated by Christian Theune over 1 year ago

Quick update: I'm not seeing the crash any longer, but there is other stuff going on in our environment and I'm not yet sure whether it is related.

Actions #4

Updated by Christian Theune over 1 year ago

As far as I can tell, the fix works properly, and the other output I saw is spurious noise regarding mon v1/v2 compatibility that doesn't seem to be related.

Actions #5

Updated by Christian Theune over 1 year ago

It's also worth noting explicitly that this still seems to be a valid concern, even on current versions and main.

Actions #6

Updated by Christian Theune over 1 year ago

Since I also added a similar change to the branch mentioned above, here is the specific fix for this specific crash: https://github.com/flyingcircusio/ceph/commit/f8f06e591cd448f9e0be5479462d58345eb4b952

Actions #7

Updated by Radoslaw Zarzynski 10 months ago

  • Is duplicate of Bug #58303: active mgr crashes with segfault when running 'ceph osd purge' added
Actions #8

Updated by Matan Breizman 10 months ago

  • Status changed from New to Duplicate