Bug #58215: active mgr crashes with segfault when running 'ceph osd purge' - Ceph - Ceph

Bug #58215

closed

active mgr crashes with segfault when running 'ceph osd purge'

Added by Christian Theune over 1 year ago. Updated 10 months ago.

Status: Duplicate

Priority: Normal

Source: Community (user)

Severity: 2 - major

Affected Versions: v14.2.22

Description

I know that Nautilus is already out of support, so this may just end up here for posterity, but we may have found something worth investigating.

Anyway: we're upgrading some old clusters and are currently working through Nautilus. We have a reliable reproducer on a clean installation in our automated test suite, which creates a fresh cluster every time and repeatedly runs into this crash:

https://github.com/flyingcircusio/fc-nixos/blob/PL-131024-ceph-nautilus-packaging/tests/ceph-nautilus.nix

This crashes at https://github.com/flyingcircusio/fc-nixos/blob/PL-131024-ceph-nautilus-packaging/tests/ceph-nautilus.nix#L308

We extracted a traceback:

(gdb) bt
#0  0x00007f98a3b335ba in raise () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#1  0x00005570019a589b in reraise_fatal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /build/ceph-14.2.22/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x00007f98a3995bd3 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /nix/store/paqfl70z4zxip8lvpsijbspi0y2wzg4i-gcc-10.3.0-lib/lib/libstdc++.so.6
#5  0x0000557001741a40 in std::_Rb_tree_iterator<std::pair<std::pair<long, int> const, store_statfs_t> >::operator++ (this=<synthetic pointer>) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#6  PGMap::apply_incremental (this=this@entry=0x7f988c04fd20, cct=0x5570029490d0, inc=...) at /build/ceph-14.2.22/src/mon/PGMap.cc:1208
#7  0x00005570017c8f42 in ClusterState::notify_osdmap (this=this@entry=0x7f988c04f988, osd_map=...) at /build/ceph-14.2.22/src/mgr/ClusterState.cc:185
#8  0x00005570018267cd in operator() (pg_map=..., osd_map=..., __closure=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:473
#9  Objecter::with_osdmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)>, const PGMap&> (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/osdc/Objecter.h:2057
#10 ClusterState::with_osdmap_and_pgmap<Mgr::handle_osd_map()::<lambda(const OSDMap&, const PGMap&)> > (cb=..., this=<optimized out>) at /build/ceph-14.2.22/src/mgr/ClusterState.h:134
#11 Mgr::handle_osd_map (this=<optimized out>) at /build/ceph-14.2.22/src/mgr/Mgr.cc:435
#12 0x0000557001826f18 in Mgr::ms_dispatch (this=this@entry=0x7f988c04f620, m=m@entry=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/Mgr.cc:542
#13 0x0000557001837eab in MgrStandby::ms_dispatch (this=0x7ffca961aba0, m=0x7f989404b720) at /build/ceph-14.2.22/src/mgr/MgrStandby.cc:449
#14 0x00005570018134d7 in Dispatcher::ms_dispatch2 (this=0x7ffca961aba0, m=...) at /build/ceph-14.2.22/src/msg/Dispatcher.h:126
#15 0x00007f98a4bf1e88 in Messenger::ms_deliver_dispatch (this=0x557002a15060, m=...) at /build/ceph-14.2.22/src/msg/Messenger.h:692
#16 0x00007f98a4bed7b7 in DispatchQueue::entry (this=0x557002a153b8) at /build/ceph-14.2.22/src/msg/DispatchQueue.cc:197
#17 0x00007f98a4cc29cd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-14.2.22/src/msg/DispatchQueue.h:102
#18 0x00007f98a4a43c90 in Thread::entry_wrapper (this=0x557002a15550) at /build/ceph-14.2.22/src/common/Thread.cc:84
#19 0x00007f98a3b28e9e in start_thread () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libpthread.so.0
#20 0x00007f98a36af4af in clone () from /nix/store/nprym6lf8lzhp1irb42lb4vp8069l5rj-glibc-2.32-54/lib/libc.so.6

I'm filing this ticket because my research hasn't turned up any known crash or bug in this area. I'm going to read through the PGMap code. This looks like it could be a threading issue, but in that case I'd expect it to be much harder to reproduce ...
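A note for anyone reading frames #4 to #6: a segfault inside std::_Rb_tree_increment generally means that an iterator into a std::map was advanced after it had become invalid. A concurrent modification from another thread is one way to get there. Erasing the current element and then incrementing the same iterator is another, and it needs no second thread, which would fit a crash that reproduces this reliably. Whether either of these is what happens at PGMap.cc:1208 is not confirmed here. The sketch below is not the PGMap code: StatfsStub, PoolStatfsMap and erase_osd_entries are hypothetical names, and only the key shape (std::pair<long, int>) is taken from the backtrace.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical stand-in for store_statfs_t; only the map shape matters here.
struct StatfsStub {
    std::uint64_t allocated = 0;
};

// Same key shape as in frame #5. Reading it as (pool id, osd id) is an
// assumption made for this illustration.
using PoolStatfsMap = std::map<std::pair<std::int64_t, int>, StatfsStub>;

// Broken variant (undefined behaviour, may crash in _Rb_tree_increment):
//   for (auto it = m.begin(); it != m.end(); ++it)
//       if (it->first.second == osd) m.erase(it);
//
// Safe variant: map::erase(iterator) returns the iterator following the
// erased element, so the loop never increments an iterator whose node has
// already been freed. Returns the number of removed entries.
std::size_t erase_osd_entries(PoolStatfsMap& m, int osd) {
    std::size_t removed = 0;
    for (auto it = m.begin(); it != m.end(); ) {
        if (it->first.second == osd) {
            it = m.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

If the loop in PGMap::apply_incremental turns out to follow the safe pattern, that would point back towards the threading theory.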

I'll try to see whether I can configure the mgr to run with fewer threads and see whether that helps.

Files

crash.log (33.6 KB)

Christian Theune, 12/08/2022 07:40 AM

Related issues 1 (0 open — 1 closed)

Updated by Christian Theune over 1 year ago

Updated by Radoslaw Zarzynski 10 months ago

Updated by Matan Breizman 10 months ago