Bug #48787

open

ceph-mgr segfault

Added by Jeff Layton over 3 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I did a build today based on current master (commit b72feb781ff6be5007963cf3cb1e94e677a19c2b), and when I tried to use vstart, ceph-mgr crashed a bit after starting. I ran it under gdb and got this. Here's the last bit of the log + the stack trace:

2021-01-07T11:46:20.615-0500 7fffec84c1c0 10 mgr[py] Computed sys.path '/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/home/jlayton/git/ceph/src/pybind/mgr:/usr/local/lib64/python3.9/site-packages:/usr/local/lib/python3.9/site-packages:/usr/lib64/python3.9/site-packages:/usr/lib/python3.9/site-packages'
2021-01-07T11:46:20.628-0500 7fffec84c1c0  4 mgr[py] load_subclass_of: found class: 'iostat.Module'
2021-01-07T11:46:20.628-0500 7fffec84c1c0 20 mgr[py] loaded command iostat
2021-01-07T11:46:20.628-0500 7fffec84c1c0 10 mgr[py] loaded 1 commands
2021-01-07T11:46:20.629-0500 7fffec84c1c0 20 mgr[py] loaded module option log_level
2021-01-07T11:46:20.629-0500 7fffec84c1c0 20 mgr[py] loaded module option log_to_file
2021-01-07T11:46:20.629-0500 7fffec84c1c0 20 mgr[py] loaded module option log_to_cluster
2021-01-07T11:46:20.629-0500 7fffec84c1c0 20 mgr[py] loaded module option log_to_cluster_level
2021-01-07T11:46:20.629-0500 7fffec84c1c0 10 mgr[py] loaded 4 options
2021-01-07T11:46:20.629-0500 7fffec84c1c0  4 mgr[py] Standby mode not provided by module 'iostat'
2021-01-07T11:46:20.629-0500 7fffec84c1c0  1 mgr[py] Loading python module 'k8sevents'
2021-01-07T11:46:20.647-0500 7fffec84c1c0 10 mgr[py] Computed sys.path '/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/home/jlayton/git/ceph/src/pybind/mgr:/usr/local/lib64/python3.9/site-packages:/usr/local/lib/python3.9/site-packages:/usr/lib64/python3.9/site-packages:/usr/lib/python3.9/site-packages'
/builddir/build/BUILD/Python-3.9.1/Modules/_decimal/libmpdec/context.c:56: warning: mpd_setminalloc: ignoring request to set MPD_MINALLOC a second time

2021-01-07T11:46:20.947-0500 7fffec84c1c0 -1 mgr[py] Module not found: 'k8sevents'

Thread 1 "ceph-mgr" received signal SIGSEGV, Segmentation fault.
0x00007fffd199d0d8 in PyArray_Item_INCREF ()
   from /usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-4.fc33.x86_64 expat-2.2.8-3.fc33.x86_64 flexiblas-netlib-3.0.4-1.fc33.x86_64 flexiblas-openblas-openmp-3.0.4-1.fc33.x86_64 fmt-7.0.3-1.fc33.x86_64 gperftools-libs-2.8-3.fc33.x86_64 libblkid-2.36-3.fc33.x86_64 libffi-3.1-26.fc33.x86_64 libgcc-10.2.1-9.fc33.x86_64 libgfortran-10.2.1-9.fc33.x86_64 libgomp-10.2.1-9.fc33.x86_64 libibverbs-32.0-1.fc33.x86_64 libnl3-3.5.0-5.fc33.x86_64 libquadmath-10.2.1-9.fc33.x86_64 librdmacm-32.0-1.fc33.x86_64 libstdc++-10.2.1-9.fc33.x86_64 libunwind-1.4.0-4.fc33.x86_64 libuuid-2.36-3.fc33.x86_64 libxcrypt-4.4.17-1.fc33.x86_64 libyaml-0.2.5-3.fc33.x86_64 lttng-ust-2.12.0-3.fc33.x86_64 numactl-libs-2.0.14-1.fc33.x86_64 openblas-openmp-0.3.12-1.fc33.x86_64 openssl-libs-1.1.1i-1.fc33.x86_64 python3-Bottleneck-1.2.1-16.fc33.x86_64 python3-bcrypt-3.1.7-6.fc33.x86_64 python3-cephfs-15.2.8-1.fc33.x [...] -246.7-2.fc33.x86_64 userspace-rcu-0.12.1-2.fc33.x86_64 xz-libs-5.2.5-3.fc33.x86_64 zlib-1.2.11-23.fc33.x86_64
(gdb) bt
#0  0x00007fffd199d0d8 in PyArray_Item_INCREF () from /usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so
#1  0x00007fffd19a4d59 in PyArray_FromScalar () from /usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so
#2  0x00007fffd19a52de in gentype_nonzero_number.lto_priv () from /usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so
#3  0x00007fffed5521da in PyObject_IsTrue.part.0 () from /lib64/libpython3.9.so.1.0
#4  0x00007fffed5424c3 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#5  0x00007fffed54a50b in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#6  0x00007fffed541c45 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#7  0x00007fffed53c199 in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#8  0x00007fffed54a256 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#9  0x00007fffed553074 in method_vectorcall () from /lib64/libpython3.9.so.1.0
#10 0x00007fffed53def5 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#11 0x00007fffed53c199 in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#12 0x00007fffed54a256 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#13 0x00007fffed54544a in _PyObject_FastCallDictTstate () from /lib64/libpython3.9.so.1.0
#14 0x00007fffed5519e2 in slot_tp_init () from /lib64/libpython3.9.so.1.0
#15 0x00007fffed545bb3 in type_call () from /lib64/libpython3.9.so.1.0
#16 0x00007fffed5459cb in _PyObject_MakeTpCall () from /lib64/libpython3.9.so.1.0
#17 0x00007fffed542b8a in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#18 0x00007fffed53bbe1 in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#19 0x00007fffed54a256 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#20 0x00007fffed5bbd24 in _PyObject_VectorcallTstate.lto_priv.5 () from /lib64/libpython3.9.so.1.0
#21 0x00007fffed54d3c4 in _PyObject_CallFunctionVa () from /lib64/libpython3.9.so.1.0
#22 0x00007fffed4f6548 in PyEval_CallFunction () from /lib64/libpython3.9.so.1.0
#23 0x00005555561fead7 in boost::python::call<boost::python::api::object, boost::python::handle<_object>, boost::python::handle<_object>, boost::python::handle<_object> > (callable=0x7fff4bd984c0, 
    a0=..., a1=..., a2=...) at /home/jlayton/git/ceph/build/boost/include/boost/python/call.hpp:62
#24 0x00005555561fdce4 in boost::python::api::object_operators<boost::python::api::object>::operator()<boost::python::handle<_object>, boost::python::handle<_object>, boost::python::handle<_object> > (this=0x7fffffff9728, a0=..., a1=..., a2=...) at /home/jlayton/git/ceph/build/boost/include/boost/python/object_call.hpp:19
#25 0x00005555561f5df6 in handle_pyerror[abi:cxx11]() () at /home/jlayton/git/ceph/src/mgr/PyModule.cc:72
#26 0x00005555561fbcb9 in PyModule::load_subclass_of (this=0x555557d3ec70, base_class=0x555556527be2 "MgrModule", py_class=0x555557d3ed70) at /home/jlayton/git/ceph/src/mgr/PyModule.cc:655
#27 0x00005555561f8368 in PyModule::load (this=0x555557d3ec70, pMainThreadState=0x5555570bd7a0) at /home/jlayton/git/ceph/src/mgr/PyModule.cc:337
#28 0x00005555562028a8 in PyModuleRegistry::init (this=0x7fffffffcd20) at /home/jlayton/git/ceph/src/mgr/PyModuleRegistry.cc:86
#29 0x00005555561c9152 in MgrStandby::init (this=0x7fffffffa190) at /home/jlayton/git/ceph/src/mgr/MgrStandby.cc:186
#30 0x0000555555fa0ea3 in main (argc=6, argv=0x7fffffffd138) at /home/jlayton/git/ceph/src/ceph_mgr.cc:70

This build is on Fedora 33 (f33), so it's possible the bug is somewhere in a dependent library. I can reproduce this at will, however.
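To make it easier to see where the crash crosses from ceph-mgr into libpython and then into numpy, the frames of a backtrace like the one above can be grouped by the shared object or source file they come from. The following is only a rough sketch: the helper name `summarize_backtrace` and the regular expression are illustrative assumptions, not part of Ceph or gdb, and frames that gdb wraps across several lines (such as #23 and #24 above) are simply skipped.

```python
import os
import re

# Matches single-line gdb frames such as
#   "#0  0x... in func () from /path/lib.so"
#   "#26 0x... in func (args) at /path/file.cc:655"
# Hypothetical pattern written for this report's output format.
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+"
    r"(?:0x[0-9a-fA-F]+ in )?"
    r"(?P<func>.+?) \(.*\)\s+"
    r"(?:from (?P<lib>\S+)|at (?P<src>\S+))\s*$"
)


def summarize_backtrace(text):
    """Group consecutive frames by origin.

    Returns a list of (origin, first_frame, last_frame) tuples, where
    origin is the basename of the shared object or source file.
    """
    runs = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if not m:
            continue  # continuation lines and multi-line frames are skipped
        if m.group("lib"):
            origin = os.path.basename(m.group("lib"))
        else:
            # Drop the trailing ":<line>" from "file.cc:72"
            origin = os.path.basename(m.group("src").rsplit(":", 1)[0])
        num = int(m.group("num"))
        if runs and runs[-1][0] == origin:
            runs[-1][2] = num  # extend the current run
        else:
            runs.append([origin, num, num])
    return [tuple(r) for r in runs]
```

Run against the trace in this report, a summary of this kind should show numpy frames at the top, a long run of libpython frames below them, and the ceph-mgr sources (PyModule.cc and so on) at the bottom.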


Related issues 1 (1 open, 0 closed)

Related to mgr - Bug #45574: subinterpreters: ceph/mgr/rook RuntimeError on import of RookOrchestrator - ceph cluster does not start (New)

Actions #1

Updated by Jeff Layton over 3 years ago

The stack trace looks really similar to this issue. Could ceph-mgr be doing the same unsupported activity?

https://github.com/numpy/numpy/issues/7595

Actions #2

Updated by Sebastian Wagner over 3 years ago

looks like numpy doesn't like subinterpreters: https://github.com/numpy/numpy/issues/7595#issuecomment-270559663

Actions #3

Updated by Josh Durgin over 3 years ago

The only module using numpy is diskprediction_local - checking if we absolutely need it.
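One way to check which mgr modules import numpy directly is to walk the module tree and inspect the import statements. The sketch below is only an illustration: the function name `modules_importing` is hypothetical, and the assumption that each mgr module lives in its own directory under `src/pybind/mgr` is taken from the `sys.path` shown in the log above. It only finds direct, absolute imports; a module that pulls numpy in through a third-party dependency would not be caught.

```python
import ast
from pathlib import Path


def modules_importing(root, package="numpy"):
    """Map each top-level entry under `root` to the files in it that
    import `package` directly (absolute imports only)."""
    hits = {}
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0:
                names = [node.module or ""]
            else:
                continue
            if any(n == package or n.startswith(package + ".") for n in names):
                rel = path.relative_to(root)
                hits.setdefault(rel.parts[0], []).append(str(rel))
                break  # one hit per file is enough
    return hits
```

For example, `modules_importing("src/pybind/mgr")` run from a Ceph checkout would list the module directories that contain a direct numpy import. An empty result for a given module does not rule out an indirect import.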

Actions #4

Updated by Jeff Layton about 3 years ago

Josh, did you ever get an answer on this one?

Actions #5

Updated by Jeff Layton about 3 years ago

  • Assignee set to Josh Durgin

Assigning to Josh for now since he was investigating whether we could remove numpy.

Actions #6

Updated by Josh Durgin about 3 years ago

  • Assignee deleted (Josh Durgin)

Hey Jeff, IIRC the answer was no, it's not an inherent dependency, though it would take some work to switch to another library.

Are you seeing this crash very often? I'm only seeing one cluster reporting this in telemetry, which appears to be a test cluster (only one osd host).

Actions #7

Updated by Jeff Layton about 3 years ago

At one time I was seeing it routinely when trying to run a vstart cluster on Fedora 33. Today, though, my build seemed to work just fine. I expect we'll see it again; I probably just got lucky.

Actions #8

Updated by Sebastian Wagner almost 3 years ago

turns out rook also uses numpy!

Actions #9

Updated by Sebastian Wagner almost 3 years ago

  • Related to Bug #45574: subinterpreters: ceph/mgr/rook RuntimeError on import of RookOrchestrator - ceph cluster does not start added

Actions #10

Updated by Ali Maredia almost 3 years ago

I am also seeing the same crash on vstart clusters built from the master branch on Fedora 33. I will just run without the mgr for now.
