Bug #58269

ceph mgr fail after upgrade to pacific

Added by Eugen Block over 1 year ago. Updated about 1 year ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
ceph-mgr
Target version:
% Done:

90%

Source:
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading from Nautilus to Pacific (16.2.10) I'm experiencing failing MGR daemons. The pods are still running but stop logging and responding. The standby MGRs take over until the last one becomes unresponsive, resulting in a "no active MGR" warning. I'm not sure if [1] is exactly what I'm facing here, but it looks like a deadlock to me. I noticed the same behavior in a customer cluster upgraded from Octopus to Pacific about two months ago, currently running 16.2.9. The only thing I did in those clusters was browse the dashboard to compare log settings. I read somewhere that the prometheus module could play a role in this, but it's not enabled in our cluster (while it is running in the customer cluster). It seems reproducible just by clicking through the dashboard long enough.
I managed to get a gdb.txt from the customer cluster, attaching it to this tracker issue.

ceph01:~ # ceph versions
{
    "mon": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 35
    },
    "mds": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 3
    },
    "rgw": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 44
    }
}

[1] https://tracker.ceph.com/issues/55687

pacific-mgr-deadlock-gdb.txt (302 KB) Eugen Block, 12/14/2022 11:55 AM

20221215_ndeceph03_ceph-mgr.gdb.txt (223 KB) Mykola Golub, 12/16/2022 06:27 PM


Related issues

Copied to mgr - Backport #58805: pacific: ceph mgr fail after upgrade to pacific Resolved
Copied to mgr - Backport #58806: quincy: ceph mgr fail after upgrade to pacific In Progress

History

#1 Updated by Mykola Golub over 1 year ago

Eugen provided me with a backtrace for another case. It looks similar, but some symbols are better resolved, so the picture is much clearer here.

I think the problem is with these threads:

Thread 83 (Thread 0x7fc8d919b700 (LWP 229)):
#0 0x00007fc9b5753838 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fc9bf896e21 in take_gil () from /lib64/libpython3.6m.so.1.0
#2 0x00007fc9bf942e92 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
#3 0x00007fc9bf898e62 in _PyFunction_FastCallDict () from /lib64/libpython3.6m.so.1.0
#4 0x00007fc9bf899c3e in _PyObject_FastCallDict () from /lib64/libpython3.6m.so.1.0
#5 0x00007fc9bf8abb30 in method_call () from /lib64/libpython3.6m.so.1.0
#6 0x00007fc9bf899c1c in _PyObject_FastCallDict () from /lib64/libpython3.6m.so.1.0
#7 0x00007fc9bf9965cd in slot_tp_finalize () from /lib64/libpython3.6m.so.1.0
#8 0x00007fc9bf915e5f in collect () from /lib64/libpython3.6m.so.1.0
#9 0x00007fc9bf95a91d in collect_with_callback () from /lib64/libpython3.6m.so.1.0
#10 0x00007fc9bf89ef00 in _PyObject_GC_New () from /lib64/libpython3.6m.so.1.0
#11 0x00007fc9bf8aabec in PyList_New () from /lib64/libpython3.6m.so.1.0
#12 0x0000563bf29bd7be in PyFormatter::open_array_section (this=0x7fc8d9192660, name="tags") at /usr/src/debug/ceph-16.2.10-0.el8.x86_64/src/mgr/PyFormatter.cc:26
#13 0x00007fc9b6ac5545 in Option::dump(ceph::Formatter*) const () from /usr/lib64/ceph/libceph-common.so.2
#14 0x00007fc9b6a88130 in md_config_t::config_options(ceph::Formatter*) const () from /usr/lib64/ceph/libceph-common.so.2
#15 0x0000563bf28f953b in ceph::common::ConfigProxy::config_options (f=0x7fc8d9192660, this=0x563bf4cf2008) at /usr/src/debug/ceph-16.2.10-0.el8.x86_64/src/common/config_proxy.h:243
#16 ActivePyModules::get_python (this=0x563bfc996400, what=...) at /usr/src/debug/ceph-16.2.10-0.el8.x86_64/src/mgr/ActivePyModules.cc:256
#17 0x0000563bf28fb8fd in ActivePyModules::cacheable_get_python (this=this@entry=0x563bfc996400, what="config_options") at /usr/src/debug/ceph-16.2.10-0.el8.x86_64/src/mgr/ActivePyModules.cc:193
#18 0x0000563bf290feff in ceph_state_get (self=<optimized out>, args=<optimized out>) at /usr/include/c++/8/ext/new_allocator.h:79
...
Thread 49 (Thread 0x7fc8f81d9700 (LWP 158)):
#0 0x00007fc9b575681d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fc9b574fb94 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2 0x0000563bf29189af in __gthread_mutex_lock (__mutex=0x563bf4cf5790) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3 __gthread_recursive_mutex_lock (__mutex=0x563bf4cf5790) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:810
#4 std::recursive_mutex::lock (this=0x563bf4cf5790) at /usr/include/c++/8/mutex:107
#5 std::lock_guard<std::recursive_mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/8/bits/std_mutex.h:162
#6 ceph::common::ConfigProxy::get_val (val=0x7fc8f81d5210, key=..., this=0x563bf4cf2008) at /usr/src/debug/ceph-16.2.10-0.el8.x86_64/src/common/config_proxy.h:137
#7 ceph_option_get (self=<optimized out>, args=<optimized out>) at /usr/src/debug/ceph-16.2.10-0.el8.x86_64/src/mgr/BaseMgrModule.cc:415
#8 0x00007fc9bf93b0d7 in call_function () from /lib64/libpython3.6m.so.1.0
#9 0x00007fc9bf93bfb8 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
...
Thread 83 was executing the "config dump" command: it took the ConfigProxy lock and then started building the Python formatter output. While doing so it dropped the Python global interpreter lock (GIL) to allow other Python threads to execute (see [1]), and then tried to take it back (take_gil) to continue. At that point thread 49 (ceph_option_get), which was holding the GIL, wanted to take the ConfigProxy lock to read a config option value. So we are in a deadlock: the first thread holds the config lock and needs the GIL to continue, while the second thread holds the GIL and needs the config lock to continue.
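To make the inversion concrete, here is a minimal, self-contained Python sketch of the same pattern, with the ConfigProxy lock and the GIL each modelled as an ordinary lock. All names are placeholders; this is not ceph-mgr code, and timeouts are used only so the demo terminates instead of hanging.

import threading
import time

config_lock = threading.Lock()   # stands in for the ConfigProxy lock
gil = threading.Lock()           # stands in for the Python GIL

def config_dump():
    # Thread 83's pattern: hold the config lock, then (re)acquire the "GIL"
    # to build the formatter output.
    with config_lock:
        time.sleep(0.1)          # give the other thread time to grab the "GIL"
        if not gil.acquire(timeout=2):
            print("config_dump: stuck waiting for the GIL while holding the config lock")
            return
        gil.release()

def option_get():
    # Thread 49's pattern: hold the "GIL", then ask for the config lock.
    with gil:
        time.sleep(0.1)
        if not config_lock.acquire(timeout=2):
            print("option_get: stuck waiting for the config lock while holding the GIL")
            return
        config_lock.release()

t1 = threading.Thread(target=config_dump)
t2 = threading.Thread(target=option_get)
t1.start(); t2.start()
t1.join(); t2.join()

Run as a plain script, both threads report being stuck on the lock the other one holds, which is exactly the situation visible in the two backtraces above.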

I was able to reproduce the issue by modifying one of the mgr modules to call ceph_option_get frequently and (re)loading the pool page in the dashboard (the /#/pool URL), which triggers the problematic code path (get_python("config_options")). A sketch of such a reproducer module is shown below.
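Roughly, such a throwaway reproducer module could look like the sketch below. It assumes the standard MgrModule Python API (get_ceph_option, which calls ceph_option_get under the hood); the option name is an arbitrary placeholder, and this is not the exact change that was used.

from mgr_module import MgrModule
import time

class Module(MgrModule):
    def serve(self):
        # Hammer ceph_option_get from the module's serve() thread while the
        # dashboard pool page is being reloaded; each call takes the
        # ConfigProxy lock while holding the GIL.
        while True:
            self.get_ceph_option('mon_max_pg_per_osd')
            time.sleep(0.001)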

The fix could be as simple as copying the global config to a local instance and using that with the Python formatter.
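The actual fix is on the C++ side (see the pull request below), but the general shape of the idea can be sketched in Python as "snapshot under the lock, format outside the lock"; all names here are illustrative only.

import threading
import copy

shared_config = {"mon_max_pg_per_osd": "250"}   # stands in for the global config
config_lock = threading.Lock()                   # stands in for the ConfigProxy lock

def dump_config(emit):
    with config_lock:
        snapshot = copy.deepcopy(shared_config)  # copy while holding the lock
    # The lock is released here, so the formatting step (which may block on
    # the GIL in the real code) can no longer deadlock against ceph_option_get.
    for key, value in snapshot.items():
        emit(key, value)

dump_config(lambda k, v: print(f"{k} = {v}"))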

[1] https://github.com/python/cpython/blob/3.7/Python/ceval.c#L979

#2 Updated by Mykola Golub over 1 year ago

  • Pull request ID set to 49487

#3 Updated by Mykola Golub over 1 year ago

  • Backport set to pacific,quincy

#4 Updated by Neha Ojha about 1 year ago

  • Status changed from In Progress to Fix Under Review

#6 Updated by Konstantin Shalygin about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v18.0.0
  • % Done changed from 0 to 90

#7 Updated by Backport Bot about 1 year ago

  • Copied to Backport #58805: pacific: ceph mgr fail after upgrade to pacific added

#8 Updated by Backport Bot about 1 year ago

  • Copied to Backport #58806: quincy: ceph mgr fail after upgrade to pacific added

#9 Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed
