Bug #35985: deadlock in standby ceph-mgr daemons
Status: Closed
Description
From "[ceph-users] Standby mgr stopped sending beacons after upgrade to 12.2.8"
Thread 11 (Thread 0x7fc30888d700 (LWP 224053)):
#0  0x00007fc30f2e0afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fc30f2e0b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fc30f2e0c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fc311275735 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
#4  0x00007fc311241296 in PyEval_RestoreThread () from /lib64/libpython2.7.so.1.0
#5  0x00007fc31127942e in lock_PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
#6  0x00007fc311248cf0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#7  0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#8  0x00007fc31124853c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00007fc3112486bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fc3112486bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#12 0x00007fc31124853c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#13 0x00007fc3112486bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#14 0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#15 0x00007fc3111d4978 in function_call () from /lib64/libpython2.7.so.1.0
#16 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#17 0x00007fc3111bea55 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#18 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#19 0x00007fc311206a87 in slot_tp_init () from /lib64/libpython2.7.so.1.0
#20 0x00007fc31120579f in type_call () from /lib64/libpython2.7.so.1.0
#21 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#22 0x00007fc3112418f7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#23 0x000056446a400480 in StandbyPyModule::load (this=0x564475c03420) at /usr/src/debug/ceph-12.2.8/src/mgr/StandbyPyModules.cc:124
#24 0x000056446a40160f in StandbyPyModules::start_one (this=0x564476345340, module_name="prometheus", pClass=<optimized out>, pMyThreadState=...) at /usr/src/debug/ceph-12.2.8/src/mgr/StandbyPyModules.cc:96
#25 0x000056446a405265 in PyModuleRegistry::standby_start (this=this@entry=0x7ffd3bf1e680, monc=monc@entry=0x7ffd3bf1c8c8) at /usr/src/debug/ceph-12.2.8/src/mgr/PyModuleRegistry.cc:321
#26 0x000056446a41a246 in MgrStandby::handle_mgr_map (this=this@entry=0x7ffd3bf1c8b0, mmap=mmap@entry=0x5644755942c0) at /usr/src/debug/ceph-12.2.8/src/mgr/MgrStandby.cc:361
#27 0x000056446a41ab04 in MgrStandby::ms_dispatch (this=0x7ffd3bf1c8b0, m=0x5644755942c0) at /usr/src/debug/ceph-12.2.8/src/mgr/MgrStandby.cc:376
#28 0x000056446a815cb2 in ms_deliver_dispatch (m=0x5644755942c0, this=0x564475332700) at /usr/src/debug/ceph-12.2.8/src/msg/Messenger.h:668
#29 DispatchQueue::entry (this=0x564475332858) at /usr/src/debug/ceph-12.2.8/src/msg/DispatchQueue.cc:197
#30 0x000056446a5ffbed in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /usr/src/debug/ceph-12.2.8/src/msg/DispatchQueue.h:101
#31 0x00007fc30f2dae25 in start_thread () from /lib64/libpthread.so.0
#32 0x00007fc30e3babad in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc2eb7dc700 (LWP 224066)):
#0  0x00007fc30f2de995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000056446a402e1e in Cond::Wait (this=0x564476345558, mutex=...) at /usr/src/debug/ceph-12.2.8/src/common/Cond.h:48
#2  0x000056446a400ab5 in with_config<StandbyPyModule::get_config(const string&, std::string*) const::__lambda8> (cb=<unknown type in /usr/lib/debug/usr/bin/ceph-mgr.debug, CU 0xc5a7bd, DIE 0xe5b056>, this=0x5644763453d0) at /usr/src/debug/ceph-12.2.8/src/mgr/StandbyPyModules.h:76
#3  StandbyPyModule::get_config (this=0x564475c036c0, key="ceph06/server_addr", value=value@entry=0x7fc2eb7d9ef0) at /usr/src/debug/ceph-12.2.8/src/mgr/StandbyPyModules.cc:186
#4  0x000056446a413a91 in ceph_config_get (self=0x7fc2ec096bd8, args=<optimized out>) at /usr/src/debug/ceph-12.2.8/src/mgr/BaseMgrStandbyModule.cc:73
#5  0x00007fc311248cf0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#6  0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#7  0x00007fc31124853c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8  0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#9  0x00007fc31124853c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fc3111d4978 in function_call () from /lib64/libpython2.7.so.1.0
#12 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00007fc3111bea55 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#14 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007fc3111afb45 in call_function_tail () from /lib64/libpython2.7.so.1.0
#16 0x00007fc3111afe7b in PyObject_CallMethod () from /lib64/libpython2.7.so.1.0
#17 0x000056446a407ffc in PyModuleRunner::serve (this=0x564475c036c0) at /usr/src/debug/ceph-12.2.8/src/mgr/PyModuleRunner.cc:51
#18 0x000056446a40867f in PyModuleRunner::PyModuleRunnerThread::entry (this=0x564475c036f8) at /usr/src/debug/ceph-12.2.8/src/mgr/PyModuleRunner.cc:112
#19 0x00007fc30f2dae25 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fc30e3babad in clone () from /lib64/libc.so.6
Updated by John Spray over 5 years ago
StandbyPyModule::get_config is using state.with_config without dropping the GIL around taking the lock.
Updated by Brad Hubbard about 5 years ago
As John said, thread 2 is here in src/mgr/StandbyPyModules.cc:
168 bool StandbyPyModule::get_config(const std::string &key,
169 std::string *value) const
170 {
171 PyThreadState *tstate = PyEval_SaveThread();
172 PyEval_RestoreThread(tstate);
173
174 const std::string global_key = PyModuleRegistry::config_prefix
175 + module_name + "/" + key;
176
177 dout(4) << __func__ << "key: " << global_key << dendl;
178
179 return state.with_config([global_key, value](const PyModuleConfig &config){
180 if (config.count(global_key)) {
181 *value = config.at(global_key);
182 return true;
183 } else {
184 return false;
185 }
186 });
It's taken the GIL via PyEval_RestoreThread and is currently calling state.with_config, which looks like this:
34 class StandbyPyModuleState
35 {
36 mutable Mutex lock{"StandbyPyModuleState::lock"};
...
70 template<typename Callback, typename...Args>
71 auto with_config(Callback&& cb, Args&&... args) const ->
72 decltype(cb(config_cache, std::forward<Args>(args)...)) {
73 Mutex::Locker l(lock);
74
75 if (!is_config_loaded) {
76 config_loaded.Wait(lock); <-- HERE
77 }
It's blocked inside with_config, on "StandbyPyModuleState::lock", while still holding the GIL. Meanwhile, thread 11 has taken that lock in StandbyPyModules::start_one and has called load().
80 int StandbyPyModules::start_one(std::string const &module_name,
81 PyObject *pClass, const SafeThreadState &pMyThreadState)
82 {
83 Mutex::Locker l(lock); <-- "StandbyPyModuleState::lock"
84
85 assert(modules.count(module_name) == 0);
86
87 modules[module_name].reset(new StandbyPyModule(
88 state,
89 module_name, pClass,
90 pMyThreadState, clog));
91
92 if (modules.size() == 1) {
93 load_config_thread.create("LoadConfig");
94 }
95
96 int r = modules[module_name]->load(); <-- HERE
In load() we create a Gil object, but the call that actually acquires the GIL is deferred until we reach PyThread_acquire_lock via the call to PyObject_CallObject.
110 int StandbyPyModule::load()
111 {
112 Gil gil(pMyThreadState, true);
113
114 // We tell the module how we name it, so that it can be consistent
115 // with us in logging etc.
116 auto pThisPtr = PyCapsule_New(this, nullptr, nullptr);
117 assert(pThisPtr != nullptr);
118 auto pModuleName = PyString_FromString(module_name.c_str());
119 assert(pModuleName != nullptr);
120 auto pArgs = PyTuple_Pack(2, pModuleName, pThisPtr);
121 Py_DECREF(pThisPtr);
122 Py_DECREF(pModuleName);
123
124 pClassInstance = PyObject_CallObject(pClass, pArgs); <--- HERE
So thread 2 holds the GIL and needs "StandbyPyModuleState::lock", while thread 11 holds "StandbyPyModuleState::lock" and needs the GIL. A classic ABBA deadlock.
This appears to have been fixed in https://github.com/ceph/ceph/pull/25360/commits, specifically https://github.com/ceph/ceph/pull/25360/commits/d590a5391876eda23ef286af5800e5bdbb2b4b97. That's not a straight backport, however.
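The shape of that fix is to release the GIL (PyEval_SaveThread) before blocking on the mgr-side lock and reacquire it (PyEval_RestoreThread) afterwards. A self-contained sketch of the idiom, with a std::mutex standing in for the GIL so it compiles without the Python headers (ScopedGilRelease and get_config_pattern are hypothetical names, not Ceph's):

```cpp
#include <cassert>
#include <mutex>

// Illustrative stand-in for the GIL (not CPython or Ceph code).
std::mutex gil;

// RAII guard mirroring PyEval_SaveThread()/PyEval_RestoreThread():
// releases the "GIL" on construction and reacquires it on destruction,
// so any blocking work in between cannot deadlock against GIL waiters.
class ScopedGilRelease {
  std::unique_lock<std::mutex> &lock_;
public:
  explicit ScopedGilRelease(std::unique_lock<std::mutex> &l) : lock_(l) {
    lock_.unlock();   // "save the thread state"
  }
  ~ScopedGilRelease() {
    lock_.lock();     // "restore the thread state"
  }
};

bool get_config_pattern(bool &out) {
  std::unique_lock<std::mutex> g(gil);  // the Python thread owns the "GIL"
  {
    ScopedGilRelease release(g);        // drop it before blocking...
    out = true;                         // ...do the with_config()-style work here
  }                                     // ...and take it back on scope exit
  return g.owns_lock();                 // true: the "GIL" is held again
}
```

The RAII form guarantees the GIL is reacquired on every exit path, which is why the actual CPython macros (Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS) wrap the same save/restore pair.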
Updated by Brad Hubbard about 5 years ago
- Status changed from New to 15
- Assignee set to Brad Hubbard
- Backport set to mimic, luminous
Updated by Brad Hubbard about 5 years ago
- Status changed from 15 to Pending Backport
Updated by Brad Hubbard about 5 years ago
- Copied to Backport #38459: mimic: deadlock in standby ceph-mgr daemons added
Updated by Brad Hubbard about 5 years ago
- Copied to Backport #38460: luminous: deadlock in standby ceph-mgr daemons added
Updated by Brad Hubbard about 5 years ago
- Subject changed from luminous: deadlock in standby ceph-mgr daemons to deadlock in standby ceph-mgr daemons
Updated by Lenz Grimmer about 5 years ago
- Related to Bug #38407: Funny issues with python sub-interpreters added
Updated by Lenz Grimmer about 5 years ago
- Related to Cleanup #38467: Audit other functions in src/mgr/ActivePyModules.cc for thread safety in light of deadlock seen in #35985 added
Updated by Sebastian Wagner about 5 years ago
- Related to deleted (Bug #38407: Funny issues with python sub-interpreters)
Updated by Nathan Cutler about 5 years ago
Backport issues look good! Thanks, @Brad.
Updated by Nathan Cutler about 5 years ago
- Status changed from Pending Backport to Resolved
Updated by Brad Hubbard about 5 years ago
The original patch (d590a5391876eda23ef286af5800e5bdbb2b4b97) addressed a race in ActivePyModules.cc, whilst the following addresses the race in BaseMgrStandbyModule.cc.
Updated by Sebastian Wagner almost 5 years ago
- Related to Bug #39335: deadlock on command completion added
Updated by Kefu Chai over 4 years ago
- Has duplicate Bug #42086: luminous : standby mgr down after disable/enabling modules repeatly added