Project

General

Profile

Bug #35985

deadlock in standby ceph-mgr daemons

Added by John Spray 10 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
ceph-mgr
Target version:
-
Start date:
09/14/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

From "[ceph-users] Standby mgr stopped sending beacons after upgrade to 12.2.8"

Thread 11 (Thread 0x7fc30888d700 (LWP 224053)):
#0  0x00007fc30f2e0afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fc30f2e0b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fc30f2e0c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fc311275735 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
#4  0x00007fc311241296 in PyEval_RestoreThread () from /lib64/libpython2.7.so.1.0
#5  0x00007fc31127942e in lock_PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
#6  0x00007fc311248cf0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#7  0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#8  0x00007fc31124853c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00007fc3112486bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fc3112486bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#12 0x00007fc31124853c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#13 0x00007fc3112486bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#14 0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#15 0x00007fc3111d4978 in function_call () from /lib64/libpython2.7.so.1.0
#16 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#17 0x00007fc3111bea55 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#18 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#19 0x00007fc311206a87 in slot_tp_init () from /lib64/libpython2.7.so.1.0
#20 0x00007fc31120579f in type_call () from /lib64/libpython2.7.so.1.0
#21 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#22 0x00007fc3112418f7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#23 0x000056446a400480 in StandbyPyModule::load (this=0x564475c03420) at /usr/src/debug/ceph-12.2.8/src/mgr/StandbyPyModules.cc:124
#24 0x000056446a40160f in StandbyPyModules::start_one (this=0x564476345340, module_name="prometheus", pClass=<optimized out>, 
    pMyThreadState=...) at /usr/src/debug/ceph-12.2.8/src/mgr/StandbyPyModules.cc:96
#25 0x000056446a405265 in PyModuleRegistry::standby_start (this=this@entry=0x7ffd3bf1e680, monc=monc@entry=0x7ffd3bf1c8c8)
    at /usr/src/debug/ceph-12.2.8/src/mgr/PyModuleRegistry.cc:321
#26 0x000056446a41a246 in MgrStandby::handle_mgr_map (this=this@entry=0x7ffd3bf1c8b0, mmap=mmap@entry=0x5644755942c0)
    at /usr/src/debug/ceph-12.2.8/src/mgr/MgrStandby.cc:361
#27 0x000056446a41ab04 in MgrStandby::ms_dispatch (this=0x7ffd3bf1c8b0, m=0x5644755942c0)
    at /usr/src/debug/ceph-12.2.8/src/mgr/MgrStandby.cc:376
#28 0x000056446a815cb2 in ms_deliver_dispatch (m=0x5644755942c0, this=0x564475332700) at /usr/src/debug/ceph-12.2.8/src/msg/Messenger.h:668
#29 DispatchQueue::entry (this=0x564475332858) at /usr/src/debug/ceph-12.2.8/src/msg/DispatchQueue.cc:197
#30 0x000056446a5ffbed in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /usr/src/debug/ceph-12.2.8/src/msg/DispatchQueue.h:101
#31 0x00007fc30f2dae25 in start_thread () from /lib64/libpthread.so.0
#32 0x00007fc30e3babad in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc2eb7dc700 (LWP 224066)):
#0  0x00007fc30f2de995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000056446a402e1e in Cond::Wait (this=0x564476345558, mutex=...) at /usr/src/debug/ceph-12.2.8/src/common/Cond.h:48
#2  0x000056446a400ab5 in with_config<StandbyPyModule::get_config(const string&, std::string*) const::__lambda8> (
    cb=<unknown type in /usr/lib/debug/usr/bin/ceph-mgr.debug, CU 0xc5a7bd, DIE 0xe5b056>, this=0x5644763453d0)
    at /usr/src/debug/ceph-12.2.8/src/mgr/StandbyPyModules.h:76
#3  StandbyPyModule::get_config (this=0x564475c036c0, key="ceph06/server_addr", value=value@entry=0x7fc2eb7d9ef0)
    at /usr/src/debug/ceph-12.2.8/src/mgr/StandbyPyModules.cc:186
#4  0x000056446a413a91 in ceph_config_get (self=0x7fc2ec096bd8, args=<optimized out>)
    at /usr/src/debug/ceph-12.2.8/src/mgr/BaseMgrStandbyModule.cc:73
#5  0x00007fc311248cf0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#6  0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#7  0x00007fc31124853c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8  0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#9  0x00007fc31124853c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fc31124b03d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fc3111d4978 in function_call () from /lib64/libpython2.7.so.1.0
#12 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00007fc3111bea55 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#14 0x00007fc3111afa63 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007fc3111afb45 in call_function_tail () from /lib64/libpython2.7.so.1.0
#16 0x00007fc3111afe7b in PyObject_CallMethod () from /lib64/libpython2.7.so.1.0
#17 0x000056446a407ffc in PyModuleRunner::serve (this=0x564475c036c0) at /usr/src/debug/ceph-12.2.8/src/mgr/PyModuleRunner.cc:51
#18 0x000056446a40867f in PyModuleRunner::PyModuleRunnerThread::entry (this=0x564475c036f8)
    at /usr/src/debug/ceph-12.2.8/src/mgr/PyModuleRunner.cc:112
#19 0x00007fc30f2dae25 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fc30e3babad in clone () from /lib64/libc.so.6

Related issues

Related to mgr - Cleanup #38467: Audit other functions in src/mgr/ActivePyModules.cc for thread safety in light of deadlock seen in #35985 New 02/25/2019
Related to mgr - Bug #39335: deadlock on command completion Resolved 04/16/2019
Copied to mgr - Backport #38459: mimic: deadlock in standby ceph-mgr daemons Resolved
Copied to mgr - Backport #38460: luminous: deadlock in standby ceph-mgr daemons Resolved

History

#1 Updated by John Spray 10 months ago

  • Category set to ceph-mgr

#2 Updated by John Spray 10 months ago

StandbyPyModule::get_config is using state.with_config without dropping the GIL around taking the lock.

#3 Updated by Brad Hubbard 5 months ago

As John said thread 2 is here in src/mgr/StandbyPyModules.cc

168 bool StandbyPyModule::get_config(const std::string &key,                                                                                                                                                                                 
169                                  std::string *value) const                                                                                                                                                                               
170 {                                                                                                                                                                                                                                        
171   PyThreadState *tstate = PyEval_SaveThread();                                                                                                                                                                                           
172   PyEval_RestoreThread(tstate);                                                                                                                                                                                                          
173                                                                                                                                                                                                                                          
174   const std::string global_key = PyModuleRegistry::config_prefix                                                                                                                                                                         
175     + module_name + "/" + key;                                                                                                                                                                                                           
176                                                                                                                                                                                                                                          
177   dout(4) << __func__ << "key: " << global_key << dendl;                                                                                                                                                                                 
178                                                                                                                                                                                                                                          
179   return state.with_config([global_key, value](const PyModuleConfig &config){                                                                                                                                                            
180     if (config.count(global_key)) {                                                                                                                                                                                                      
181       *value = config.at(global_key);                                                                                                                                                                                                    
182       return true;                                                                                                                                                                                                                       
183     } else {                                                                                                                                                                                                                             
184       return false;                                                                                                                                                                                                                      
185     }
186   });

It's taken the GIL via PyEval_RestoreThread and is currently calling state.with_config which looks like this.

 34 class StandbyPyModuleState
 35 {
 36   mutable Mutex lock{"StandbyPyModuleState::lock"};
...
 70   template<typename Callback, typename...Args>
 71   auto with_config(Callback&& cb, Args&&... args) const ->                                                                                                                                                                               
 72     decltype(cb(config_cache, std::forward<Args>(args)...)) {
 73     Mutex::Locker l(lock);
 74 
 75     if (!is_config_loaded) {
 76       config_loaded.Wait(lock);  <-- HERE
 77     }

It's waiting to acquire the "StandbyPyModuleState::lock". Meanwhile thread 11 has taken the same lock in StandbyPyModules::start_one and has called "load".

 80 int StandbyPyModules::start_one(std::string const &module_name,                                                                                                                                                                          
 81     PyObject *pClass, const SafeThreadState &pMyThreadState)                                                                                                                                                                             
 82 {                                                                                                                                                                                                                                        
 83   Mutex::Locker l(lock); <--  "StandbyPyModuleState::lock"                                                                                                                                                                                                               
 84                                                                                                                                                                                                                                          
 85   assert(modules.count(module_name) == 0);                                                                                                                                                                                               
 86                                                                                                                                                                                                                                          
 87   modules[module_name].reset(new StandbyPyModule(                                                                                                                                                                                        
 88       state,                                                                                                                                                                                                                             
 89       module_name, pClass,                                                                                                                                                                                                               
 90       pMyThreadState, clog));                                                                                                                                                                                                            
 91                                                                                                                                                                                                                                          
 92   if (modules.size() == 1) {                                                                                                                                                                                                             
 93     load_config_thread.create("LoadConfig");                                                                                                                                                                                             
 94   }                                                                                                                                                                                                                                      
 95                                                                                                                                                                                                                                          
 96   int r = modules[module_name]->load(); <-- HERE

In load we create a GIL object but the call to acquire the GIL is deferred until we call PyThread_acquire_lock via the call to PyObject_CallObject.

110 int StandbyPyModule::load()                                                                                                                                                                                                              
111 {                                                                                                                                                                                                                                        
112   Gil gil(pMyThreadState, true);                                                                                                                                                                                                         
113                                                                                                                                                                                                                                          
114   // We tell the module how we name it, so that it can be consistent                                                                                                                                                                     
115   // with us in logging etc.                                                                                                                                                                                                             
116   auto pThisPtr = PyCapsule_New(this, nullptr, nullptr);                                                                                                                                                                                 
117   assert(pThisPtr != nullptr);                                                                                                                                                                                                           
118   auto pModuleName = PyString_FromString(module_name.c_str());                                                                                                                                                                           
119   assert(pModuleName != nullptr);                                                                                                                                                                                                        
120   auto pArgs = PyTuple_Pack(2, pModuleName, pThisPtr);                                                                                                                                                                                   
121   Py_DECREF(pThisPtr);                                                                                                                                                                                                                   
122   Py_DECREF(pModuleName);                                                                                                                                                                                                                
123                                                                                                                                                                                                                                          
124   pClassInstance = PyObject_CallObject(pClass, pArgs); <--- HERE

So thread 2 holds the GIL and needs the "StandbyPyModuleState::lock" while thread 11 holds the "StandbyPyModuleState::lock" and needs the GIL. Classic deadlock.

This appears to have been fixed in https://github.com/ceph/ceph/pull/25360/commits, specifically https://github.com/ceph/ceph/pull/25360/commits/d590a5391876eda23ef286af5800e5bdbb2b4b97 That's not a straight backport however.

#4 Updated by Brad Hubbard 5 months ago

  • Status changed from New to Pending Upstream
  • Assignee set to Brad Hubbard
  • Backport set to mimic, luminous

#5 Updated by Brad Hubbard 5 months ago

  • Status changed from Pending Upstream to Pending Backport

#6 Updated by Brad Hubbard 5 months ago

  • Copied to Backport #38459: mimic: deadlock in standby ceph-mgr daemons added

#7 Updated by Brad Hubbard 5 months ago

  • Copied to Backport #38460: luminous: deadlock in standby ceph-mgr daemons added

#8 Updated by Brad Hubbard 5 months ago

  • Subject changed from luminous: deadlock in standby ceph-mgr daemons to deadlock in standby ceph-mgr daemons

#9 Updated by Lenz Grimmer 5 months ago

  • Related to Bug #38407: Funny issues with python sub-interpreters added

#10 Updated by Lenz Grimmer 5 months ago

  • Related to Cleanup #38467: Audit other functions in src/mgr/ActivePyModules.cc for thread safety in light of deadlock seen in #35985 added

#11 Updated by Sebastian Wagner 5 months ago

  • Related to deleted (Bug #38407: Funny issues with python sub-interpreters)

#12 Updated by Nathan Cutler 5 months ago

Backport issues look good! Thanks, @Brad.

#13 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved

#14 Updated by Brad Hubbard 3 months ago

The original patch (d590a5391876eda23ef286af5800e5bdbb2b4b97) addressed a race in ActivePyModules.cc whilst the following addresses the race in BaseMgrStandbyModule.cc.

https://github.com/ceph/ceph/pull/27639

#15 Updated by Sebastian Wagner 2 months ago

  • Related to Bug #39335: deadlock on command completion added

#16 Updated by Sebastian Wagner 2 months ago

  • Pull request ID set to 27639

Also available in: Atom PDF