Bug #38621
closedpy modules fail to cleanly shut down
0%
Description
When running backend api tests locally on master:
src/pybind/mgr/dashboard/run-backend-api-tests.sh
2019-03-07 09:53:12,042.042 INFO:__main__:Running ['./bin/ceph', 'mgr', 'fail', 'y']
2019-03-07 09:53:14,143.143 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:14,157.157 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:14,157.157 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:14,181.181 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:14,182.182 INFO:__main__:Killing PID 4214 for mgr.x
2019-03-07 09:53:14,182.182 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:14,199.199 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:15,200.200 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:15,210.210 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:16,212.212 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:16,240.240 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:17,243.243 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:17,276.276 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:18,280.280 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:18,312.312 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:19,315.315 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:19,350.350 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:20,352.352 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:20,385.385 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
[...]
2019-03-07 09:58:25,918.918 INFO:__main__:======================================================================
2019-03-07 09:58:25,919.919 INFO:__main__:ERROR: test_standby (tasks.mgr.test_dashboard.TestDashboard)
2019-03-07 09:58:25,919.919 INFO:__main__:----------------------------------------------------------------------
2019-03-07 09:58:25,920.920 INFO:__main__:Traceback (most recent call last):
2019-03-07 09:58:25,920.920 INFO:__main__: File "/ceph/qa/tasks/mgr/test_dashboard.py", line 18, in setUp
2019-03-07 09:58:25,921.921 INFO:__main__: self._assign_ports("dashboard", "server_port")
2019-03-07 09:58:25,921.921 INFO:__main__: File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 182, in assign_ports
2019-03-07 09:58:25,922.922 INFO:_main__: cls.mgr_cluster.mgr_stop(mgr_id)
2019-03-07 09:58:25,922.922 INFO:__main__: File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 30, in mgr_stop
2019-03-07 09:58:25,923.923 INFO:__main__: self.mgr_daemons[mgr_id].stop()
2019-03-07 09:58:25,924.924 INFO:__main__: File "../qa/tasks/vstart_runner.py", line 373, in stop
2019-03-07 09:58:25,924.924 INFO:__main__: self.daemon_type, self.daemon_id))
2019-03-07 09:58:25,925.925 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x
In mgr.x.log there is this error:
0> 2019-03-07 09:52:42.753 7f014909e700 -1 ** Caught signal (Aborted) *
in thread 7f014909e700 thread_name:mgrsb-fin
ceph version Development (no_version) nautilus (rc)
1: (()+0x9cc178) [0x55c47e5dd178]
2: (()+0x13030) [0x7f014eff2030]
3: (gsignal()+0x10f) [0x7f014e6a953f]
4: (abort()+0x127) [0x7f014e693895]
5: (()+0x7012e) [0x7f014f7f412e]
6: (PyGrammar_LabelRepr()+0) [0x7f014f7f418e]
7: (()+0x722ea) [0x7f014f7f62ea]
8: (Py_FinalizeEx()+0x13e) [0x7f014f9bf89e]
9: (PyModuleRegistry::shutdown()+0xa3) [0x55c47e43d263]
10: (()+0x7f3b02) [0x55c47e404b02]
11: (()+0x7f6932) [0x55c47e407932]
12: (boost::function1<void, int>::operator()(int) const+0x6c) [0x55c47e312a7c]
13: (FunctionContext::finish(int)+0x24) [0x55c47e30f936]
14: (Context::complete(int)+0x27) [0x55c47e30f7fd]
15: (Finisher::finisher_thread_entry()+0x38b) [0x7f0150e0a36d]
16: (Finisher::FinisherThread::entry()+0x1c) [0x55c47e3f1a30]
17: (Thread::entry_wrapper()+0x78) [0x7f0150e7659e]
18: (Thread::_entry_func(void*)+0x18) [0x7f0150e7651c]
19: (()+0x858e) [0x7f014efe758e]
20: (clone()+0x43) [0x7f014e76e6a3]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Files
Updated by Alfonso Martínez about 5 years ago
Full mgr.x.log:
https://paste.fedoraproject.org/paste/u6k5L9itz94htEvpcIAdXA
Updated by Sage Weil about 5 years ago
- Affected Versions deleted (
v14.0.0)
full backtrace with debugging symbols:
#0 0x00007f93ff93becf in raise () from /lib64/libpthread.so.0 #1 0x00005610c711df07 in reraise_fatal (signum=6) at /home/sage/src/ceph/src/global/signal_handler.cc:81 #2 0x00005610c711ee2b in handle_fatal_signal (signum=6) at /home/sage/src/ceph/src/global/signal_handler.cc:298 #3 <signal handler called> #4 0x00007f93feff353f in raise () from /lib64/libc.so.6 #5 0x00007f93fefdd895 in abort () from /lib64/libc.so.6 #6 0x00007f940013d12e in fatal_error (prefix=prefix@entry=0x0, msg=msg@entry=0x7f940034d090 "PyInterpreterState_Delete: remaining subinterpreters", status=status@entry=-1) at /usr/src/debug/python3-3.7.2-4.fc29.x86_64/Python/pylifecycle.c:2179 #7 0x00007f940013d18e in Py_FatalError (msg=msg@entry=0x7f940034d090 "PyInterpreterState_Delete: remaining subinterpreters") at /usr/src/debug/python3-3.7.2-4.fc29.x86_64/Python/pylifecycle.c:2189 #8 0x00007f940013f2ea in PyInterpreterState_Delete (interp=0x5610c9e98b00) at /usr/src/debug/python3-3.7.2-4.fc29.x86_64/Python/pystate.c:252 #9 0x00007f940030889e in Py_FinalizeEx () at /usr/src/debug/python3-3.7.2-4.fc29.x86_64/Python/pylifecycle.c:1311 #10 0x00005610c6f7e263 in PyModuleRegistry::shutdown (this=0x7fff2fcf5590) at /home/sage/src/ceph/src/mgr/PyModuleRegistry.cc:255 #11 0x00005610c6f45b02 in MgrStandby::<lambda(int)>::operator()(int) const (__closure=0x5610cd033a50) at /home/sage/src/ceph/src/mgr/MgrStandby.cc:289 #12 0x00005610c6f48932 in boost::detail::function::void_function_obj_invoker1<MgrStandby::shutdown()::<lambda(int)>, void, int>::invoke(boost::detail::function::function_buffer &, int) (function_obj_ptr=..., a0=0) at /nvm/src/ceph/build/boost/include/boost/function/function_template.hpp:159 #13 0x00005610c6e53a7c in boost::function1<void, int>::operator() (this=0x5610cd033a48, a0=0) at /nvm/src/ceph/build/boost/include/boost/function/function_template.hpp:768 #14 0x00005610c6e50936 in FunctionContext::finish (this=0x5610cd033a40, r=0) at /home/sage/src/ceph/src/include/Context.h:487 #15 0x00005610c6e507fd in Context::complete (this=0x5610cd033a40, r=0) at /home/sage/src/ceph/src/include/Context.h:77 #16 0x00007f940176836d in Finisher::finisher_thread_entry (this=0x7fff2fcf5378) at /home/sage/src/ceph/src/common/Finisher.cc:67 #17 0x00005610c6f32a30 in Finisher::FinisherThread::entry (this=0x7fff2fcf5498) at /home/sage/src/ceph/src/common/Finisher.h:62 #18 0x00007f94017d459e in Thread::entry_wrapper (this=0x7fff2fcf5498) at /home/sage/src/ceph/src/common/Thread.cc:84 #19 0x00007f94017d451c in Thread::_entry_func (arg=0x7fff2fcf5498) at /home/sage/src/ceph/src/common/Thread.cc:71 #20 0x00007f93ff93158e in start_thread () from /lib64/libpthread.so.0 #21 0x00007f93ff0b86a3 in clone () from /lib64/libc.so.6
note, "PyInterpreterState_Delete: remaining subinterpreters"
i think shutdown() is just wrong. the comment there:
// Ideally, now, we'd be able to do this for all modules: // // Py_EndInterpreter(pMyThreadState); // PyThreadState_Swap(pMainThreadState); // // Unfortunately, if the module has any other *python* threads active // at this point, Py_EndInterpreter() will abort with: // // Fatal Python error: Py_EndInterpreter: not the last thread // // This can happen when using CherryPy in a module, becuase CherryPy // runs an extra thread as a timeout monitor, which spends most of its // life inside a time.sleep(60). Unless you are very, very lucky with // the timing calling this destructor, that thread will still be stuck // in a sleep, and Py_EndInterpreter() will abort. // // This could of course also happen with a poorly written module which // made no attempt to clean up any additional threads it created. // // The safest thing to do is just not call Py_EndInterpreter(), and // let Py_Finalize() kill everything after all modules are shut down.
Updated by Sage Weil about 5 years ago
- Status changed from New to 12
- Priority changed from Normal to Urgent
Updated by Sage Weil about 5 years ago
I added some debugging,
diff --git a/src/mgr/PyModule.cc b/src/mgr/PyModule.cc index 53d9e2b95f..498e3cebb1 100644 --- a/src/mgr/PyModule.cc +++ b/src/mgr/PyModule.cc @@ -322,6 +323,8 @@ int PyModule::load(PyThreadState *pMainThreadState) Gil gil(sts); auto thread_state = Py_NewInterpreter(); + lderr(g_ceph_context) << __func__ << " new interpreter " << thread_state << dendl; + if (thread_state == nullptr) { derr << "Failed to create python sub-interpreter for '" << module_name << '"' << dendl; return -EINVAL; @@ -716,6 +719,10 @@ PyModule::~PyModule() { if (pMyThreadState.ts != nullptr) { Gil gil(pMyThreadState, true); + derr << "~PyModule " << module_name << " interpreter " << pMyThreadState.ts + << " pClass refs " << (pClass ? Py_REFCNT(pClass) : 0) + << " pStandbyClass refs " << (pStandbyClass ? Py_REFCNT(pStandbyClass) : 0) + << dendl; Py_XDECREF(pClass); Py_XDECREF(pStandbyClass); }
and get this,
-87> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x5638407f2120 -86> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule zabbix interpreter 0x5638434525a0 pClass refs 4 pStandbyClass refs 0 -85> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x5638407f2120 -84> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563843452480 -83> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule volumes interpreter 0x563843452360 pClass refs 0 pStandbyClass refs 0 -82> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563843452480 -81> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563843452480 -80> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule test_orchestrator interpreter 0x563843452120 pClass refs 4 pStandbyClass refs 0 -79> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563843452480 -78> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563843452000 -77> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule telemetry interpreter 0x563841dd5e60 pClass refs 4 pStandbyClass refs 0 -76> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563843452000 -75> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563843452480 -74> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule telegraf interpreter 0x563841dd5c20 pClass refs 4 pStandbyClass refs 0 -73> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563843452480 -72> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563841dd5b00 -71> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule status interpreter 0x563841dd58c0 pClass refs 4 pStandbyClass refs 0 -70> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563841dd5b00 -69> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563841dd4a20 -68> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule ssh interpreter 0x563841dd4c60 pClass refs 4 pStandbyClass refs 0 -67> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563841dd4a20 ...
I have no idea where those refs are coming from.
FWIW I didn't notice these crashes until I switched to python3, but apparently they can trigger on python 2 too. in my current python3 fc29 environment i trivially trigger this with vstart + kill `cat out/mgr.x.pid`.
Updated by Sage Weil about 5 years ago
- Priority changed from Urgent to High
Merged the workaround (which does an immediate exit from mgr).
I think we want to fix this properly by allowing the modules to shut down, though, so we can reenable leak detection in the mgr.
Updated by Sage Weil about 5 years ago
- Subject changed from ceph-mgr got SIGTERM but it keeps alive when running API tests based on teuthology to py modules fail to cleanly shut down
Updated by Alfonso Martínez about 5 years ago
Workaround PR:
https://github.com/ceph/ceph/pull/26823
Updated by Lenz Grimmer about 5 years ago
- Category changed from 151 to ceph-mgr