Bug #38621

closed

py modules fail to cleanly shut down

Added by Alfonso Martínez about 5 years ago. Updated over 4 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When running the backend API tests locally on master:

src/pybind/mgr/dashboard/run-backend-api-tests.sh

2019-03-07 09:53:12,042.042 INFO:__main__:Running ['./bin/ceph', 'mgr', 'fail', 'y']
2019-03-07 09:53:14,143.143 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:14,157.157 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:14,157.157 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:14,181.181 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:14,182.182 INFO:__main__:Killing PID 4214 for mgr.x
2019-03-07 09:53:14,182.182 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:14,199.199 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:15,200.200 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:15,210.210 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:16,212.212 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:16,240.240 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:17,243.243 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:17,276.276 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:18,280.280 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:18,312.312 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:19,315.315 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:19,350.350 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x
2019-03-07 09:53:20,352.352 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-03-07 09:53:20,385.385 INFO:__main__:Found ps line for daemon: 4214 ? Ssl 0:03 ./bin/./ceph-mgr -i x

[...]

2019-03-07 09:58:25,918.918 INFO:__main__:======================================================================
2019-03-07 09:58:25,919.919 INFO:__main__:ERROR: test_standby (tasks.mgr.test_dashboard.TestDashboard)
2019-03-07 09:58:25,919.919 INFO:__main__:----------------------------------------------------------------------
2019-03-07 09:58:25,920.920 INFO:__main__:Traceback (most recent call last):
2019-03-07 09:58:25,920.920 INFO:__main__: File "/ceph/qa/tasks/mgr/test_dashboard.py", line 18, in setUp
2019-03-07 09:58:25,921.921 INFO:__main__: self._assign_ports("dashboard", "server_port")
2019-03-07 09:58:25,921.921 INFO:__main__: File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 182, in _assign_ports
2019-03-07 09:58:25,922.922 INFO:__main__: cls.mgr_cluster.mgr_stop(mgr_id)
2019-03-07 09:58:25,922.922 INFO:__main__: File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 30, in mgr_stop
2019-03-07 09:58:25,923.923 INFO:__main__: self.mgr_daemons[mgr_id].stop()
2019-03-07 09:58:25,924.924 INFO:__main__: File "../qa/tasks/vstart_runner.py", line 373, in stop
2019-03-07 09:58:25,924.924 INFO:__main__: self.daemon_type, self.daemon_id))
2019-03-07 09:58:25,925.925 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x

In mgr.x.log there is this error:

0> 2019-03-07 09:52:42.753 7f014909e700 -1 *** Caught signal (Aborted) **
in thread 7f014909e700 thread_name:mgrsb-fin
ceph version Development (no_version) nautilus (rc)
1: (()+0x9cc178) [0x55c47e5dd178]
2: (()+0x13030) [0x7f014eff2030]
3: (gsignal()+0x10f) [0x7f014e6a953f]
4: (abort()+0x127) [0x7f014e693895]
5: (()+0x7012e) [0x7f014f7f412e]
6: (PyGrammar_LabelRepr()+0) [0x7f014f7f418e]
7: (()+0x722ea) [0x7f014f7f62ea]
8: (Py_FinalizeEx()+0x13e) [0x7f014f9bf89e]
9: (PyModuleRegistry::shutdown()+0xa3) [0x55c47e43d263]
10: (()+0x7f3b02) [0x55c47e404b02]
11: (()+0x7f6932) [0x55c47e407932]
12: (boost::function1<void, int>::operator()(int) const+0x6c) [0x55c47e312a7c]
13: (FunctionContext::finish(int)+0x24) [0x55c47e30f936]
14: (Context::complete(int)+0x27) [0x55c47e30f7fd]
15: (Finisher::finisher_thread_entry()+0x38b) [0x7f0150e0a36d]
16: (Finisher::FinisherThread::entry()+0x1c) [0x55c47e3f1a30]
17: (Thread::entry_wrapper()+0x78) [0x7f0150e7659e]
18: (Thread::_entry_func(void*)+0x18) [0x7f0150e7651c]
19: (()+0x858e) [0x7f014efe758e]
20: (clone()+0x43) [0x7f014e76e6a3]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Files

test-standby-api-tests.log (323 KB) test-standby-api-tests.log Alfonso Martínez, 03/07/2019 10:18 AM
test-standby-mgr-x.log (175 KB) test-standby-mgr-x.log Alfonso Martínez, 03/07/2019 10:22 AM
Actions #2

Updated by Alfonso Martínez about 5 years ago

  • Description updated (diff)
Actions #3

Updated by Laura Paduano about 5 years ago

  • Affected Versions v14.0.0 added
Actions #4

Updated by Sage Weil about 5 years ago

  • Affected Versions deleted (v14.0.0)

Full backtrace with debugging symbols:

#0  0x00007f93ff93becf in raise () from /lib64/libpthread.so.0
#1  0x00005610c711df07 in reraise_fatal (signum=6) at /home/sage/src/ceph/src/global/signal_handler.cc:81
#2  0x00005610c711ee2b in handle_fatal_signal (signum=6) at /home/sage/src/ceph/src/global/signal_handler.cc:298
#3  <signal handler called>
#4  0x00007f93feff353f in raise () from /lib64/libc.so.6
#5  0x00007f93fefdd895 in abort () from /lib64/libc.so.6
#6  0x00007f940013d12e in fatal_error (prefix=prefix@entry=0x0, msg=msg@entry=0x7f940034d090 "PyInterpreterState_Delete: remaining subinterpreters", status=status@entry=-1)
    at /usr/src/debug/python3-3.7.2-4.fc29.x86_64/Python/pylifecycle.c:2179
#7  0x00007f940013d18e in Py_FatalError (msg=msg@entry=0x7f940034d090 "PyInterpreterState_Delete: remaining subinterpreters") at /usr/src/debug/python3-3.7.2-4.fc29.x86_64/Python/pylifecycle.c:2189
#8  0x00007f940013f2ea in PyInterpreterState_Delete (interp=0x5610c9e98b00) at /usr/src/debug/python3-3.7.2-4.fc29.x86_64/Python/pystate.c:252
#9  0x00007f940030889e in Py_FinalizeEx () at /usr/src/debug/python3-3.7.2-4.fc29.x86_64/Python/pylifecycle.c:1311
#10 0x00005610c6f7e263 in PyModuleRegistry::shutdown (this=0x7fff2fcf5590) at /home/sage/src/ceph/src/mgr/PyModuleRegistry.cc:255
#11 0x00005610c6f45b02 in MgrStandby::<lambda(int)>::operator()(int) const (__closure=0x5610cd033a50) at /home/sage/src/ceph/src/mgr/MgrStandby.cc:289
#12 0x00005610c6f48932 in boost::detail::function::void_function_obj_invoker1<MgrStandby::shutdown()::<lambda(int)>, void, int>::invoke(boost::detail::function::function_buffer &, int) (function_obj_ptr=..., a0=0)
    at /nvm/src/ceph/build/boost/include/boost/function/function_template.hpp:159
#13 0x00005610c6e53a7c in boost::function1<void, int>::operator() (this=0x5610cd033a48, a0=0) at /nvm/src/ceph/build/boost/include/boost/function/function_template.hpp:768
#14 0x00005610c6e50936 in FunctionContext::finish (this=0x5610cd033a40, r=0) at /home/sage/src/ceph/src/include/Context.h:487
#15 0x00005610c6e507fd in Context::complete (this=0x5610cd033a40, r=0) at /home/sage/src/ceph/src/include/Context.h:77
#16 0x00007f940176836d in Finisher::finisher_thread_entry (this=0x7fff2fcf5378) at /home/sage/src/ceph/src/common/Finisher.cc:67
#17 0x00005610c6f32a30 in Finisher::FinisherThread::entry (this=0x7fff2fcf5498) at /home/sage/src/ceph/src/common/Finisher.h:62
#18 0x00007f94017d459e in Thread::entry_wrapper (this=0x7fff2fcf5498) at /home/sage/src/ceph/src/common/Thread.cc:84
#19 0x00007f94017d451c in Thread::_entry_func (arg=0x7fff2fcf5498) at /home/sage/src/ceph/src/common/Thread.cc:71
#20 0x00007f93ff93158e in start_thread () from /lib64/libpthread.so.0
#21 0x00007f93ff0b86a3 in clone () from /lib64/libc.so.6

Note the message "PyInterpreterState_Delete: remaining subinterpreters".

I think shutdown() is just wrong. The comment there:

  // Ideally, now, we'd be able to do this for all modules:
  //
  //    Py_EndInterpreter(pMyThreadState);
  //    PyThreadState_Swap(pMainThreadState);
  //
  // Unfortunately, if the module has any other *python* threads active
  // at this point, Py_EndInterpreter() will abort with:
  //
  //    Fatal Python error: Py_EndInterpreter: not the last thread
  //
  // This can happen when using CherryPy in a module, because CherryPy
  // runs an extra thread as a timeout monitor, which spends most of its
  // life inside a time.sleep(60).  Unless you are very, very lucky with
  // the timing calling this destructor, that thread will still be stuck
  // in a sleep, and Py_EndInterpreter() will abort.
  //
  // This could of course also happen with a poorly written module which
  // made no attempt to clean up any additional threads it created.
  //
  // The safest thing to do is just not call Py_EndInterpreter(), and
  // let Py_Finalize() kill everything after all modules are shut down.
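
For reference, here is a minimal C++ sketch of the two teardown paths that comment contrasts, written against the CPython C API. It is hypothetical (the function names and structure are not the actual PyModuleRegistry code) and only illustrates where each path can hit a fatal error:

#include <Python.h>

// Option A: end each sub-interpreter explicitly.  The calling thread must
// hold the GIL with the sub-interpreter's thread state current.  If any
// other Python thread created inside that interpreter is still alive
// (e.g. CherryPy's timeout monitor asleep in time.sleep(60)), CPython
// aborts with "Fatal Python error: Py_EndInterpreter: not the last thread".
void end_subinterpreter(PyThreadState *sub_ts, PyThreadState *main_ts)
{
  PyEval_RestoreThread(sub_ts);   // take the GIL, make sub_ts current
  Py_EndInterpreter(sub_ts);      // destroys sub_ts and its interpreter
  PyThreadState_Swap(main_ts);    // switch back to the main interpreter
  PyEval_SaveThread();            // release the GIL again
}

// Option B: skip Py_EndInterpreter() and rely on finalization.  The catch,
// as the backtrace above shows: Py_FinalizeEx() ends up in
// PyInterpreterState_Delete(), which raises
// Py_FatalError("PyInterpreterState_Delete: remaining subinterpreters")
// if any sub-interpreter state still exists at that point.
int finalize_everything(PyThreadState *main_ts)
{
  PyEval_RestoreThread(main_ts);
  return Py_FinalizeEx();         // < 0 if finalization ran into errors
}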

Actions #5

Updated by Sage Weil about 5 years ago

  • Status changed from New to 12
  • Priority changed from Normal to Urgent
Actions #6

Updated by Sage Weil about 5 years ago

I added some debugging:

diff --git a/src/mgr/PyModule.cc b/src/mgr/PyModule.cc
index 53d9e2b95f..498e3cebb1 100644
--- a/src/mgr/PyModule.cc
+++ b/src/mgr/PyModule.cc
@@ -322,6 +323,8 @@ int PyModule::load(PyThreadState *pMainThreadState)
     Gil gil(sts);

     auto thread_state = Py_NewInterpreter();
+    lderr(g_ceph_context) << __func__ << " new interpreter " << thread_state << dendl;
+
     if (thread_state == nullptr) {
       derr << "Failed to create python sub-interpreter for '" << module_name << '"' << dendl;
       return -EINVAL;
@@ -716,6 +719,10 @@ PyModule::~PyModule()
 {
   if (pMyThreadState.ts != nullptr) {
     Gil gil(pMyThreadState, true);
+    derr << "~PyModule " << module_name << " interpreter " << pMyThreadState.ts
+        << " pClass refs " << (pClass ? Py_REFCNT(pClass) : 0)
+        << " pStandbyClass refs " << (pStandbyClass ? Py_REFCNT(pStandbyClass) : 0)
+        << dendl;
     Py_XDECREF(pClass);
     Py_XDECREF(pStandbyClass);
   }

and get this:
   -87> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x5638407f2120
   -86> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule zabbix interpreter 0x5638434525a0 pClass refs 4 pStandbyClass refs 0
   -85> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x5638407f2120
   -84> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563843452480
   -83> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule volumes interpreter 0x563843452360 pClass refs 0 pStandbyClass refs 0
   -82> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563843452480
   -81> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563843452480
   -80> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule test_orchestrator interpreter 0x563843452120 pClass refs 4 pStandbyClass refs 0
   -79> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563843452480
   -78> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563843452000
   -77> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule telemetry interpreter 0x563841dd5e60 pClass refs 4 pStandbyClass refs 0
   -76> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563843452000
   -75> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563843452480
   -74> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule telegraf interpreter 0x563841dd5c20 pClass refs 4 pStandbyClass refs 0
   -73> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563843452480
   -72> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563841dd5b00
   -71> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule status interpreter 0x563841dd58c0 pClass refs 4 pStandbyClass refs 0
   -70> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563841dd5b00
   -69> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr Gil Switched to new thread state 0x563841dd4a20
   -68> 2019-03-07 11:02:18.584 7f119c10d700 -1 mgr[py] ~PyModule ssh interpreter 0x563841dd4c60 pClass refs 4 pStandbyClass refs 0
   -67> 2019-03-07 11:02:18.584 7f119c10d700 20 mgr ~Gil Destroying new thread state 0x563841dd4a20
...

I have no idea where those refs are coming from.
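
One way to chase those counts, as a purely hypothetical debugging aid (not something in the tree), would be to ask Python's gc module who still references pClass, with the module's GIL held, right before the Py_XDECREF calls in ~PyModule:

#include <Python.h>
#include <cstdio>

// Hypothetical helper: print repr(gc.get_referrers(obj)) to stderr.
// The caller must hold the GIL of the interpreter that owns obj.
static void dump_referrers(PyObject *obj)
{
  PyObject *gc = PyImport_ImportModule("gc");
  if (!gc) {
    PyErr_Clear();
    return;
  }
  PyObject *referrers = PyObject_CallMethod(gc, "get_referrers", "(O)", obj);
  if (referrers) {
    PyObject *repr = PyObject_Repr(referrers);
    if (repr) {
      const char *s = PyUnicode_AsUTF8(repr);
      fprintf(stderr, "referrers: %s\n", s ? s : "<unprintable>");
      Py_DECREF(repr);
    }
    Py_DECREF(referrers);
  } else {
    PyErr_Clear();
  }
  Py_DECREF(gc);
}

One caveat: gc.get_referrers only sees objects tracked by the cycle collector, so references held from C (for example the registry's own pointers) would not show up even though they still inflate Py_REFCNT.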

FWIW I didn't notice these crashes until I switched to python3, but apparently they can trigger on python 2 too. In my current python3 fc29 environment I trivially trigger this with vstart + kill `cat out/mgr.x.pid`.

Actions #7

Updated by Sage Weil about 5 years ago

  • Priority changed from Urgent to High

Merged the workaround (which does an immediate exit from mgr).

I think we want to fix this properly by allowing the modules to shut down, though, so we can reenable leak detection in the mgr.
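
A rough sketch of the workaround's idea (hypothetical, not the actual merged patch): once the mgr has flushed what it needs, exit the process without tearing down the Python interpreters at all, so neither Py_EndInterpreter() nor Py_FinalizeEx() ever gets a chance to abort. The cost is that nothing is freed, which is why leak detection stays disabled until modules can shut down cleanly.

#include <unistd.h>

void mgr_shutdown_workaround()
{
  _exit(0);   // skip Python finalization entirely; nothing is leak-checked
}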

Actions #8

Updated by Sage Weil about 5 years ago

  • Subject changed from ceph-mgr got SIGTERM but it keeps alive when running API tests based on teuthology to py modules fail to cleanly shut down
Actions #10

Updated by Lenz Grimmer about 5 years ago

  • Category changed from 151 to ceph-mgr
Actions #11

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #12

Updated by Sage Weil over 4 years ago

  • Status changed from New to Won't Fix