Project

General

Profile

Actions

Bug #38407

closed

Funny issues with python sub-interpreters

Added by Sebastian Wagner about 5 years ago. Updated over 2 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
python interface
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Orchestrator is now running into funny issues with python sub-interpreters.

https://modwsgi.readthedocs.io/en/develop/user-guides/application-issues.html#multiple-python-sub-interpreters

Stuff like:

Error EINVAL: Traceback (most recent call last):
  File "/home/sebastian/Repos/ceph/src/pybind/mgr/mgr_module.py", line 862, in _handle_command
    return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/home/sebastian/Repos/ceph/src/pybind/mgr/mgr_module.py", line 337, in call
    return self.func(mgr, **kwargs)
  File "/home/sebastian/Repos/ceph/src/pybind/mgr/orchestrator_cli/module.py", line 23, in handle_exception
    return func(*args, **kwargs)
  File "/home/sebastian/Repos/ceph/src/pybind/mgr/orchestrator_cli/module.py", line 93, in _list_devices
    completion.raise_if_exception()
  File "/home/sebastian/Repos/ceph/src/pybind/mgr/orchestrator.py", line 99, in raise_if_exception
    dumped_exception = pickle.dumps(self.exception)
  File "/usr/lib/python2.7/pickle.py", line 1380, in dumps
    Pickler(file, protocol).dump(obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python2.7/pickle.py", line 400, in save_reduce
    save(func)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 748, in save_global
    __import__(module)
RuntimeError: cannot unmarshal code objects in restricted execution mode

or broken ininstance() calls within the orchestrator.

This is not going to be sustainable in the long term.


Related issues 3 (2 open1 closed)

Related to mgr - Bug #37472: Cython 0.29 removed support for subinterpreters: raises "ImportError: Interpreter change detected ..."ResolvedRicardo Dias

Actions
Related to mgr - Bug #45574: subinterpreters: ceph/mgr/rook RuntimeError on import of RookOrchestrator - ceph cluster does not startNew

Actions
Blocked by mgr - Cleanup #63294: mgr: enable per-subinterpreter GIL (Python >= 3.12) New

Actions
Actions #1

Updated by Tim Serong about 5 years ago

I think the original intent was only to pass data around between modules, not things that could be executed (see the commit message in https://github.com/ceph/ceph/pull/22951/commits/f02316adb4baf4dceaab79cc0ef4c2acdb544f3e)

Actions #2

Updated by Sebastian Wagner about 5 years ago

Tim Serong wrote:

I think the original intent was only to pass data around between modules

yeah, you don't even need to execute code. passing objects is enough to get into the realm of funny things

Actions #3

Updated by Tim Serong about 5 years ago

Is that just for objects that contain code (i.e. methods), or objects that only contain data? In any case, I think we probably just can't pass objects around between modules, and need to stick to data only.

We can't fix this by removing subinterpreters and going back to a single interpreter that runs all mgr modules, because that will mean that various mgr module dependencies conflict with each other: think two mgr modules that both use cherrypy -- if these are run in a single interpreter, only one will work.

If we do remove subinterepters in future, I believe we will have to switch to an even more isolated structure (e.g. a separate process per module with some form of IPC (or shared memory? not sure)), which also wouldn't support passing objects around.

Actions #4

Updated by Sebastian Wagner about 5 years ago

Tim Serong wrote:

Is that just for objects that contain code (i.e. methods), or objects that only contain data? In any case, I think we probably just can't pass objects around between modules, and need to stick to data only.

there are actually at least two parts of this problem:

  1. Passing user defined types around is enough to cause problems with module identity. Anything that depends on module identity breaks: try: except, isinstance, issubclass, etc...
  2. Executing code outside of it's sub-interpreter context, e.g. all code called by remote, like all methods in deepsea.module.DeepSeaOrchestrator that implement methods of Prchestrator including wait: They run in that infamous "Reduced Execution Environment"

We can't fix this by removing subinterpreters and going back to a single interpreter that runs all mgr modules, because that will mean that various mgr module dependencies conflict with each other: think two mgr modules that both use cherrypy -- if these are run in a single interpreter, only one will work.

I've had some bad experience in having multiple sub-interpreters running HTTPS webservers using a common OpenSSL library. I don't know if we really want to have so many different web servers executing within the mgr.

If we do remove subinterepters in future, I believe we will have to switch to an even more isolated structure (e.g. a separate process per module with some form of IPC (or shared memory? not sure)), which also wouldn't support passing objects around.

Maybe that's already enough? Like don't try to pass any Python objects around and don't directly call any functions and instead perform some RPC between sub-interpreters.

Actions #5

Updated by Tim Serong about 5 years ago

Sebastian Wagner wrote:

We can't fix this by removing subinterpreters and going back to a single interpreter that runs all mgr modules, because that will mean that various mgr module dependencies conflict with each other: think two mgr modules that both use cherrypy -- if these are run in a single interpreter, only one will work.

I've had some bad experience in having multiple sub-interpreters running HTTPS webservers using a common OpenSSL library. I don't know if we really want to have so many different web servers executing within the mgr.

cherrypy was just an example - the general problem is that if two mgr modules have some python module as a dependency which in turn has some global state, it's going to break.

If we do remove subinterepters in future, I believe we will have to switch to an even more isolated structure (e.g. a separate process per module with some form of IPC (or shared memory? not sure)), which also wouldn't support passing objects around.

Maybe that's already enough? Like don't try to pass any Python objects around and don't directly call any functions and instead perform some RPC between sub-interpreters.

If we're going to develop an RPC solution, I'd suggest waiting until we get rid of subinterpreters and switch to multiple processes. In the meantime, I don't see the benefit of developing some sort of RPC system to use with subinterpreters, over, say, just passing simple data types around.

Actions #6

Updated by Lenz Grimmer about 5 years ago

  • Related to Cleanup #38467: Audit other functions in src/mgr/ActivePyModules.cc for thread safety in light of deadlock seen in #35985 added
Actions #7

Updated by Lenz Grimmer about 5 years ago

  • Related to Bug #35985: deadlock in standby ceph-mgr daemons added
Actions #8

Updated by Sebastian Wagner about 5 years ago

I just don't know if multiple processes are the best way. One big advantage by using one mgr process is the ability to have shared mgr data structures in the mgr memory that we don't need to replicate and synchronize between modules.

Actions #9

Updated by Brad Hubbard about 5 years ago

#35985 and #38467 are about thread deadlocks. What evidence do you have that they are related to the issue reported here?

Actions #10

Updated by Tim Serong about 5 years ago

Sebastian Wagner wrote:

I just don't know if multiple processes are the best way. One big advantage by using one mgr process is the ability to have shared mgr data structures in the mgr memory that we don't need to replicate and synchronize between modules.

I agree completely. But I suspect that eventually we'll have to ditch subinterpreters and split into separate processes, because subinterpreters are a relatively underused/undertested/underdocumented feature of python, and I don't want to have to do more weird things like https://github.com/ceph/ceph/pull/25585 to keep everything working :-)

Actions #11

Updated by Sebastian Wagner about 5 years ago

I don't want to have to do more weird things like https://github.com/ceph/ceph/pull/25585 to keep everything working :-)

And I also completely agree wit this :-)

Actions #12

Updated by Sebastian Wagner about 5 years ago

Brad Hubbard wrote:

#35985 and #38467 are about thread deadlocks. What evidence do you have that they are related to the issue reported here?

I don't think they are related.

Actions #13

Updated by Sebastian Wagner about 5 years ago

  • Related to Bug #37472: Cython 0.29 removed support for subinterpreters: raises "ImportError: Interpreter change detected ..." added
Actions #14

Updated by Brad Hubbard about 5 years ago

Sebastian Wagner wrote:

Brad Hubbard wrote:

#35985 and #38467 are about thread deadlocks. What evidence do you have that they are related to the issue reported here?

I don't think they are related.

But you attached them as "Related issues" ?

Actions #15

Updated by Sebastian Wagner about 5 years ago

  • Related to deleted (Cleanup #38467: Audit other functions in src/mgr/ActivePyModules.cc for thread safety in light of deadlock seen in #35985)
Actions #16

Updated by Sebastian Wagner about 5 years ago

  • Related to deleted (Bug #35985: deadlock in standby ceph-mgr daemons)
Actions #17

Updated by Sebastian Wagner about 5 years ago

removed unrelated issues.

Actions #19

Updated by Sebastian Wagner almost 3 years ago

  • Related to Bug #45574: subinterpreters: ceph/mgr/rook RuntimeError on import of RookOrchestrator - ceph cluster does not start added
Actions #20

Updated by Sebastian Wagner over 2 years ago

  • Status changed from New to Can't reproduce
Actions #21

Updated by Ernesto Puerta 6 months ago

  • Blocked by Cleanup #63294: mgr: enable per-subinterpreter GIL (Python >= 3.12) added
Actions

Also available in: Atom PDF