Project

General

Profile

Bug #13115

librbd python bindings eat exceptions in callbacks

Added by Hector Martin over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
Start date:
09/16/2015
Due date:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

If an exception happens in a Python callback (e.g. the argument to rbd_image.diff_iterate), ctypes prints it out to stderr and eats it, then returns garbage instead.

I've investigated if there is a way around this, but haven't found one; even if the Python side attempts to wrap the called code in a try/except and do something sane with exceptions, there is always a window outside the try block where exceptions may arrive (e.g. an asynchronous KeyboardInterrupt caused by SIGINT) that would result in silently missed callbacks with no indication to the caller. In practice, this means diff_iterate cannot be relied upon since a successful result might hide an exception thrown by a callback.

Here's a simple testcase of the problem with ctypes and exceptions:

from ctypes import *

def cb():
    raise KeyboardInterrupt()

CB = CFUNCTYPE(c_int)
cb_c = CB(cb)
res = cb_c()
print "result=", res

Callback exceptions should be converted to a negative (error) return back into diff_iterate, and then when diff_iterate returns an error back to its caller, it should propagate the underlying Python exception to the calling Python code. It doesn't seem this is possible with ctypes at the moment.

I am investigating working around this by building my own librbd Python bindings using Cython (which we already do for librados for other reasons), which should make this implementation possible (and, incidentally, Cython is also faster than ctypes). Would the Ceph project be interested in having this upstreamed as a replacement for the ctypes bindings (particularly if it's API-compatible)?

rbd.pyx - Rough version of Cython bindings for librbd (42.6 KB) Hector Martin, 09/18/2015 08:12 AM

History

#1 Updated by Josh Durgin over 3 years ago

cython looks pretty interesting, with its finer control via sig_[on|off] and 'with [no]gil'.

It sounds like sig_on/off would let us get rid of all the run_in_thread() nonsense in the current librados bindings - that was mainly done to allow KeyboardInterrupt to work for e.g. /usr/bin/ceph.

I'd be interested in cython versions of these bindings. We do want to keep api compatibility, since things like openstack use them. I'm also curious how cython interacts with eventlet and other greenlet threading approaches. If using nogil around the librbd/librados calls lets other greenlets run, that would be even better than the current bindings.

#2 Updated by Hector Martin over 3 years ago

I spent some time yesterday doing a straightforward conversion of rbd.py to Cython. This keeps the same API and passes all the tests in test_rbd.py, except for one which depends on a feature that is missing in librbd in Hammer (which I'm testing on, will fix later).

Attached is a rough cut of the code. You can diff it against rbd.py; the structure is the same as it's mostly a direct conversion.

Advantages:
  • diff_iterate works properly with exceptions
  • Saves one buffer copy in Image.read()
  • Presumably faster (since it compiles down to C code and avoids all of the ctypes/ffi overhead)
  • No need to hardcode/duplicate structure layouts and constant values
Disadvantages:
  • More verbose header bit due to having to list the librbd entrypoints (Cython enforces separate C function declaration from invocation, while in ctypes you can do it inline)
  • Does not gradefully degrade with older librbd versions: since it compiles down to an extension module that links in librbd, all supported functions/symbols need to exist (unless you start playing games with dlsym(), but that's probably not worth it). Would still be upwards-compatible with newer librbd versions (source API compatible/binary ABI compatible). The ctypes version could get away with figuring out what functions are supported at runtime and adapting.
  • Requires Cython for building (unless you ship the generated C source code with release source tarballs?)

This version doesn't do anything interesting with the gil yet, but that can be added/optimized later. It interoperates with the ctypes librados bindings (i.e. knows how to pull out the ioctx pointer).

If this looks reasonable, I'll integrate it into the Ceph build system and send out a PR for the librbd bit. How does that sound? Then I can convert librados in a separate PR.

Getting KeyboardInterrupt to work with blocking Ceph calls is a bit tricky in general though. If the librados API doesn't have a way to somehow signal a fast (but clean) abort from a signal handler, then you basically have two options: either just let the signal kill the entire process with no recourse, or longjmp out of librados and back into Cython (this is probably how Sage's sig_on/sig_off are implemented - those aren't part of Cython itself). The latter, though, means leaving librados in an undefined state, so although you could raise a KeyboardInterrupt back to the user, you can't realistically expect to be able to continue to use librados after that, or at least the affected rados_t if librados doesn't have any global state (does it?), and you have to accept leaking all associated memory and threads since safe cleanup would be impossible.

One safe way to do it would be to set a poison flag in the rados object instance to force every operation to fail after such an interrupt, and make this feature optional for the user (e.g. you can choose to let librados calls block forever and support clean SIGINTs, or make them interruptible as long as you give up the instance after that and accept leaking it, and only if librados has no global state). I see the current librados bindings spawn a thread, and if an interrupt arrives, just forget about it... which kind of works but I can't imagine it being very safe (if that thread keeps on running forever other things might break; if librados is thread-safe then that could cause subsequent calls to block anyway, and if it isn't thread-safe then subsequent calls might crash). If the primary use case is really just letting things like /usr/bin/ceph die immediately on ^C though, then that's easy enough to do by just replacing Python's signal handler around librados calls and letting it kill the process.

Using nogil won't make eventlet work; that would require having librados/librbd themselves built upon some kind of C greenlet library or set of hooks that can interact with the Python eventlet implementation or something like that, since you basically need to use asynchronous versions of everything and yield to the greenlet scheduler when calls block. Short of that, you have to use real threads (but at least nogil makes it easy to play nice with plain Python threads).

#3 Updated by Josh Durgin over 3 years ago

Looks great! Personally I find the cython more readable too.

I'm not worried about older librbd versions. When this eventually becomes a pypi package, we'll just need to be careful about versioning to make it easy to tell which binding works with your librbd. For system packages, the python bindings can simply depend on the same verion of librbd they are built with.

I see what you mean about the gil handling. That makes sense, and means things using eventlet will need to keep using wrappers to avoid blocking. It'd be good to test the cython bindings with eventlet.tpool.Proxy() wrapping long calls like remove(), which openstack is using (e.g. https://github.com/openstack/cinder/blob/f586043fa969b9d1dcf4933aacbf615f53691093/cinder/volume/drivers/rbd.py#L303)

WRT signal handling we only added the threading there to be able to ^C the ceph cli. I'm not really worried about cleanup after SIGINT here - no good use case, and it would be a ton of effort. If temporarily replacing python's signal handler around the librados calls works, and doesn't have any bad side effects, that sounds great.

#4 Updated by Josh Durgin over 3 years ago

Any updates on this? It'd be good to convert to cython sooner than later since there's bound to be more changes to the existing bindings in the meantime.

#5 Updated by Hector Martin over 3 years ago

I had a couple weeks full of IRQs and am just getting back to Ceph work now; I have another Ceph cluster to deploy this week and will get back to the bindings immediately after that (shouldn't take long, I've automated almost everything already).

The current status is that I spent a couple evenings fighting to figure out how to integrate the Cython build process (using setuptools) with autotools, but it mostly seems to work. Next up I need to make sure I can build the various package flavors. As soon as I'm reasonably confident that the build process won't explode I'll put up a PR for review of the rbd portion of the bindings.

#6 Updated by Josh Durgin over 3 years ago

If you have a prototype of the build changes we can pull them into a branch to try building on several platforms via http://ceph.com/gitbuilder.cgi

#7 Updated by Hector Martin over 3 years ago

I just rebased the changes and updated the bindings based on the latest rbd.py from master, and I'm trying deb and rpm package builds now on Ubuntu 14.04 and CentOS 7. My big unknown is the build system integration, since I'm very much not an autotools guy and getting it to play nicely with setuptools seems kind of iffy. I'll send a branch your way as soon as I confirm that I am able to build vaguely sensible packages out of it.

If you want to see where I'm at, see this branch, but keep in mind that the commit is basically a dummy at this point, the authorship info is wrong, and I'm rewriting history all the time. I'm actively working on this now so expect progress.
https://github.com/marcan/ceph/tree/cython

#8 Updated by Josh Durgin over 3 years ago

I added a couple more things on top of your branch to https://github.com/ceph/ceph/commits/wip-cython-rbd and it builds and passes tests on my fedora system. Looking good! There are a few things to fix with package builds:

It looks like cwd needs to change or an absolute path used for setup.py - both centos 7 and ubuntu trusty fail with:

 cd ./pybind; VERSION_RELEASE="9.2.0" CPPFLAGS="-iquote \/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-9.2.0-863-gc341d59/src/include -D__CEPH__ -D_FILE_OFFSET_BITS=64 -D_THREAD_SAFE -D__STDC_FORMAT_MACROS -D_GNU_SOURCE -DCEPH_LIBDIR=\"/usr/lib\" -DCEPH_PKGLIBDIR=\"/usr/lib/ceph\" -DGTEST_USE_OWN_TR1_TUPLE=0 -D_REENTRANT " CFLAGS="-iquote \/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-9.2.0-863-gc341d59/src/include -Wall -Wtype-limits -Wignored-qualifiers -Winit-self -Wpointer-arith -Werror=format-security -fno-strict-aliasing -fsigned-char -rdynamic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions --param=ssp-buffer-size=4 -fPIE -fstack-protector -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -g -O2" LDFLAGS="-L\/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-9.2.0-863-gc341d59/src/.libs -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -latomic_ops " /usr/bin/python ./setup.py build \
--build-base /srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-9.2.0-863-gc341d59/src/build \
--verbose
/usr/bin/python: can't open file './setup.py': [Errno 2] No such file or directory
make[4]: *** [pybind-all] Error 2 

You can see the full output here:
http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-centos7-amd64-basic/log.cgi?log=c341d59732a98c3d770007478be71c6192ed26ce
http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-trusty-amd64-basic/log.cgi?log=c341d59732a98c3d770007478be71c6192ed26ce

A couple other things that I hadn't noticed before:

  • the docs gitbuilder will need to build the rbd python module now (sphinx imports it to read the docstrings) - to avoid a full build, maybe we can install a fixed version of librbd on that machine, and just run cythonize in the admin/build-doc script
  • to be used with python 3 it looks like we'd need to create python3-rbd packages (this can wait for a later pr)

#9 Updated by Josh Durgin over 3 years ago

Feel free to squash those commits into your branch btw

#10 Updated by Hector Martin over 3 years ago

I figured out what was up with setup.py. It wasn't about cwd, the issue was that 'make dist' was not including setup.py or rbd.pyx at all, and gitbuilder seems to use that to make a distribution tarball before building it. Pushed a commit to fix that and squashed everything else; feel free to send it off to gitbuilder and see how it fares. Good to know that it builds on Fedora. I'm trying to get a CentOS7 package build done locally now. Debian package builds already succeeded for me (a simple dpkg-buildpackage doesn't go through the make dist step so it didn't hit the problem that gitbuilder did).

For Python 3, yes, we'll need separate packages for the modules (but we would still need different packages for the pure Python code, no? Even if it works on both versions, the files still need to be present in both the py2 and py3 [site/dist]-packages dirs). The Cython source code should not need to change - Cython source and its output (the C source code) is compatible with both versions. For reference, rbd.pyx is written in the Cython language level 2 (so it behaves like Py2 source code).

By the way, let me know if you prefer the final branch to be broken out into somewhat logical commits or if you prefer it in one large commit. It's a mostly atomic change though (can't switch to Cython without making the build system changes at once), so it probably doesn't make sense to have intermediate commits as the intermediate state would not be buildable, but let me know if you prefer otherwise.

Looking at the docs gitbuilder now. I'm not sure about a fixed version of librbd - the Cython bindings will follow the librbd version rather closely, and so would fail to build any time a new librbd symbol is introduced and used in the Python bindings. Since getting docstrings should not require actually making any rbd calls, though, you can get around it with a horrible hack. Something along these lines:

cd src/pybind
echo >dummy_rbd.c
gcc -shared -o librbd.so dummy_rbd.c
CFLAGS="-iquote $PWD/../include" CPPFLAGS="-iquote $PWD/../include" LDFLAGS="-L$PWD -Wl,--no-as-needed" VERSION_RELEASE=1 python setup.py build_ext --inplace
nm rbd.so | grep 'U rbd_' | awk '{ print "void "$2"(void) {}" }' >dummy_rbd.c
gcc -shared -o librbd.so dummy_rbd.c
LD_LIBRARY_PATH=$PWD python -c 'import rbd; help(rbd)'

Is this too evil?

#11 Updated by Hector Martin over 3 years ago

RPM packages seem to build after the latest few commits, and debs should work too unless I broke something. Let me know what you think about the above hack. If that sounds reasonable then once implemented this should be ready for PR.

#12 Updated by Josh Durgin over 3 years ago

Good point about the python3 packages being required regardless of cython.

I'm fine with that docs hack. Avoiding waiting for a mostly full build for docs builds is worth it.

In terms of commits usually I like more smaller commits, but in this case there aren't a whole lot of separable changes, so do whatever you prefer.

I repushed your current branch to the main repo as wip-cython-rbd. You can see the results for ubuntu and centos at http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-trusty-amd64-basic/#origin/wip-cython-rbd and
http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-centos7-amd64-basic/#origin/wip-cython-rbd

#13 Updated by Josh Durgin over 3 years ago

Looks like the ubuntu build worked. On centos 7 it had a final error when checking the packages:

 Checking for unpackaged file(s): /usr/lib/rpm/check-files /srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILDROOT/ceph-10.0.0-642.g6ad14c4.el7.x86_64

    error: Installed (but unpackaged) file(s) found: 

/usr/lib64/python2.7/site-packages/rbd_install_files.txt 

#14 Updated by Hector Martin over 3 years ago

I got rid of rbd_install_files; it's only useful for "make uninstall" (to know what files to remove) but we don't even have an uninstall target for setup.py stuff anyway. Alternatively I guess there's probably some way to get the RPM build to ignore it, or just not install it/delete it if building an RPM.

I've added the hack to the docs build script. The 'cython' package needs to be installed on the builder box to make it work. Alternatively, it could be added to requirements.txt instead, but that would end up compiling Cython inside the virtualenv every time the docs are built (which, while not as bad as building all of Ceph, still takes more time than is worth).

I've also rebased on top of master, and ran a make check. I also ran nosetests on test_rbd.py manually and that passes. One thing I'm not too sure about is how the QA stuff invokes those tests: if Ceph is installed by that point then it will work, but if it relies on the vstart.sh style environment variables (particularly PYTHONPATH=./pybind) then that will need adjustment, as rbd.so ends up in src/build/lib.linux-x86_64-2.7 by default (or similar). This can be mitigated by using "setup.py build_ext --inplace" but that conflicts with the regular install mechanism. Another way is to just bring up a virtualenv and install the bindings into that.

Another minor change: instead of passing the version via an environment variable from the Makefile to setup.py (which breaks if you try to call it manually), now setup.py (trivially) parses src/ceph_ver.h to extract the version (which is more accurate anyway as it includes git info).

I squashed the changes and then broke them out into 4 commits; 3 of them make preparatory changes that should not break the build (just introduce dependencies/minor refactoring), while the fourth does the big switchover. If that looks good, let me know and I'll make a PR.

#15 Updated by Hector Martin over 3 years ago

Note: just amended and force pushed the last commit a couple times to fix some issues in the interaction between admin/build-doc and setup.py.

#16 Updated by Josh Durgin over 3 years ago

Looks good, definitely seems ready for a PR. I'm upgrading the docs gitbuilder - it was too old to have cython 0.20. Thanks for breaking up the commits like that - it's easier to review that way.

The integration tests use distro packages, so there should be no PYTHONPATH issues there.

#18 Updated by Josh Durgin over 3 years ago

  • Status changed from New to Resolved
  • Assignee set to Hector Martin
  • Source changed from other to Community (dev)

Also available in: Atom PDF