Bug #8797
closed
"ceph status" do not exit with python_2.7.8
Added by Dmitry Smirnov almost 10 years ago. Updated about 9 years ago.
Description
As reported in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754341
after upgrade to python 2.7.8 "`ceph -s`" hangs instead of returning to shell.
Updated by Dmitry Smirnov almost 10 years ago
- Priority changed from Normal to High
Looks like a librados thread is still active (not terminated) in "rados.py" and `ceph` is waiting for it indefinitely... I'm illiterate in Python so I can't analyse further...
Updated by Dmitry Smirnov almost 10 years ago
- Priority changed from High to Normal
This bug prevents Ceph daemons from starting...
Debian "testing" is already affected, since the new Python has already propagated there...
Please advise.
Updated by Dmitry Smirnov almost 10 years ago
- Priority changed from Normal to High
Updated by Dmitry Smirnov almost 10 years ago
Please be advised that this issue appears to be a regression in Python 2.7.8 (see details in the Debian bug report).
Updated by Dan Mick almost 10 years ago
Fascinating info so far, Dmitry, thanks for your work on this. Anxious to see what the Python team thinks of the assertion that it's a Python issue.
Updated by Alfredo Deza almost 10 years ago
I believe that we should attempt to replicate the problem first, as I know the Python ticket will get ignored unless there is a way to reproduce it (other than installing ceph with that Python version and calling `ceph -s`).
Currently looking at `rados.py` and the `run_in_thread` function looks like a good candidate to start with:
    def run_in_thread(target, args, timeout=0):
        interrupt = False

        countdown = timeout
        t = RadosThread(target, args)

        # allow the main thread to exit (presumably, avoid a join() on this
        # subthread) before this thread terminates.  This allows SIGINT
        # exit of a blocked call.  See below.
        t.daemon = True

        t.start()
        try:
            # poll for thread exit
            while t.is_alive():
                t.join(POLL_TIME_INCR)
                if timeout and t.is_alive():
                    countdown = countdown - POLL_TIME_INCR
                    if countdown <= 0:
                        raise KeyboardInterrupt

            t.join()        # in case t exits before reaching the join() above
        except KeyboardInterrupt:
            # ..but allow SIGINT to terminate the waiting.  Note: this
            # relies on the Linux kernel behavior of delivering the signal
            # to the main thread in preference to any subthread (all that's
            # strictly guaranteed is that *some* thread that has the signal
            # unblocked will receive it).  But there doesn't seem to be
            # any interface to create t with SIGINT blocked.
            interrupt = True

        if interrupt:
            t.retval = -errno.EINTR
        return t.retval
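For anyone trying to reproduce this outside of Ceph, the polling-join pattern used by `run_in_thread` can be sketched in isolation. This is only an illustration written for Python 3, not the code in `rados.py`: `WorkerThread`, `run_with_timeout` and the `POLL_TIME_INCR` value are hypothetical stand-ins, and the timeout is tracked with a monotonic deadline instead of the countdown used above.

```python
import errno
import threading
import time

POLL_TIME_INCR = 0.05  # assumed polling interval in seconds (not the rados.py value)


class WorkerThread(threading.Thread):
    """Hypothetical stand-in for RadosThread: runs a callable and stores its result."""

    def __init__(self, target, args):
        super().__init__()
        self._fn = target
        self._fn_args = args
        self.retval = None

    def run(self):
        self.retval = self._fn(*self._fn_args)


def run_with_timeout(target, args, timeout=0):
    """Run target in a daemon thread, polling so the main thread stays interruptible."""
    t = WorkerThread(target, args)
    t.daemon = True  # do not make interpreter exit wait for this thread
    t.start()
    deadline = time.monotonic() + timeout if timeout else None
    try:
        while t.is_alive():
            # A short join() returns regularly, so SIGINT can be handled here.
            t.join(POLL_TIME_INCR)
            if deadline is not None and t.is_alive() and time.monotonic() >= deadline:
                return -errno.EINTR  # give up waiting; the thread is abandoned
    except KeyboardInterrupt:
        return -errno.EINTR
    return t.retval


fast_result = run_with_timeout(lambda a, b: a + b, (2, 3))
slow_result = run_with_timeout(time.sleep, (2,), timeout=0.2)
```

Here `fast_result` is the callable's return value, while `slow_result` is `-errno.EINTR` because the wait timed out and the daemon thread was left running, which is the situation where interpreter shutdown behaviour starts to matter.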
Updated by Dmitry Smirnov over 9 years ago
For the moment, the Python maintainer in Debian has kindly fixed this issue for us by adding a patch that reverts the problematic change in Python.
However, this is a time bomb, as it (potentially) affects Ceph on all architectures and on distributions outside Debian.
Please follow up with the Python developers or make changes for compatibility with Python-2.7.8.
This is a very serious issue because no cluster components can be started with vanilla Python-2.7.8.
Updated by Boris Ranto over 9 years ago
Just a note that people are hitting this in Fedora 21 now:
Updated by Dan Mick over 9 years ago
This works around the problem, but it also destroys the exit code from the ceph program. If you rely on the exit code, this won't help, but it will at least let the command exit:
change the last line of ceph, 'sys.exit(main())', to

    main()
    os.kill(os.getpid(), 15)
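To make the cost of this workaround concrete, here is a small sketch, assuming a POSIX system and Python 3, that compares a child process exiting normally with one that signals itself with SIGTERM (15). The child snippets and the status value 3 are made up for illustration and are not taken from the `ceph` script.

```python
import signal
import subprocess
import sys

# Child that exits normally and reports status 3 to its parent.
normal = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])

# Child that has the same status in hand but signals itself instead,
# mirroring the os.kill(os.getpid(), 15) workaround.
child_code = "import os, signal; status = 3; os.kill(os.getpid(), signal.SIGTERM)"
killed = subprocess.run([sys.executable, "-c", child_code])

normal_status = normal.returncode   # the status the child chose
killed_status = killed.returncode   # negative signal number on POSIX
```

On POSIX, `subprocess` reports death by signal as a negative return code, so the caller sees the signal number and not the status that `main()` would have returned.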
Updated by Joe Julian over 9 years ago
In order to get the exit code, I tried this:
    result = main()
    del cluster_handle
    sys.exit(result)
Which resulted in a core dump:
Illegal instruction (core dumped)
Which, of course, was the same result if I tried cluster_handle.shutdown() (since Rados.__del__ does the same call).
I was able to work around this by removing the Rados.__del__ function from rados.py. This allowed the thread to be at least abandoned with the above code and sys.exit to conclude.
Updated by Joe Julian over 9 years ago
The SIGILL was cured in master with the application of 92615ea and cf2104d. I've tested backporting these to firefly which allowed shutdown to be called without crashing.
It is still necessary to `del cluster_handle` to avoid the hang, which I think occurs because threads are not guaranteed to be garbage collected on sys.exit.
PR #3053
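As a general illustration of the approach discussed here (explicit shutdown before exit, not cleanup in `__del__`), a minimal Python 3 sketch follows. `FakeClusterHandle` and this `main` are hypothetical and do not model librados or the real `ceph` script; the point is only that the handle is shut down deterministically before the exit code is handed to `sys.exit`.

```python
import threading


class FakeClusterHandle:
    """Hypothetical stand-in for a cluster handle that owns a background thread."""

    def __init__(self):
        self._stop_event = threading.Event()
        # Non-daemon worker: the interpreter would wait for it at exit.
        self._worker = threading.Thread(target=self._stop_event.wait)
        self._worker.start()
        self.closed = False

    def shutdown(self):
        """Stop the background thread; safe to call more than once."""
        if not self.closed:
            self._stop_event.set()
            self._worker.join()
            self.closed = True


def main(handle):
    return 0  # placeholder for the real command logic


handle = FakeClusterHandle()
try:
    result = main(handle)
finally:
    handle.shutdown()  # explicit cleanup; nothing is left for __del__ to do
worker_alive = handle._worker.is_alive()
# A real script would finish with sys.exit(result), preserving the exit code.
```

Because `shutdown()` runs in a `finally` block, it does not depend on when, or whether, the garbage collector finalizes the handle during interpreter exit.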
Updated by Dan Mick over 9 years ago
I think the right fix for this is to remove Rados.__del__. I'll come up with a pull request unless you want to, Joe.
Updated by Loïc Dachary over 9 years ago
- Backport changed from firefly to firefly,giant
Updated by Loïc Dachary about 9 years ago
- Status changed from Pending Backport to Resolved
Updated by Loïc Dachary about 9 years ago
- merged in master with https://github.com/ceph/ceph/pull/3119
Updated by Loïc Dachary about 9 years ago
- merged in giant by https://github.com/ceph/ceph/pull/3168
Updated by Loïc Dachary about 9 years ago
- Status changed from Resolved to Pending Backport
Updated by Loïc Dachary about 9 years ago
Updated by Loïc Dachary about 9 years ago
- Status changed from Pending Backport to Resolved