Bug #8797

"ceph status" does not exit with Python 2.7.8

Added by Dmitry Smirnov about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: ceph cli
Target version: -
Start date: 07/09/2014
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport: firefly,giant
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

As reported in

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754341

after upgrading to Python 2.7.8, `ceph -s` hangs instead of returning to the shell.


Related issues

Related to Ceph - Bug #10567: Python Ioctx should retain a reference to the cluster object Resolved 01/18/2015

Associated revisions

Revision 5ba9b8f2 (diff)
Added by Dan Mick over 4 years ago

rados.py: remove Rados.__del__(); it just causes problems

Recent versions of Python contain a change to thread shutdown that
causes ceph to hang on exit; see http://bugs.python.org/issue21963.
As it turns out, this is relatively easy to avoid by not spawning
threads on exit, as Rados.__del__() will certainly do by calling
shutdown(); I suspect, but haven't proven, that the problem is
that shutdown() tries to start() a threading.Thread() that never
makes it all the way back to signal start().

Also add a PendingReleaseNote and extra doc comments to clarify.

Fixes: #8797
Signed-off-by: Dan Mick <>
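
The idea in this commit message, releasing the handle explicitly instead of relying on a finalizer that runs at interpreter exit, can be sketched as follows. This is only an illustration: `ClusterHandle` is a hypothetical stand-in and not the real `rados.Rados` class, and its `shutdown()` starts a thread purely to mimic the behaviour described above.

```python
import threading


class ClusterHandle:
    """Hypothetical stand-in for a handle whose shutdown needs a worker thread."""

    def __init__(self):
        self.state = "connected"

    def shutdown(self):
        # Idempotent: calling it twice is harmless.
        if self.state != "connected":
            return
        # Starting a thread is safe here because the interpreter is still
        # fully alive; the hang described above arises when this happens
        # from a finalizer during interpreter exit.
        worker = threading.Thread(target=self._do_shutdown)
        worker.start()
        worker.join()

    def _do_shutdown(self):
        self.state = "shutdown"

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.shutdown()
        return False  # do not swallow exceptions

    # Deliberately no __del__: cleanup is the caller's responsibility.


def main():
    with ClusterHandle() as handle:
        pass  # work with the handle here
    return handle.state
```

With this shape, cleanup happens at a well-defined point before `sys.exit()` is reached, so nothing needs to start a thread while the interpreter is shutting down.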

Revision ed8c9af3 (diff)
Added by Dan Mick over 4 years ago

rados.py: remove Rados.__del__(); it just causes problems

(cherry picked from commit 5ba9b8f21f8010c59dd84a0ef2acfec99e4b048f)

Conflicts:
PendingReleaseNotes

Revision e00270b5 (diff)
Added by Dan Mick over 4 years ago

rados.py: remove Rados.__del__(); it just causes problems

(cherry picked from commit 5ba9b8f21f8010c59dd84a0ef2acfec99e4b048f)

Conflicts:
PendingReleaseNotes

History

#1 Updated by Dmitry Smirnov about 5 years ago

  • Priority changed from Normal to High

It looks like a librados thread is still active (not terminated) in "rados.py" and `ceph` is waiting for it indefinitely. I'm not familiar with Python, so I can't analyse further.

#2 Updated by Dmitry Smirnov about 5 years ago

  • Priority changed from High to Normal

This bug prevents Ceph daemons from starting.
Debian "testing" is already affected, since Python 2.7.8 has already propagated there.
Please advise.

#3 Updated by Dmitry Smirnov about 5 years ago

  • Priority changed from Normal to High

#4 Updated by Dmitry Smirnov about 5 years ago

Please be advised that this issue appears to be a regression in Python 2.7.8 (see details in the Debian bug report).

#6 Updated by Dan Mick about 5 years ago

Fascinating info so far, Dmitry, thanks for your work on this. Anxious to see what the Python team thinks of the assertion that it's a Python issue.

#7 Updated by Alfredo Deza about 5 years ago

I believe that we should attempt to replicate the problem first, as I know the Python ticket will get ignored unless there is a way to reproduce it (other than installing ceph with that Python version and calling `ceph -s`).

Currently looking at `rados.py` and the `run_in_thread` function looks like a good candidate to start with:

def run_in_thread(target, args, timeout=0):
    interrupt = False

    countdown = timeout
    t = RadosThread(target, args)

    # allow the main thread to exit (presumably, avoid a join() on this
    # subthread) before this thread terminates.  This allows SIGINT
    # exit of a blocked call.  See below.
    t.daemon = True

    t.start()
    try:
        # poll for thread exit
        while t.is_alive():
            t.join(POLL_TIME_INCR)
            if timeout and t.is_alive():
                countdown = countdown - POLL_TIME_INCR
                if countdown <= 0:
                    raise KeyboardInterrupt

        t.join()        # in case t exits before reaching the join() above
    except KeyboardInterrupt:
        # ..but allow SIGINT to terminate the waiting.  Note: this
        # relies on the Linux kernel behavior of delivering the signal
        # to the main thread in preference to any subthread (all that's
        # strictly guaranteed is that *some* thread that has the signal
        # unblocked will receive it).  But there doesn't seem to be
        # any interface to create t with SIGINT blocked.
        interrupt = True

    if interrupt:
        t.retval = -errno.EINTR
    return t.retval
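
For readers who want to experiment with this pattern outside Ceph, here is a minimal self-contained sketch of the same polling join on a daemon thread. It is an approximation under stated assumptions: `WorkerThread` is a hypothetical stand-in for `RadosThread`, the `POLL_TIME_INCR` value is assumed, and the `KeyboardInterrupt` handling is simplified to a direct return on timeout.

```python
import errno
import threading
import time

POLL_TIME_INCR = 0.05  # assumed polling interval, in seconds


class WorkerThread(threading.Thread):
    """Hypothetical stand-in for RadosThread: runs fn(*args), keeps the result."""

    def __init__(self, fn, fn_args=()):
        super().__init__()
        self._fn = fn
        self._fn_args = fn_args
        self.retval = None

    def run(self):
        self.retval = self._fn(*self._fn_args)


def run_with_timeout(fn, fn_args=(), timeout=0):
    t = WorkerThread(fn, fn_args)
    # A daemon thread does not block interpreter exit if it never finishes.
    t.daemon = True
    t.start()

    remaining = timeout
    while t.is_alive():
        # Short joins keep the main thread responsive to signals.
        t.join(POLL_TIME_INCR)
        if timeout and t.is_alive():
            remaining -= POLL_TIME_INCR
            if remaining <= 0:
                # Give up waiting; the daemon thread is abandoned.
                return -errno.EINTR
    return t.retval
```

Calling `run_with_timeout(time.sleep, (1.0,), timeout=0.1)` returns `-errno.EINTR` after roughly 0.1 seconds, while a call that completes in time returns the function's own result.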

#8 Updated by Dmitry Smirnov about 5 years ago

For the moment, the Python maintainer in Debian has kindly fixed this issue for us by adding a patch that reverts the problematic change in Python.
However, this is a time bomb, as it (potentially) affects Ceph on all architectures and on all distributions other than Debian.
Please follow up with the Python developers or make changes for compatibility with Python 2.7.8.
This is a very serious issue because no cluster components can be started with vanilla Python 2.7.8.

#9 Updated by Boris Ranto almost 5 years ago

Just a note that people are now hitting this in Fedora 21:

https://bugzilla.redhat.com/show_bug.cgi?id=1155335

#10 Updated by Dan Mick over 4 years ago

This works around the problem but destroys the exit code of the ceph program, so it won't help if you rely on that. It will at least let the command exit.

Change the last line of the ceph script, 'sys.exit(main())', to:

    main()
    os.kill(os.getpid(), 15)
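
To see why this workaround loses the exit code, the following sketch runs two tiny child processes on a POSIX system: one that exits normally and one that sends itself signal 15 (SIGTERM), as in the snippet above. The child scripts are illustrative only and are not part of ceph.

```python
import signal
import subprocess
import sys

# Child that exits normally with a meaningful status.
normal = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])

# Child that terminates itself with SIGTERM (signal 15) instead.
killed = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"]
)

# On POSIX, subprocess reports death by signal N as returncode -N,
# so whatever status main() would have returned is no longer visible.
print(normal.returncode, killed.returncode)
```

A shell would likewise see a signal-based status rather than the value returned by `main()`.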

#11 Updated by Joe Julian over 4 years ago

In order to get the exit code, I tried this:

    result = main()
    del cluster_handle
    sys.exit(result)

Which resulted in a core dump:

Illegal instruction (core dumped)

Which, of course, was the same result if I tried cluster_handle.shutdown() (since Rados.__del__ does the same call).

I was able to work around this by removing the Rados.__del__ function from rados.py. This allowed the thread to be at least abandoned with the above code and sys.exit to conclude.

#12 Updated by Joe Julian over 4 years ago

The SIGILL was cured in master with the application of 92615ea and cf2104d. I've tested backporting these to firefly which allowed shutdown to be called without crashing.

It is still necessary to del cluster_handle to avoid the hang, which I think happens because threads are not guaranteed to be garbage collected on sys.exit.

PR #3053

#13 Updated by Dan Mick over 4 years ago

I think the right fix for this is to remove Rados.__del__. I'll come up with a pull request unless you want to, Joe.

#14 Updated by Dan Mick over 4 years ago

  • Assignee set to Dan Mick

#15 Updated by Dan Mick over 4 years ago

  • Backport set to firefly

#16 Updated by Loic Dachary over 4 years ago

  • Status changed from New to Pending Backport

giant also?

#17 Updated by Loic Dachary over 4 years ago

  • Backport changed from firefly to firefly,giant

#18 Updated by Loic Dachary over 4 years ago

  • Status changed from Pending Backport to Resolved

#21 Updated by Loic Dachary over 4 years ago

  • Status changed from Resolved to Pending Backport

#22 Updated by Loic Dachary over 4 years ago

  • e00270b rados.py: remove Rados.__del__(); it just causes problems (in firefly)
  • ed8c9af rados.py: remove Rados.__del__(); it just causes problems (in giant)

#23 Updated by Loic Dachary over 4 years ago

  • Status changed from Pending Backport to Resolved
