Bug #8797

closed

"ceph status" do not exit with python_2.7.8

Added by Dmitry Smirnov almost 10 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
ceph cli
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
firefly,giant
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

As reported in

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754341

after upgrade to python 2.7.8 "`ceph -s`" hangs instead of returning to shell.


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #10567: Python Ioctx should retain a reference to the cluster object (Resolved, 01/18/2015)

Actions #1

Updated by Dmitry Smirnov almost 10 years ago

  • Priority changed from Normal to High

It looks like a librados thread is still active (not terminated) in "rados.py", and `ceph` waits for it indefinitely. I'm illiterate in Python, so I can't analyse further...

Actions #2

Updated by Dmitry Smirnov almost 10 years ago

  • Priority changed from High to Normal

This bug prevents Ceph daemons from starting.
Debian "testing" is already affected, since Python 2.7.8 has already propagated there.
Please advise.

Actions #3

Updated by Dmitry Smirnov almost 10 years ago

  • Priority changed from Normal to High
Actions #4

Updated by Dmitry Smirnov almost 10 years ago

Please be advised that this issue appears to be a regression in Python 2.7.8 (see details in the Debian bug report).

Actions #6

Updated by Dan Mick almost 10 years ago

Fascinating info so far, Dmitry, thanks for your work on this. Anxious to see what the Python team thinks of the assertion that it's a Python issue.

Actions #7

Updated by Alfredo Deza almost 10 years ago

I believe that we should attempt to replicate the problem first, as I know the Python ticket will get ignored unless there is a way to reproduce it (other than installing ceph with that Python version and calling `ceph -s`).

Currently looking at `rados.py` and the `run_in_thread` function looks like a good candidate to start with:

def run_in_thread(target, args, timeout=0):
    interrupt = False

    countdown = timeout
    t = RadosThread(target, args)

    # allow the main thread to exit (presumably, avoid a join() on this
    # subthread) before this thread terminates.  This allows SIGINT
    # exit of a blocked call.  See below.
    t.daemon = True

    t.start()
    try:
        # poll for thread exit
        while t.is_alive():
            t.join(POLL_TIME_INCR)
            if timeout and t.is_alive():
                countdown = countdown - POLL_TIME_INCR
                if countdown <= 0:
                    raise KeyboardInterrupt

        t.join()        # in case t exits before reaching the join() above
    except KeyboardInterrupt:
        # ..but allow SIGINT to terminate the waiting.  Note: this
        # relies on the Linux kernel behavior of delivering the signal
        # to the main thread in preference to any subthread (all that's
        # strictly guaranteed is that *some* thread that has the signal
        # unblocked will receive it).  But there doesn't seem to be
        # any interface to create t with SIGINT blocked.
        interrupt = True

    if interrupt:
        t.retval = -errno.EINTR
    return t.retval
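
For readers who want to experiment without a Ceph installation, here is a minimal, self-contained sketch of the polling-join pattern quoted above. It is not the rados.py code: `ResultThread`, `poll_join`, and the 0.05 s value of `POLL_TIME_INCR` are assumptions made for illustration, and on timeout the sketch returns None rather than -errno.EINTR.

```python
import threading
import time

POLL_TIME_INCR = 0.05  # assumed polling interval; the real value in rados.py may differ


class ResultThread(threading.Thread):
    """Hypothetical stand-in for RadosThread: runs target and stores its result."""

    def __init__(self, target, args):
        threading.Thread.__init__(self)
        self.target = target
        self.args = args
        self.retval = None

    def run(self):
        self.retval = self.target(*self.args)


def poll_join(target, args, timeout=0):
    """Run target in a daemon thread; return its result, or None on timeout."""
    t = ResultThread(target, args)
    t.daemon = True  # a daemon thread does not block interpreter exit
    t.start()
    deadline = time.time() + timeout if timeout else None
    while t.is_alive():
        # Short joins instead of one blocking join() keep the main thread
        # able to react to SIGINT between polls.
        t.join(POLL_TIME_INCR)
        if deadline is not None and time.time() >= deadline:
            return None  # give up; the daemon thread is simply abandoned
    return t.retval
```

For example, `poll_join(lambda a, b: a + b, (1, 2))` returns 3, while a target that sleeps longer than `timeout` is abandoned and the call returns None.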
Actions #8

Updated by Dmitry Smirnov over 9 years ago

For the moment, the Python maintainer in Debian has kindly fixed this issue for us by adding a patch that reverts the problematic change in Python.
However, this is a time bomb, as it (potentially) affects Ceph on all architectures and on distributions outside Debian.
Please follow up with the Python developers or make changes for compatibility with Python 2.7.8.
This is a very serious issue because no cluster components can be started with vanilla Python 2.7.8.

Actions #9

Updated by Boris Ranto over 9 years ago

Just a note that people are now hitting this in Fedora 21:

https://bugzilla.redhat.com/show_bug.cgi?id=1155335

Actions #10

Updated by Dan Mick over 9 years ago

This works around the problem but also destroys the exit code of the ceph program. If you rely on the exit code, this won't help, but it will at least let the command exit:

change the last line of ceph, 'sys.exit(main())', to

    main()
    os.kill(os.getpid(), 15)
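
To see why this workaround loses the exit code, the following sketch (an illustration only, assuming a POSIX system; the child one-liners and `run_child` are hypothetical and do not come from the ceph script) runs two tiny child programs and compares what the parent observes.

```python
import signal
import subprocess
import sys

# Child that mimics the workaround: finish the work, then SIGTERM itself.
CHILD_KILL = "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"
# Child that exits normally with a meaningful status (3 is arbitrary).
CHILD_EXIT = "import sys; sys.exit(3)"


def run_child(code):
    """Run a one-line Python program and return its exit status."""
    return subprocess.call([sys.executable, "-c", code])


# On POSIX, subprocess reports death by signal N as -N.
killed_status = run_child(CHILD_KILL)
normal_status = run_child(CHILD_EXIT)
```

Here `killed_status` is the negated signal number regardless of what the program computed, while `normal_status` is the value passed to sys.exit. A calling shell typically reports the killed case as 128 + 15 = 143.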
Actions #11

Updated by Joe Julian over 9 years ago

In order to get the exit code, I tried this:

    result = main()
    del cluster_handle
    sys.exit(result)

Which resulted in a core dump:

Illegal instruction (core dumped)

Calling cluster_handle.shutdown() directly gave the same result, of course, since Rados.__del__ makes the same call.

I was able to work around this by removing the Rados.__del__ function from rados.py. With the above code, the thread is then at least abandoned and sys.exit can complete.

Actions #12

Updated by Joe Julian over 9 years ago

The SIGILL was cured in master with the application of 92615ea and cf2104d. I've tested backporting these to firefly which allowed shutdown to be called without crashing.

It is still necessary to del cluster_handle to avoid the hang, which I think happens because threads are not guaranteed to be garbage collected on sys.exit.

PR #3053
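
As an illustration of the general idea discussed here (release the handle explicitly and still return the real exit code, rather than relying on a finalizer at interpreter exit), here is a minimal sketch. It is not the change that was merged, and `FakeCluster`, `main`, and `run` are hypothetical names; `FakeCluster` is not the rados.Rados API.

```python
class FakeCluster(object):
    """Hypothetical stand-in for a cluster handle; not the rados.Rados API."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        # In this sketch shutdown only records that it ran.
        self.is_shut_down = True


def main(cluster):
    """Pretend to run a command; 0 means success."""
    return 0


def run(cluster):
    """Run main, always release the handle, and hand back the exit code."""
    try:
        return main(cluster)
    finally:
        # Explicit cleanup at a known point, instead of a __del__ method
        # that runs at an unpredictable time during interpreter shutdown.
        cluster.shutdown()
```

A script built this way would end with something like `sys.exit(run(handle))`, so the exit code survives and cleanup happens before the interpreter starts tearing down threads.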

Actions #13

Updated by Dan Mick over 9 years ago

I think the right fix for this is to remove Rados.__del__. I'll come up with a pull request unless you want to, Joe.

Actions #14

Updated by Dan Mick over 9 years ago

  • Assignee set to Dan Mick
Actions #15

Updated by Dan Mick over 9 years ago

  • Backport set to firefly
Actions #16

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Pending Backport

giant also?

Actions #17

Updated by Loïc Dachary over 9 years ago

  • Backport changed from firefly to firefly,giant
Actions #18

Updated by Loïc Dachary about 9 years ago

  • Status changed from Pending Backport to Resolved
Actions #21

Updated by Loïc Dachary about 9 years ago

  • Status changed from Resolved to Pending Backport
Actions #22

Updated by Loïc Dachary about 9 years ago

e00270b rados.py: remove Rados.__del__(); it just causes problems (in firefly)
ed8c9af rados.py: remove Rados.__del__(); it just causes problems (in giant)

Actions #23

Updated by Loïc Dachary about 9 years ago

  • Status changed from Pending Backport to Resolved