Bug #8797: "ceph status" do not exit with python_2.7.8


closed


Added by Dmitry Smirnov almost 10 years ago. Updated about 9 years ago.

Status: Resolved
Priority: High
Assignee: Dan Mick
Category: ceph cli
Source: Community (user)
Backport: firefly,giant
Severity: 2 - major

Description

As reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754341, after upgrading to Python 2.7.8, "`ceph -s`" hangs instead of returning to the shell.

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #10567: Python Ioctx should retain a reference to the cluster object (Resolved, 01/18/2015)


History

Updated by Dmitry Smirnov almost 10 years ago

Priority changed from Normal to High

It looks like a librados thread is still active (not terminated) in "rados.py" and `ceph` is waiting for it indefinitely. I'm not familiar with Python, so I can't analyse this further.


Updated by Dmitry Smirnov almost 10 years ago

Priority changed from High to Normal

This bug prevents Ceph daemons from starting.
Debian "testing" is already affected, since Python 2.7.8 has already propagated there.
Please advise.


Updated by Dmitry Smirnov almost 10 years ago

Priority changed from Normal to High


Updated by Dmitry Smirnov almost 10 years ago

Please be advised that this issue appears to be a regression in Python 2.7.8 (see details in the Debian bug report).


Updated by Dmitry Smirnov almost 10 years ago

http://bugs.python.org/issue21963


Updated by Dan Mick almost 10 years ago

Fascinating info so far, Dmitry, thanks for your work on this. Anxious to see what the Python team thinks of the assertion that it's a Python issue.


Updated by Alfredo Deza almost 10 years ago

I believe we should attempt to replicate the problem first, as I know the Python ticket will get ignored unless there is a way to reproduce it (other than installing ceph with that Python version and calling `ceph -s`).

Currently looking at `rados.py` and the `run_in_thread` function looks like a good candidate to start with:

def run_in_thread(target, args, timeout=0):
    interrupt = False

    countdown = timeout
    t = RadosThread(target, args)

    # allow the main thread to exit (presumably, avoid a join() on this
    # subthread) before this thread terminates.  This allows SIGINT
    # exit of a blocked call.  See below.
    t.daemon = True

    t.start()
    try:
        # poll for thread exit
        while t.is_alive():
            t.join(POLL_TIME_INCR)
            if timeout and t.is_alive():
                countdown = countdown - POLL_TIME_INCR
                if countdown <= 0:
                    raise KeyboardInterrupt

        t.join()        # in case t exits before reaching the join() above
    except KeyboardInterrupt:
        # ..but allow SIGINT to terminate the waiting.  Note: this
        # relies on the Linux kernel behavior of delivering the signal
        # to the main thread in preference to any subthread (all that's
        # strictly guaranteed is that *some* thread that has the signal
        # unblocked will receive it).  But there doesn't seem to be
        # any interface to create t with SIGINT blocked.
        interrupt = True

    if interrupt:
        t.retval = -errno.EINTR
    return t.retval
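
To experiment with this polling-join pattern outside of Ceph, a self-contained sketch along the following lines may help. `WorkerThread` is a hypothetical stand-in for `RadosThread`, and the `POLL_TIME_INCR` value is an assumption; neither is copied from rados.py. It is not a reproducer for the hang itself, only a way to exercise the function in isolation.

```python
import errno
import threading
import time

# Assumed polling interval in seconds; the real constant is defined in rados.py.
POLL_TIME_INCR = 0.1


class WorkerThread(threading.Thread):
    """Hypothetical stand-in for RadosThread: runs target(*args), keeps the result."""

    def __init__(self, target, args=()):
        super(WorkerThread, self).__init__()
        self.target = target
        self.args = args
        self.retval = None

    def run(self):
        self.retval = self.target(*self.args)


def run_in_thread(target, args, timeout=0):
    interrupt = False
    countdown = timeout
    t = WorkerThread(target, args)
    # Daemon thread: the interpreter is not supposed to wait for it at exit.
    t.daemon = True
    t.start()
    try:
        # Poll so that the main thread stays responsive to SIGINT.
        while t.is_alive():
            t.join(POLL_TIME_INCR)
            if timeout and t.is_alive():
                countdown = countdown - POLL_TIME_INCR
                if countdown <= 0:
                    raise KeyboardInterrupt
        t.join()
    except KeyboardInterrupt:
        interrupt = True

    if interrupt:
        t.retval = -errno.EINTR
    return t.retval
```

Calling `run_in_thread(time.sleep, (1,), timeout=0.2)` gives up on the worker once the countdown expires and returns `-errno.EINTR`, leaving the daemon thread running in the background, which is the situation the comments in the original function describe.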


Updated by Dmitry Smirnov over 9 years ago

For the moment, the Python maintainer in Debian has kindly fixed this issue for us by adding a patch that reverts the problematic change in Python.
However, this is a time bomb, as it (potentially) affects Ceph on all architectures and on all distributions outside Debian.
Please follow up with the Python developers or make changes for compatibility with Python 2.7.8.
This is a very serious issue because no cluster components can be started with vanilla Python 2.7.8.


Updated by Boris Ranto over 9 years ago

Just a note that people are now hitting this in Fedora 21:

https://bugzilla.redhat.com/show_bug.cgi?id=1155335


#10

Updated by Dan Mick over 9 years ago

This works around the problem, but it also destroys the exit code of the ceph program, so it won't help if you rely on that. It will at least let the command exit.

Change the last line of ceph, 'sys.exit(main())', to:

    main()
    os.kill(os.getpid(), 15)
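
To see what "destroying the exit code" means in practice, here is a small sketch that runs a child interpreter using the same two-line ending and inspects its return status. The `CHILD` script and its `main()` returning 3 are hypothetical, and the sketch assumes a POSIX system where signal 15 is SIGTERM.

```python
import signal
import subprocess
import sys

# Hypothetical child program: main() wants to report status 3,
# but the process then terminates itself with signal 15 (SIGTERM).
CHILD = """
import os

def main():
    return 3

main()
os.kill(os.getpid(), 15)
"""


def run_child():
    # On POSIX, subprocess reports death by signal N as return code -N.
    return subprocess.call([sys.executable, "-c", CHILD])


if __name__ == "__main__":
    print(run_child())
```

The value returned by `main()` never reaches the caller; the parent only sees that the child was terminated by SIGTERM, so scripts that test the status of `ceph` cannot tell success from failure.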


#11

Updated by Joe Julian over 9 years ago

In order to get the exit code, I tried this:

    result = main()
    del cluster_handle
    sys.exit(result)

Which resulted in a core dump:

Illegal instruction (core dumped)

Which, of course, was the same result if I tried cluster_handle.shutdown() (since Rados.__del__ does the same call).

I was able to work around this by removing the Rados.__del__ function from rados.py. This allowed the thread to be at least abandoned with the above code and sys.exit to conclude.
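
The observation that `del cluster_handle` and `cluster_handle.shutdown()` end in the same call can be illustrated with a minimal sketch. `Handle` is a hypothetical class, not the real `rados.Rados`, and the sketch relies on CPython's reference counting to run `__del__` as soon as the last reference is dropped.

```python
import gc


class Handle(object):
    """Hypothetical handle whose finalizer calls shutdown(), like the old Rados.__del__."""

    def __init__(self, log):
        self.log = log

    def shutdown(self):
        self.log.append("shutdown")

    def __del__(self):
        self.shutdown()


def demo():
    log = []
    h = Handle(log)
    del h          # drops the last reference; CPython runs __del__ here
    gc.collect()   # harmless on CPython; helps on implementations without refcounting
    return log
```

In this sketch `demo()` records exactly one `shutdown` call, triggered implicitly by the finalizer rather than by any explicit call in the program.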


#12

Updated by Joe Julian over 9 years ago

The SIGILL was cured in master with the application of 92615ea and cf2104d. I've tested backporting these to firefly which allowed shutdown to be called without crashing.

I still need to del cluster_handle to avoid the hang, which I think happens because threads are not guaranteed to be garbage-collected on sys.exit.

PR #3053


#13

Updated by Dan Mick over 9 years ago

I think the right fix for this is to remove Rados.__del__. I'll come up with a pull request unless you want to, Joe.
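
With no `__del__` on the handle, cleanup becomes the caller's job. A rough sketch of that explicit pattern follows; `FakeCluster`, `main` and `run` are hypothetical stand-ins for illustration and do not reflect the actual change made to rados.py or to the ceph script.

```python
class FakeCluster(object):
    """Hypothetical stand-in for a cluster handle; not the real rados.Rados."""

    def __init__(self):
        self.state = "connected"

    def shutdown(self):
        # Idempotent: calling it twice is harmless.
        self.state = "shutdown"


def main(cluster):
    # Placeholder for the real command logic; returns the desired exit code.
    return 0 if cluster.state == "connected" else 1


def run():
    cluster = FakeCluster()
    try:
        result = main(cluster)
    finally:
        # Explicit, deterministic cleanup instead of relying on a finalizer.
        cluster.shutdown()
    return result, cluster.state
```

A caller would then pass the first element of `run()` to `sys.exit`, so the exit code survives and the shutdown happens at a known point rather than whenever the object is collected.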


#14

Updated by Dan Mick over 9 years ago

Assignee set to Dan Mick


#15

Updated by Dan Mick over 9 years ago

Backport set to firefly


#16

Updated by Loïc Dachary over 9 years ago

Status changed from New to Pending Backport

giant also?


#17

Updated by Loïc Dachary over 9 years ago

Backport changed from firefly to firefly,giant


#18

Updated by Loïc Dachary about 9 years ago

Status changed from Pending Backport to Resolved


#19

Updated by Loïc Dachary about 9 years ago

merged in master with https://github.com/ceph/ceph/pull/3119


#20

Updated by Loïc Dachary about 9 years ago

merged in giant by https://github.com/ceph/ceph/pull/3168


#21

Updated by Loïc Dachary about 9 years ago

Status changed from Resolved to Pending Backport


#22

Updated by Loïc Dachary about 9 years ago

e00270b rados.py: remove Rados.__del__(); it just causes problems (in firefly)
ed8c9af rados.py: remove Rados.__del__(); it just causes problems (in giant)


#23

Updated by Loïc Dachary about 9 years ago

Status changed from Pending Backport to Resolved
