Bug #5390

ceph-deploy osd create hangs

Added by Da Chun Wu over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On Ubuntu 13.04 with ceph 0.61.3, creating a new OSD with ceph-deploy hangs:

ceph@ceph-node4:~/mycluster$ ceph-deploy disk zap ceph-node4:sdd
ceph@ceph-node4:~/mycluster$ ceph-deploy disk zap ceph-node4:sdb
ceph@ceph-node4:~/mycluster$ ceph-deploy osd create ceph-node4:sdb:sdd
^CTraceback (most recent call last):
  File "/home/ceph/ceph-deploy/ceph-deploy", line 9, in <module>
    load_entry_point('ceph-deploy==0.1', 'console_scripts', 'ceph-deploy')()
  File "/home/ceph/ceph-deploy/ceph_deploy/cli.py", line 112, in main
    return args.func(args)
  File "/home/ceph/ceph-deploy/ceph_deploy/osd.py", line 425, in osd
    prepare(args, cfg, activate_prepared_disk=True)
  File "/home/ceph/ceph-deploy/ceph_deploy/osd.py", line 265, in prepare
    dmcrypt_dir=args.dmcrypt_key_dir,
  File "/home/ceph/ceph-deploy/virtualenv/local/lib/python2.7/site-packages/pushy-0.5.1-py2.7.egg/pushy/protocol/proxy.py", line 255, in <lambda>
    (conn.operator(type_, self, args, kwargs))
  File "/home/ceph/ceph-deploy/virtualenv/local/lib/python2.7/site-packages/pushy-0.5.1-py2.7.egg/pushy/protocol/connection.py", line 66, in operator
    return self.send_request(type_, (object, args, kwargs))
  File "/home/ceph/ceph-deploy/virtualenv/local/lib/python2.7/site-packages/pushy-0.5.1-py2.7.egg/pushy/protocol/baseconnection.py", line 315, in send_request
    m = self.__waitForResponse(handler)
  File "/home/ceph/ceph-deploy/virtualenv/local/lib/python2.7/site-packages/pushy-0.5.1-py2.7.egg/pushy/protocol/baseconnection.py", line 412, in __waitForResponse
    self.__processing_condition.wait()
  File "/usr/lib/python2.7/threading.py", line 339, in wait
    waiter.acquire()

ps aux | grep ceph
ceph 4015 0.0 1.1 118412 11404 pts/1 Sl+ 20:51 0:00 /home/ceph/ceph-deploy/virtualenv/bin/python /home/ceph/ceph-deploy/ceph-deploy osd create ceph-node4:sdb:sdd
root 4043 0.0 0.0 4444 628 pts/1 S+ 20:51 0:00 /bin/sh /usr/sbin/ceph-disk-prepare -- /dev/sdb /dev/sdd
root 4049 0.1 0.9 43216 9876 pts/1 S+ 20:51 0:00 /usr/bin/python /usr/sbin/ceph-disk prepare -- /dev/sdb /dev/sdd

As I said on the mailing list, the root cause has been found:

Earlier I ran "ceph-deploy osd create ceph-node4:sdb" by mistake and terminated it with Ctrl-C, so the lock on /var/lib/ceph/tmp/ceph-disk.prepare.lock.lock was never released.
The next "ceph-deploy osd create" then hung waiting for that lock.

It's a user error, but not an easy one to track down.

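To avoid this problem, maybe we can catch SIGINT in ceph-disk:

import signal
import sys

def signal_handler(signum, frame):
    # Release the prepare lock (prepare_lock is defined elsewhere in ceph-disk)
    # so that the next run does not block on it, then exit.
    prepare_lock.release()
    sys.exit(0)

# ....
signal.signal(signal.SIGINT, signal_handler)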

Or at least, to make the problem easier to diagnose, IMHO "ceph-deploy osd prepare" should report a meaningful error message instead of running until it hangs.
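
One way to get an error message instead of an indefinite hang is to put a deadline on the lock wait. The following is only a minimal sketch in Python 3, not ceph-disk's code: `LockTimeout` and `acquire_with_timeout` are hypothetical names, and the lock is modelled as a plain file created with O_EXCL, similar in spirit to a lockfile-style lock.

```python
import os
import time


class LockTimeout(Exception):
    """Raised when the lock file could not be created before the deadline."""


def acquire_with_timeout(path, timeout=5.0, poll=0.1):
    """Create `path` exclusively, or raise LockTimeout naming the file."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(
                    "timed out after %.1fs waiting for %s; "
                    "a previous run may have left it behind" % (timeout, path)
                )
            time.sleep(poll)
        else:
            os.close(fd)
            return path
```

With a message like this, a leftover lock from an interrupted run would show up as a named file in the error output rather than as a silent hang.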


Related issues

Related to devops - Bug #5387: ceph-disk: lockfile does not detect stale locks (dead parent process) Resolved 06/17/2013

History

#1 Updated by Sage Weil over 10 years ago

  • Status changed from New to In Progress
  • Priority changed from Normal to High

see also #5387. i'll also add the SIGINT handler to reduce the probability of this happening!

#2 Updated by Sage Weil over 10 years ago

  • Status changed from In Progress to Fix Under Review
  • Priority changed from High to Urgent

care to review the top patch in wip-ceph-disk?

alternatively, do you know of a replacement for lockfile that will detect when the owning pid is not running? this would be more robust...
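
For illustration, detecting whether the owning pid is still running could look roughly like the sketch below. It assumes the lock file contains the owner's pid as text, which is an assumption made for this example and not how the lockfile module stores its state; `pid_is_running` and `lock_is_stale` are hypothetical helper names, and the check is POSIX-only.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_stale(path):
    """Return True if the pid recorded in `path` is no longer running."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or malformed file: do not guess
    if pid <= 0:
        return False  # never probe pid 0 or negative pids (process groups)
    return not pid_is_running(pid)
```

A check like this is inherently racy: the pid can be reused by an unrelated process, and the owner can exit between the check and the decision to break the lock.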

#3 Updated by Sage Weil over 10 years ago

  • Assignee set to Sage Weil

#4 Updated by Sage Weil over 10 years ago

starting with the mercurial lock implementation, which uses a pid. see wip-ceph-disk-lock, tho still incomplete.

#5 Updated by Sage Weil over 10 years ago

bah, trivial fcntl(2) is all we need here.
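
A rough sketch of what an fcntl(2)-based lock can look like in Python 3 follows. It is not the patch referred to above: `FcntlLock` is a hypothetical class written for this example. The relevant property is that the kernel drops the lock when the holding process exits, including after Ctrl-C or a kill, so no stale lock can be left behind.

```python
import fcntl
import os


class FcntlLock:
    """Advisory whole-file lock; the kernel releases it when the process exits."""

    def __init__(self, path):
        self.path = path
        self.fd = None

    def acquire(self, blocking=True):
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o644)
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        try:
            fcntl.lockf(fd, flags)  # raises OSError if non-blocking and held
        except OSError:
            os.close(fd)
            raise
        self.fd = fd

    def release(self):
        if self.fd is not None:
            fcntl.lockf(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()
```

Usage would be along the lines of `with FcntlLock(path): ...`. Note that these locks are per process, so they only serialize separate processes, not threads or multiple descriptors within one process.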

#6 Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Sage Weil over 10 years ago

  • Status changed from Pending Backport to Resolved
