Bug #1977

mon: ceph command hang

Added by Sage Weil almost 12 years ago. Updated almost 12 years ago.

Can't reproduce


3 - minor



2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia7229998-229
2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia7229998-229
2012-01-24T02:41:21.560 INFO:teuthology.task.rados.rados.0.err:0 errors.
2012-01-24T02:41:21.561 INFO:teuthology.task.rados.rados.0.err:
2012-01-24T02:41:26.583 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x1587ad0>
2012-01-24T02:41:26.583 INFO:teuthology.task.thrashosds:joining thrashosds

i killed the ceph process

ubuntu@sepia74:~$ ps ax|grep out
17240 ?        Ssl    0:00 /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd out 0
18624 pts/0    S+     0:00 grep --color=auto out
ubuntu@sepia74:~$ kill 17240

it was stuck at

Thread 1 (Thread 0x7f028c572760 (LWP 17240)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004542f2 in Wait (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at ./common/Cond.h:48
#2  do_command (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at tools/
#3  0x00000000004509e9 in main (argc=<value optimized out>, argv=<value optimized out>) at tools/
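
The backtrace shows the main thread parked in a condition-variable wait inside do_command, presumably waiting for a reply that never arrived. A minimal Python sketch of that pattern follows. It is not the ceph tool's code: `PendingCommand`, `deliver`, `wait_for_reply`, the reply string, and the timeout values are all hypothetical names chosen for illustration. With `timeout=None` the wait blocks until a reply is delivered, which matches the hang in the trace if the reply is lost; with a bounded timeout a lost reply surfaces as `None` instead.

```python
import threading


class PendingCommand:
    """Hypothetical stand-in for a command waiting on a monitor reply."""

    def __init__(self):
        self._cond = threading.Condition()
        self._reply = None

    def deliver(self, reply):
        # Called by the (simulated) messenger thread when a reply arrives.
        with self._cond:
            self._reply = reply
            self._cond.notify_all()

    def wait_for_reply(self, timeout=None):
        # timeout=None blocks until deliver() is called, which is forever
        # if the reply was lost. A finite timeout returns None instead.
        with self._cond:
            self._cond.wait_for(lambda: self._reply is not None, timeout=timeout)
            return self._reply


# Reply arrives: the waiter wakes up and returns it.
ok = PendingCommand()
threading.Timer(0.05, ok.deliver, args=("ack",)).start()
print(ok.wait_for_reply(timeout=2.0))

# Reply lost: a bounded wait gives up instead of hanging.
lost = PendingCommand()
print(lost.wait_for_reply(timeout=0.1))
```

Whether the real tool should time out, resend, or keep waiting is a separate design question; the sketch only shows why an unbounded wait never returns when the reply goes missing.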


#1 Updated by Sage Weil almost 12 years ago

Hrm, I didn't manage to reproduce a hang, but I did reproduce a failure. The command succeeded, but a transient error lost the ack, so the client resent it and then got -EEXIST or -EINVAL because the change had already been applied.

So.. should 'ceph osd out 0' return success or EINVAL if osd 0 is already out? Or should the tool user check the error code/message carefully? :/
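
The lost-ack failure can be reproduced in a small simulation. This is a sketch under stated assumptions, not monitor code: `FakeMonitor`, `osd_out`, and `send_with_retry` are hypothetical, and the handler here deliberately rejects a repeated command with -EINVAL.

```python
import errno


class FakeMonitor:
    """Hypothetical monitor that tracks which OSDs are marked out."""

    def __init__(self):
        self.out = set()

    def osd_out(self, osd_id):
        # Marking an already-out OSD out again is treated as an error.
        if osd_id in self.out:
            return -errno.EINVAL
        self.out.add(osd_id)
        return 0


def send_with_retry(mon, osd_id, drop_first_ack):
    """Send 'osd out'; resend once if the ack is lost in transit."""
    result = mon.osd_out(osd_id)
    if drop_first_ack:
        # The command was applied, but the client never saw the ack,
        # so it resends the same command.
        result = mon.osd_out(osd_id)
    return result


# Ack delivered: the single attempt succeeds.
print(send_with_retry(FakeMonitor(), 0, drop_first_ack=False))

# Ack lost: the resend hits the already-applied change and fails.
print(send_with_retry(FakeMonitor(), 0, drop_first_ack=True))
```

In the second call the client sees an error even though osd 0 ends up out, which is the state the caller asked for.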

#2 Updated by Greg Farnum almost 12 years ago

The proper behavior is more a question of what the command means, I think. I tend to think of them as being an action, rather than a desired state to end in, which makes me want to say the proper behavior is returning -EINVAL.

But that's awfully inconvenient under "transient errors" like this (what kind of transient error is causing persistent trouble, anyway?), and I see you already pushed a change. The code contains a simple notification text saying "already out", and I can go with that.
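
The "already out" approach amounts to treating the command as a desired end state: a repeat succeeds and only the message differs. A rough sketch of that idea follows; `handle_osd_out` and the exact message strings are assumptions for illustration, not the actual monitor code.

```python
def handle_osd_out(out_set, osd_id):
    """Return (return_code, message); repeating the command is not an error."""
    if osd_id in out_set:
        # Nothing to change: report success with a note instead of -EINVAL.
        return 0, f"osd.{osd_id} is already out"
    out_set.add(osd_id)
    return 0, f"marked out osd.{osd_id}"


state = set()
print(handle_osd_out(state, 0))
print(handle_osd_out(state, 0))  # a resend after a lost ack still returns 0
```

The trade-off is the one raised above: a caller that wants to know whether its own command caused the change has to read the message, since the return code no longer distinguishes the two cases.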

#3 Updated by Sage Weil almost 12 years ago

A new monitor election could do it, or a socket error between the ceph command and the monitor.

#4 Updated by Sage Weil almost 12 years ago

Hmm, I wonder if I somehow misdiagnosed this, or inadvertently fixed it: I haven't seen this hang in weeks, and it happened several times back then.

#5 Updated by Greg Farnum almost 12 years ago

Pretty sure you pushed changes the day you filed it (note reference in previous message), although I can't find the exact commit now...unless they're in an unmerged branch?

#6 Updated by Sage Weil almost 12 years ago

  • Priority changed from High to Normal

#7 Updated by Sage Weil almost 12 years ago

  • Status changed from New to Can't reproduce

we can reopen if this ever pops up again
