Bug #1977
mon: ceph command hang
Description
/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8881
2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia7229998-229
2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia7229998-229
2012-01-24T02:41:21.560 INFO:teuthology.task.rados.rados.0.err:0 errors.
2012-01-24T02:41:21.561 INFO:teuthology.task.rados.rados.0.err:
2012-01-24T02:41:26.583 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x1587ad0>
2012-01-24T02:41:26.583 INFO:teuthology.task.thrashosds:joining thrashosds
[hangs]
I killed the ceph process:
ubuntu@sepia74:~$ ps ax|grep out
17240 ?        Ssl    0:00 /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd out 0
18624 pts/0    S+     0:00 grep --color=auto out
ubuntu@sepia74:~$ kill 17240
It was stuck at:
Thread 1 (Thread 0x7f028c572760 (LWP 17240)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004542f2 in Wait (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at ./common/Cond.h:48
#2  do_command (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at tools/common.cc:446
#3  0x00000000004509e9 in main (argc=<value optimized out>, argv=<value optimized out>) at tools/ceph.cc:249
History
#1 Updated by Sage Weil almost 12 years ago
hrm.. I didn't manage to reproduce a hang, but I did reproduce a failure. A transient error made a command succeed but the ack got lost, so the client resent.. and then got -EEXIST or -EINVAL because it already happened.
So.. should 'ceph osd out 0' return success or EINVAL if osd 0 is already out? Or should the tool user check the error code/message carefully? :/
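To make the trade-off concrete, here is a minimal Python sketch of the lost-ack scenario described above. It is a toy model, not Ceph code: `ToyMonitor`, `send_with_lost_ack`, and the `idempotent` flag are hypothetical names invented for illustration, and the real monitor's return codes and messages may differ.

```python
import errno


class ToyMonitor:
    """Hypothetical stand-in for the monitor's OSD in/out state (not Ceph code)."""

    def __init__(self, osds, idempotent):
        self.in_set = set(osds)       # OSDs currently marked "in"
        self.idempotent = idempotent  # policy switch, assumed for this sketch

    def osd_out(self, osd_id):
        """Handle an 'osd out <id>' request and return (return_code, message)."""
        if osd_id in self.in_set:
            self.in_set.discard(osd_id)
            return 0, f"marked out osd.{osd_id}"
        if self.idempotent:
            # Treat the command as a desired end state: already there, so succeed.
            return 0, f"osd.{osd_id} is already out"
        # Treat the command as an action: it cannot be performed twice.
        return -errno.EINVAL, f"osd.{osd_id} is already out"


def send_with_lost_ack(mon, osd_id, lose_first_n_acks=1):
    """Send the command, drop the first replies, and resend until one arrives."""
    attempts = 0
    while True:
        attempts += 1
        reply = mon.osd_out(osd_id)  # the monitor applies every request it receives
        if attempts > lose_first_n_acks:
            return attempts, reply   # this reply actually reaches the client


if __name__ == "__main__":
    for idempotent in (False, True):
        mon = ToyMonitor(osds=[0, 1, 2], idempotent=idempotent)
        attempts, (rc, msg) = send_with_lost_ack(mon, 0)
        print(f"idempotent={idempotent}: attempts={attempts} rc={rc} msg={msg}")
```

In both runs osd 0 ends up out after the first attempt. The policies differ only in what the resend reports: an error the caller must interpret, or success with an informational message.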
#2 Updated by Greg Farnum almost 12 years ago
The proper behavior is more a question of what the command means, I think. I tend to think of them as being an action, rather than a desired state to end in, which makes me want to say the proper behavior is returning -EINVAL.
But that's awfully inconvenient under "transient errors" like this (what kind of transient error, anyway, that's causing persistent trouble?), and I see you already pushed a change. The code contains a simple notification text saying "already out", and I can go with that.
#3 Updated by Sage Weil almost 12 years ago
a new monitor election could do it, or a socket error between the ceph command and monitor.
#4 Updated by Sage Weil almost 12 years ago
Hmm, I wonder if I somehow misdiagnosed this, or inadvertently fixed it: I haven't seen this hang in weeks, and it happened several times at the time.
#5 Updated by Greg Farnum almost 12 years ago
Pretty sure you pushed changes the day you filed it (note reference in previous message), although I can't find the exact commit now...unless they're in an unmerged branch?
#6 Updated by Sage Weil almost 12 years ago
- Priority changed from High to Normal
#7 Updated by Sage Weil almost 12 years ago
- Status changed from New to Can't reproduce
we can reopen if this ever pops up again