Bug #1977
closed
mon: ceph command hang
Description
/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8881
2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia7229998-229
2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia7229998-229
2012-01-24T02:41:21.560 INFO:teuthology.task.rados.rados.0.err:0 errors.
2012-01-24T02:41:21.561 INFO:teuthology.task.rados.rados.0.err:
2012-01-24T02:41:26.583 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x1587ad0>
2012-01-24T02:41:26.583 INFO:teuthology.task.thrashosds:joining thrashosds
[hangs]
I killed the ceph process:
ubuntu@sepia74:~$ ps ax|grep out
17240 ?        Ssl    0:00 /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd out 0
18624 pts/0    S+     0:00 grep --color=auto out
ubuntu@sepia74:~$ kill 17240
It was stuck at:
Thread 1 (Thread 0x7f028c572760 (LWP 17240)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004542f2 in Wait (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at ./common/Cond.h:48
#2  do_command (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at tools/common.cc:446
#3  0x00000000004509e9 in main (argc=<value optimized out>, argv=<value optimized out>) at tools/ceph.cc:249
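The backtrace shows the main thread parked in an untimed condition wait inside do_command. A minimal Python sketch of that pattern is below, assuming the wait is for a reply that never arrives. The PendingReply class and its methods are hypothetical names for illustration, not Ceph code; the point is only that a wait with no timeout blocks forever if nothing signals it, while a timed wait returns control to the caller.

```python
import threading


class PendingReply:
    """Hypothetical stand-in for a command waiting on a reply."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False
        self.result = None

    def complete(self, result):
        # Called by whoever delivers the reply; wakes any waiter.
        with self._cond:
            self.result = result
            self._done = True
            self._cond.notify_all()

    def wait(self, timeout=None):
        # With timeout=None this blocks until complete() is called,
        # which mirrors an untimed pthread_cond_wait.
        # With a timeout, it returns False if no reply arrived in time.
        with self._cond:
            return self._cond.wait_for(lambda: self._done, timeout=timeout)


reply = PendingReply()
# Nobody calls complete(), so only the timeout lets us return.
arrived = reply.wait(timeout=0.1)
print("reply arrived:", arrived)
```

Whether a timeout is the right behaviour for the ceph tool is a separate question; the sketch only illustrates why the process sits in pthread_cond_wait when the reply is lost.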
Updated by Sage Weil over 12 years ago
Hrm... I didn't manage to reproduce a hang, but I did reproduce a failure. A transient error let a command succeed but the ack got lost, so the client resent it, and then got -EEXIST or -EINVAL because the change had already happened.
So... should 'ceph osd out 0' return success or EINVAL if osd 0 is already out? Or should the tool user check the error code/message carefully? :/
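The lost-ack scenario can be sketched as a toy simulation. The Monitor class and send_with_resend helper below are hypothetical and are not Ceph's implementation; they assume a handler that treats "osd out" on an already-out OSD as an error, and a client that resends once when it never sees the reply.

```python
import errno


class Monitor:
    """Toy monitor state; hypothetical, for illustration only."""

    def __init__(self):
        self.out_osds = set()

    def osd_out(self, osd_id):
        # "Action" semantics: marking out an already-out OSD is an error.
        if osd_id in self.out_osds:
            return -errno.EINVAL
        self.out_osds.add(osd_id)
        return 0


def send_with_resend(mon, osd_id, drop_first_ack):
    """Send 'osd out'; resend once if the first ack is lost."""
    result = mon.osd_out(osd_id)
    if drop_first_ack:
        # The command was applied, but the client never saw the reply,
        # so it sends the same command again.
        result = mon.osd_out(osd_id)
    return result


# Normal case: the ack arrives and the client sees success.
print(send_with_resend(Monitor(), 0, drop_first_ack=False))

# Lost ack: the OSD really is marked out, but the client sees -EINVAL.
print(send_with_resend(Monitor(), 0, drop_first_ack=True))
```

In the second call the state change happened exactly once, yet the caller is told the command failed, which is the failure described above.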
Updated by Greg Farnum over 12 years ago
The proper behavior is more a question of what the command means, I think. I tend to think of them as being an action, rather than a desired state to end in, which makes me want to say the proper behavior is returning -EINVAL.
But that's awfully inconvenient under "transient errors" like this (what kind of transient error, anyway, that's causing persistent trouble?), and I see you already pushed a change. The code contains a simple notification text saying "already out", and I can go with that.
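A minimal sketch of the "desired state" reading, where repeating the command succeeds and only the message differs. The function name, the return shape, and the exact message wording are assumptions for illustration; the thread only says the notification text contains "already out".

```python
def osd_out_idempotent(out_osds, osd_id):
    """Mark an OSD out; repeating the request is not an error.

    Returns (return_code, message). Hypothetical helper, not Ceph code.
    """
    if osd_id in out_osds:
        # Already in the requested state: report success with a note.
        return 0, f"osd.{osd_id} is already out"
    out_osds.add(osd_id)
    return 0, f"marked out osd.{osd_id}"


state = set()
first = osd_out_idempotent(state, 0)
second = osd_out_idempotent(state, 0)  # e.g. a resend after a lost ack
print(first)
print(second)
```

With this behaviour a resent command is harmless to scripts that check only the return code, while the message still tells a human that nothing changed.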
Updated by Sage Weil over 12 years ago
A new monitor election could do it, or a socket error between the ceph command and the monitor.
Updated by Sage Weil about 12 years ago
Hmm, I wonder if I somehow misdiagnosed this, or inadvertently fixed it: I haven't seen this hang in weeks, and it happened several times back then.
Updated by Greg Farnum about 12 years ago
Pretty sure you pushed changes the day you filed it (note reference in previous message), although I can't find the exact commit now...unless they're in an unmerged branch?
Updated by Sage Weil about 12 years ago
- Status changed from New to Can't reproduce
We can reopen if this ever pops up again.