Bug #1977

closed

mon: ceph command hang

Added by Sage Weil over 12 years ago. Updated about 12 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
-
% Done:

0%


Description

/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8881

2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia7229998-229
2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia7229998-229
2012-01-24T02:41:21.560 INFO:teuthology.task.rados.rados.0.err:0 errors.
2012-01-24T02:41:21.561 INFO:teuthology.task.rados.rados.0.err:
2012-01-24T02:41:26.583 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x1587ad0>
2012-01-24T02:41:26.583 INFO:teuthology.task.thrashosds:joining thrashosds
[hangs]

I killed the ceph process:

ubuntu@sepia74:~$ ps ax|grep out
17240 ?        Ssl    0:00 /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd out 0
18624 pts/0    S+     0:00 grep --color=auto out
ubuntu@sepia74:~$ kill 17240

It was stuck at:

Thread 1 (Thread 0x7f028c572760 (LWP 17240)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004542f2 in Wait (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at ./common/Cond.h:48
#2  do_command (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at tools/common.cc:446
#3  0x00000000004509e9 in main (argc=<value optimized out>, argv=<value optimized out>) at tools/ceph.cc:249
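The backtrace shows the tool blocked in an unbounded condition-variable wait for a reply. As a minimal sketch of the alternative (not the code in tools/common.cc), a bounded wait lets the caller notice a missing reply and decide whether to resend or give up. The class name `CommandWaiter` and its methods are hypothetical and only illustrate the pattern:

```python
import threading


class CommandWaiter:
    """Hypothetical stand-in for a client waiting on a command reply."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False
        self._result = None

    def complete(self, result):
        # Called by the reply handler once the ack arrives.
        with self._cond:
            self._done = True
            self._result = result
            self._cond.notify_all()

    def wait(self, timeout):
        # Bounded wait: raise instead of blocking forever if no ack arrives.
        with self._cond:
            if not self._cond.wait_for(lambda: self._done, timeout=timeout):
                raise TimeoutError("no reply within %.2fs" % timeout)
            return self._result
```

With a timeout the hang becomes a visible error the test harness can report, rather than a process that has to be killed by hand.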

Actions #1

Updated by Sage Weil over 12 years ago

hrm.. I didn't manage to reproduce a hang, but I did reproduce a failure. A transient error let a command succeed but the ack got lost, so the client resent it.. and then got -EEXIST or -EINVAL because the change had already been applied.

So.. should 'ceph osd out 0' return success or EINVAL if osd 0 is already out? Or should the tool user check the error code/message carefully? :/
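To make the failure concrete, here is a small simulation of the sequence described above. It is only a sketch: `StrictMonitor` and `send_with_lost_first_ack` are hypothetical names, and the monitor is reduced to a set of out OSD ids that rejects a repeated request with -EINVAL.

```python
import errno


class StrictMonitor:
    """Hypothetical monitor that treats 'osd out' as a one-shot action."""

    def __init__(self):
        self.out_osds = set()

    def osd_out(self, osd_id):
        if osd_id in self.out_osds:
            # Repeating the action is reported as an error.
            return -errno.EINVAL, "osd.%d is already out" % osd_id
        self.out_osds.add(osd_id)
        return 0, "marked out osd.%d" % osd_id


def send_with_lost_first_ack(mon, osd_id):
    # First attempt is applied, but the reply never reaches the client.
    mon.osd_out(osd_id)
    # The client resends and only ever sees the second reply.
    return mon.osd_out(osd_id)


ret, msg = send_with_lost_first_ack(StrictMonitor(), 0)
print(ret, msg)
```

The OSD ends up out, which is what the caller wanted, yet the only result the caller sees is an error.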

Actions #2

Updated by Greg Farnum over 12 years ago

The proper behavior is more a question of what the command means, I think. I tend to think of them as being an action, rather than a desired state to end in, which makes me want to say the proper behavior is returning -EINVAL.

But that's awfully inconvenient under "transient errors" like this (what kind of transient error causes persistent trouble, anyway?), and I see you already pushed a change. The code contains a simple notification text saying "already out", and I can go with that.
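A rough sketch of the "desired state" reading, where a repeated request succeeds and only the message differs. `IdempotentMonitor` is a hypothetical name, and the message strings are illustrative rather than the monitor's actual wording:

```python
class IdempotentMonitor:
    """Hypothetical monitor that treats 'osd out' as a desired end state."""

    def __init__(self):
        self.out_osds = set()

    def osd_out(self, osd_id):
        if osd_id in self.out_osds:
            # Already in the requested state: succeed, but say so.
            return 0, "osd.%d is already out" % osd_id
        self.out_osds.add(osd_id)
        return 0, "marked out osd.%d" % osd_id


mon = IdempotentMonitor()
first = mon.osd_out(0)   # applied; assume this ack is lost
second = mon.osd_out(0)  # resend after the transient error
print(first, second)
```

Under this reading a resend is harmless, and a caller that cares about the distinction can still inspect the message.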

Actions #3

Updated by Sage Weil over 12 years ago

A new monitor election could do it, or a socket error between the ceph command and the monitor.

Actions #4

Updated by Sage Weil about 12 years ago

Hmm, I wonder if I somehow misdiagnosed this, or inadvertently fixed it: I haven't seen this hang in weeks, and it happened several times at the time.

Actions #5

Updated by Greg Farnum about 12 years ago

Pretty sure you pushed changes the day you filed it (note reference in previous message), although I can't find the exact commit now...unless they're in an unmerged branch?

Actions #6

Updated by Sage Weil about 12 years ago

  • Priority changed from High to Normal

Actions #7

Updated by Sage Weil about 12 years ago

  • Status changed from New to Can't reproduce

We can reopen if this ever pops up again.
