Bug #1977: mon: ceph command hang

closed

Added by Sage Weil over 12 years ago. Updated about 12 years ago.

Status: Can't reproduce
Priority: Normal
Category: Monitor

Description

/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8881

2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia7229998-229
2012-01-24T02:41:19.920 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia7229998-229
2012-01-24T02:41:21.560 INFO:teuthology.task.rados.rados.0.err:0 errors.
2012-01-24T02:41:21.561 INFO:teuthology.task.rados.rados.0.err:
2012-01-24T02:41:26.583 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x1587ad0>
2012-01-24T02:41:26.583 INFO:teuthology.task.thrashosds:joining thrashosds
[hangs]

I killed the ceph process:

ubuntu@sepia74:~$ ps ax|grep out
17240 ?        Ssl    0:00 /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd out 0
18624 pts/0    S+     0:00 grep --color=auto out
ubuntu@sepia74:~$ kill 17240

It was stuck at:

Thread 1 (Thread 0x7f028c572760 (LWP 17240)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004542f2 in Wait (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at ./common/Cond.h:48
#2  do_command (ctx=0xa28f20, cmd=..., bl=..., rbl=...) at tools/common.cc:446
#3  0x00000000004509e9 in main (argc=<value optimized out>, argv=<value optimized out>) at tools/ceph.cc:249
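
The trace shows the main thread parked in an untimed condition wait inside do_command, so it stays there for as long as no reply is delivered. As a rough illustration of that mechanism (not the Ceph code; `PendingCommand` and its methods are hypothetical names), here is a minimal Python sketch in which a wait without a timeout would block forever if the reply never arrives, while a timed wait gives up and reports it:

```python
import threading


class PendingCommand:
    """Hypothetical stand-in for a client-side command awaiting a reply."""

    def __init__(self):
        self._cond = threading.Condition()
        self._reply = None
        self._done = False

    def complete(self, reply):
        # Called by a (hypothetical) messenger thread when a reply arrives.
        with self._cond:
            self._reply = reply
            self._done = True
            self._cond.notify_all()

    def wait(self, timeout=None):
        # timeout=None mirrors an untimed wait: it blocks until complete()
        # runs, which is forever if the reply is never delivered.
        with self._cond:
            finished = self._cond.wait_for(lambda: self._done, timeout=timeout)
            return self._reply if finished else None


cmd = PendingCommand()
reply = cmd.wait(timeout=0.1)  # nothing ever calls complete() here
print("timed out" if reply is None else reply)
```

Whether a timeout is appropriate for the real tool is a separate question; the sketch only shows why the process sits in pthread_cond_wait until something signals it.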

Updated by Sage Weil over 12 years ago

Hrm. I didn't manage to reproduce a hang, but I did reproduce a failure. A transient error let a command succeed but lost the ack, so the client resent it and then got -EEXIST or -EINVAL because the change had already been applied.

So, should 'ceph osd out 0' return success or -EINVAL if osd 0 is already out? Or should the tool user check the error code/message carefully? :/

Updated by Greg Farnum over 12 years ago

The proper behavior is more a question of what the command means, I think. I tend to think of commands as actions rather than as desired end states, which makes me want to say the proper behavior is to return -EINVAL.

But that's awfully inconvenient under "transient errors" like this (what kind of transient error is causing persistent trouble, anyway?), and I see you already pushed a change. The code contains a simple notification text saying "already out", and I can go with that.
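
To make the two readings concrete, here is a small Python sketch of the lost-ack scenario under both semantics. It is a toy model, not the monitor code: `OsdMap`, `osd_out_strict`, `osd_out_idempotent`, and `send_with_lost_ack` are hypothetical names, and only the "already out" wording comes from this thread.

```python
import errno


class OsdMap:
    """Toy model tracking which OSDs are marked out; not the real OSDMap."""

    def __init__(self):
        self.out = set()


def osd_out_strict(osdmap, osd_id):
    # "Action" semantics: marking an already-out OSD out is an error.
    if osd_id in osdmap.out:
        return -errno.EINVAL, f"osd.{osd_id} is already out"
    osdmap.out.add(osd_id)
    return 0, f"marked out osd.{osd_id}"


def osd_out_idempotent(osdmap, osd_id):
    # "Desired state" semantics: already out is success, with a notice.
    if osd_id in osdmap.out:
        return 0, f"osd.{osd_id} is already out"
    osdmap.out.add(osd_id)
    return 0, f"marked out osd.{osd_id}"


def send_with_lost_ack(handler, osdmap, osd_id):
    # The first attempt is applied but its reply is dropped, simulating a
    # lost ack; the client then resends and only sees the second reply.
    handler(osdmap, osd_id)
    return handler(osdmap, osd_id)


print(send_with_lost_ack(osd_out_strict, OsdMap(), 0))
print(send_with_lost_ack(osd_out_idempotent, OsdMap(), 0))
```

With the strict handler the resend surfaces as an error even though the OSD ended up out; with the idempotent handler the caller gets success plus the "already out" notice, at the cost of no longer being able to tell from the return code whether this particular request changed anything.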

Updated by Sage Weil over 12 years ago

A new monitor election could do it, or a socket error between the ceph command and the monitor.

Updated by Sage Weil about 12 years ago

Hmm, I wonder if I somehow misdiagnosed this, or inadvertently fixed it: I haven't seen this hang in weeks, and it happened several times back then.

Updated by Greg Farnum about 12 years ago

Pretty sure you pushed changes the day you filed it (note the reference in the previous message), although I can't find the exact commit now... unless they're in an unmerged branch?
