Bug #504

closed

hang when using radostool

Added by Colin McCabe over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

I was adding some objects using radostool, when I got an unexplained hang. It looked like this:

gdb -p 19724

(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f00037e0583 in ?? () from /home/cmccabe/src/ceph/src/.libs/librados.so.1
#2 0x00007f00037f3742 in ?? () from /home/cmccabe/src/ceph/src/.libs/librados.so.1
#3 0x00007f00037c910b in RadosClient::shutdown (this=0x1a18420) at librados.cc:394
#4 0x00007f00037c9264 in librados::Rados::shutdown (this=0x7fff1aac0590) at librados.cc:1288
#5 0x0000000000414b9e in main (argc=6, argv=0x7fff1aac07a8) at rados.cc:467

I'm not sure whether this is a race inside radosclient/librados itself, or a server failing to respond.
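For illustration only, one generic way a thread can end up parked in a condition wait during shutdown is a lost wakeup: the notification fires before the waiter starts waiting, and the waiter does not re-check a predicate. The sketch below is not librados code and is not a diagnosis of this hang; the class and method names (`ShutdownGate`, `wait_unguarded`, `wait_guarded`) are hypothetical, and timeouts are used only so the demonstration terminates instead of hanging.

```python
import threading


class ShutdownGate:
    """Hypothetical stand-in for a 'wait until work is finished' gate."""

    def __init__(self):
        self.cond = threading.Condition()
        self.done = False

    def finish(self):
        # Signal completion. Any thread that starts waiting after this
        # call never sees this notification.
        with self.cond:
            self.done = True
            self.cond.notify_all()

    def wait_unguarded(self, timeout):
        # Racy pattern: waits without checking `done` first, so a
        # notification sent earlier is lost. Without the timeout this
        # would block forever. Returns False if the wait timed out.
        with self.cond:
            return self.cond.wait(timeout)

    def wait_guarded(self, timeout):
        # Safer pattern: the predicate is checked before and after each
        # wait, so an earlier finish() is still observed.
        with self.cond:
            return self.cond.wait_for(lambda: self.done, timeout)


if __name__ == "__main__":
    gate = ShutdownGate()
    gate.finish()  # completion happens before anyone waits
    print("unguarded wait woke up:", gate.wait_unguarded(0.05))
    print("guarded wait woke up:", gate.wait_guarded(0.05))
```

Because the outcome of such a race depends on timing, extra logging or a debugger can be enough to make it disappear.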

I then ran another instance of radostool and got a different hang.

gdb -p 20614

(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f78761ce583 in ?? () from /home/cmccabe/src/ceph/src/.libs/librados.so.1
#2 0x00007f78761b3c35 in RadosClient::write_full (this=0xb18420, pool=@0x7f786c0008b0, oid=@0x7fffb9ce0c40, bl=@0x7fffb9ce1140) at librados.cc:897
#3 0x00007f78761b3e1d in librados::Rados::write_full (this=0x7fffb9ce1310, pool=0x7f786c0008b0, o=@0x7fffb9ce1270, bl=@0x7fffb9ce1140) at librados.cc:1409
#4 0x00000000004134fe in main (argc=6, argv=0x7fffb9ce1528) at rados.cc:285

So it appears to have been waiting for a reply from the server.

Then I modified ceph.conf to increase debugging. Specifically, I set:
debug ms = 20
debug objecter = 20
debug monc = 20

When I re-ran radostool with these settings, everything worked fine. Subsequent attempts to reproduce the first two hangs failed.
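For reference, a sketch of how these three settings might be laid out in ceph.conf. Placing them under `[global]` is an assumption; the report does not say which section was edited.

```ini
[global]
        ; messenger, objecter and monitor-client debug levels from this report
        debug ms = 20
        debug objecter = 20
        debug monc = 20
```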

Configuration: vstart.sh with two OSDs. Standard ceph.conf.

#1

Updated by Colin McCabe over 13 years ago

Perhaps 197928c26cec52e0f3f91e930988b1e5767e355b will resolve the radostool shutdown race condition.

The second backtrace seems to be an unrelated problem.

#2

Updated by Sage Weil over 13 years ago

  • Status changed from New to Resolved

The second issue looks like a transient OSD issue.

Closing this for now, but we should keep an eye out for it happening again.
