Bug #504: hang when using radostool
Status: Closed
Description
I was adding some objects using radostool, when I got an unexplained hang. It looked like this:
gdb -p 19724
(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f00037e0583 in ?? () from /home/cmccabe/src/ceph/src/.libs/librados.so.1
#2 0x00007f00037f3742 in ?? () from /home/cmccabe/src/ceph/src/.libs/librados.so.1
#3 0x00007f00037c910b in RadosClient::shutdown (this=0x1a18420) at librados.cc:394
#4 0x00007f00037c9264 in librados::Rados::shutdown (this=0x7fff1aac0590) at librados.cc:1288
#5 0x0000000000414b9e in main (argc=6, argv=0x7fff1aac07a8) at rados.cc:467
I'm not sure whether this is a race inside RadosClient/librados itself or a server failing to respond.
I then ran another instance of radostool and got a different hang.
gdb -p 20614
(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f78761ce583 in ?? () from /home/cmccabe/src/ceph/src/.libs/librados.so.1
#2 0x00007f78761b3c35 in RadosClient::write_full (this=0xb18420, pool=@0x7f786c0008b0, oid=@0x7fffb9ce0c40, bl=@0x7fffb9ce1140) at librados.cc:897
#3 0x00007f78761b3e1d in librados::Rados::write_full (this=0x7fffb9ce1310, pool=0x7f786c0008b0, o=@0x7fffb9ce1270, bl=@0x7fffb9ce1140) at librados.cc:1409
#4 0x00000000004134fe in main (argc=6, argv=0x7fffb9ce1528) at rados.cc:285
So it appears to have been waiting for a reply from the server.
Then I modified ceph.conf to increase debugging. Specifically, I set:
debug ms = 20
debug objecter = 20
debug monc = 20
When I re-ran radostool with these settings, everything worked fine. Subsequent attempts to reproduce the first two hangs failed.
Configuration: vstart.sh with two OSDs. Standard ceph.conf.
Updated by Colin McCabe over 13 years ago
Perhaps 197928c26cec52e0f3f91e930988b1e5767e355b will resolve the radostool shutdown race condition.
The second backtrace seems to be an unrelated problem.
Updated by Sage Weil over 13 years ago
- Status changed from New to Resolved
The second issue looks like a transient osd issue.
Closing this for now, but we should keep an eye out for it happening again.