Subtask #1231
Feature #801 (closed): librados: allow access to multiple clusters
Subtask #815: Remove globals & partition g_conf
NUM_THREADS=3 testrados segfaults
Description
NUM_THREADS=3 ./testrados segfaults most of the time.
It seems to segfault more when the threads are actually interleaved, rather than running one after the other as they sometimes do. (I get a rough feeling for this by examining the order of the printfs.)
#0  0x00007f009437d36a in OSDMap::object_locator_to_pg (this=0x7f008c001c08, oid=..., loc=...) at ./osd/OSDMap.h:778
#1  0x00007f009436da74 in Objecter::recalc_op_target (this=0x7f008c006590, op=0x7f00840020a0) at osdc/Objecter.cc:554
#2  0x00007f009436aa5a in Objecter::handle_osd_map (this=0x7f008c006590, m=0x117f760) at osdc/Objecter.cc:240
#3  0x00007f0094333a31 in librados::RadosClient::_dispatch (this=0x7f008c001be0, m=0x117f760) at librados.cc:958
#4  0x00007f0094333812 in librados::RadosClient::ms_dispatch (this=0x7f008c001be0, m=0x117f760) at librados.cc:921
#5  0x00007f0094462497 in Messenger::ms_deliver_dispatch (this=0x7f008c005850, m=0x117f760) at msg/Messenger.h:101
#6  0x00007f009444c16c in SimpleMessenger::dispatch_entry (this=0x7f008c005850) at msg/SimpleMessenger.cc:356
#7  0x00007f009434552e in SimpleMessenger::DispatchThread::entry (this=0x7f008c005cd8) at msg/SimpleMessenger.h:545
#8  0x00007f0094489af9 in Thread::_entry_func (arg=0x7f008c005cd8) at common/Thread.cc:45
#9  0x00007f0093ef78ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#10 0x00007f0093c5f02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()
It seems that there is some kind of race in object_locator_to_pg.
(gdb) list
773       const pg_pool_t *pool = get_pg_pool(loc.get_pool());
774       ps_t ps;
775       if (loc.key.length())
776         ps = ceph_str_hash(pool->v.object_hash, loc.key.c_str(), loc.key.length());
777       else
778         ps = ceph_str_hash(pool->v.object_hash, oid.name.c_str(), oid.name.length());
779
780       // mix in preferred osd, so we don't get the same peers for
781       // all of the placement pgs (e.g. 0.0p*)
782       if (loc.get_preferred() >= 0)
(gdb) print pool
$1 = (const pg_pool_t *) 0x0
What are the worker threads doing? Well one is doing a rados_write:
Thread 10 (Thread 18819):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f0094342b71 in Cond::Wait (this=0x7f00925219e0, mutex=...) at ./common/Cond.h:46
#2  0x00007f0094335c53 in librados::RadosClient::write (this=0x7f008c001be0, io=..., oid=..., bl=..., len=26, off=0) at librados.cc:1412
#3  0x00007f009433e2d3 in rados_write (io=0x7f0084001f80, o=0x403207 "foo_object", buf=0x7f0092521cd0 "Fri Jun 24 16:40:57 2011\n", len=26, off=0) at librados.cc:3344
#4  0x00000000004022e8 in testrados (tnum=1) at testrados.c:224
#5  0x0000000000402a46 in do_testrados (v=0x1) at testrados.c:319
#6  0x00007f0093ef78ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#7  0x00007f0093c5f02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#8  0x0000000000000000 in ?? ()
It looks like the other test worker threads have actually finished.
My first guess is that the rados_pool_create and the rados_pool_delete are racing with each other. I think maybe if you have an io context open referencing a pool, and that pool gets deleted, there is a problem. Perhaps I will write a test case just for that scenario.
Updated by Colin McCabe almost 13 years ago
Wrote a small test, 6a3626d373f42cb1750edbdecd050a3cf0606dd7, that also seems to be exhibiting odd behavior. I think we've had this screwed up the whole time. This needs to be fixed because otherwise it will show up with bucket creation/destruction in DHO...
Updated by Colin McCabe almost 13 years ago
- Status changed from New to Rejected
Moving this into issue #1261