Feature #801: librados: allow access to multiple clusters
Subtask #815: Remove globals & partition g_conf

Subtask #1231 (closed): NUM_THREADS=3 testrados segfaults

Added by Colin McCabe almost 13 years ago. Updated almost 13 years ago.

Status: Rejected
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Reviewed: -
Affected Versions: -
Pull request ID: -

Description

NUM_THREADS=3 ./testrados segfaults most of the time.

It seems to segfault more when the threads are actually interleaved, rather than running one after the other as they sometimes do. (I get a rough feeling for this by examining the order of the printfs.)

#0  0x00007f009437d36a in OSDMap::object_locator_to_pg (this=0x7f008c001c08, oid=..., loc=...) at ./osd/OSDMap.h:778
#1  0x00007f009436da74 in Objecter::recalc_op_target (this=0x7f008c006590, op=0x7f00840020a0) at osdc/Objecter.cc:554
#2  0x00007f009436aa5a in Objecter::handle_osd_map (this=0x7f008c006590, m=0x117f760) at osdc/Objecter.cc:240
#3  0x00007f0094333a31 in librados::RadosClient::_dispatch (this=0x7f008c001be0, m=0x117f760) at librados.cc:958
#4  0x00007f0094333812 in librados::RadosClient::ms_dispatch (this=0x7f008c001be0, m=0x117f760) at librados.cc:921
#5  0x00007f0094462497 in Messenger::ms_deliver_dispatch (this=0x7f008c005850, m=0x117f760) at msg/Messenger.h:101
#6  0x00007f009444c16c in SimpleMessenger::dispatch_entry (this=0x7f008c005850) at msg/SimpleMessenger.cc:356
#7  0x00007f009434552e in SimpleMessenger::DispatchThread::entry (this=0x7f008c005cd8) at msg/SimpleMessenger.h:545
#8  0x00007f0094489af9 in Thread::_entry_func (arg=0x7f008c005cd8) at common/Thread.cc:45
#9  0x00007f0093ef78ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#10 0x00007f0093c5f02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()

It seems that there is some kind of race in object_locator_to_pg.

(gdb) list
773         const pg_pool_t *pool = get_pg_pool(loc.get_pool());
774         ps_t ps;
775         if (loc.key.length())
776           ps = ceph_str_hash(pool->v.object_hash, loc.key.c_str(), loc.key.length());
777         else
778           ps = ceph_str_hash(pool->v.object_hash, oid.name.c_str(), oid.name.length());
779
780         // mix in preferred osd, so we don't get the same peers for
781         // all of the placement pgs (e.g. 0.0p*)
782         if (loc.get_preferred() >= 0)

(gdb) print pool
$1 = (const pg_pool_t *) 0x0

What are the worker threads doing? Well, one is doing a rados_write:


Thread 10 (Thread 18819):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f0094342b71 in Cond::Wait (this=0x7f00925219e0, mutex=...) at ./common/Cond.h:46
#2  0x00007f0094335c53 in librados::RadosClient::write (this=0x7f008c001be0, io=..., oid=..., bl=..., len=26, off=0) at librados.cc:1412
#3  0x00007f009433e2d3 in rados_write (io=0x7f0084001f80, o=0x403207 "foo_object", buf=0x7f0092521cd0 "Fri Jun 24 16:40:57 2011\n", len=26, off=0) at librados.cc:3344
#4  0x00000000004022e8 in testrados (tnum=1) at testrados.c:224
#5  0x0000000000402a46 in do_testrados (v=0x1) at testrados.c:319
#6  0x00007f0093ef78ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#7  0x00007f0093c5f02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#8  0x0000000000000000 in ?? ()

It looks like the other test worker threads have actually finished.

My first guess is that rados_pool_create and rados_pool_delete are racing with each other. I suspect that if you have an io context open that references a pool, and that pool gets deleted, there is a problem. Perhaps I will write a test case just for that scenario.
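For reference, a minimal sketch of that scenario against the librados C API (the pool and object names are placeholders and error handling is trimmed; this illustrates the suspected sequence, not an existing test):

/* Sketch: open an io context on a pool, delete the pool out from under it,
 * then write through the now-stale io context. Names are placeholders. */
#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    const char *pool = "test_stale_ioctx_pool";   /* hypothetical name */
    const char buf[] = "hello";

    if (rados_create(&cluster, NULL) < 0)
        return 1;
    rados_conf_read_file(cluster, NULL);
    if (rados_connect(cluster) < 0)
        return 1;

    rados_pool_create(cluster, pool);
    rados_ioctx_create(cluster, pool, &io);

    /* Delete the pool while the io context is still open... */
    rados_pool_delete(cluster, pool);

    /* ...then write through the stale io context, which is where the
     * Objecter ends up looking up a pool that no longer exists. */
    int ret = rados_write(io, "foo_object", buf, sizeof(buf), 0);
    printf("rados_write after pool delete returned %d\n", ret);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}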

Actions #1

Updated by Colin McCabe almost 13 years ago

Wrote a small test, 6a3626d373f42cb1750edbdecd050a3cf0606dd7, that also seems to be exhibiting odd behavior. I think we've had this screwed up the whole time. This needs to be fixed because otherwise it will show up with bucket creation/destruction in DHO...
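The committed test itself is not reproduced in this ticket. Purely as an illustration of the create/delete race being discussed, a threaded exercise of the same code paths might look like the following (thread layout, pool name, and loop counts are placeholders of mine, not the contents of that commit):

/* Illustration only: hammer pool create/delete from one thread while another
 * thread writes through io contexts on the same pool. All names and loop
 * counts are placeholders. */
#include <rados/librados.h>
#include <pthread.h>

static rados_t cluster;
static const char *pool = "test_create_delete_pool";   /* hypothetical name */

static void *create_delete_thread(void *arg)
{
    for (int i = 0; i < 100; i++) {
        rados_pool_create(cluster, pool);
        rados_pool_delete(cluster, pool);
    }
    return NULL;
}

static void *write_thread(void *arg)
{
    for (int i = 0; i < 100; i++) {
        rados_ioctx_t io;
        if (rados_ioctx_create(cluster, pool, &io) < 0)
            continue;                  /* the pool may be gone right now */
        rados_write(io, "foo_object", "hi", 2, 0);
        rados_ioctx_destroy(io);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    if (rados_create(&cluster, NULL) < 0)
        return 1;
    rados_conf_read_file(cluster, NULL);
    if (rados_connect(cluster) < 0)
        return 1;

    pthread_create(&t1, NULL, create_delete_thread, NULL);
    pthread_create(&t2, NULL, write_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    rados_shutdown(cluster);
    return 0;
}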

Actions #2

Updated by Colin McCabe almost 13 years ago

  • Status changed from New to Rejected

Moving this into issue #1261
