https://tracker.ceph.com/
2013-04-26T15:44:35Z
Ceph
rbd - Bug #4827: librbd: use after free of ceph context or something in it
https://tracker.ceph.com/issues/4827?journal_id=21073
2013-04-26T15:44:35Z
Josh Durgin
<ul><li><strong>Subject</strong> changed from <i>librbd: mutex held during destruction</i> to <i>librbd: use after free of ceph context or something in it</i></li></ul><p>Segfaults with different backtraces occurred with and without caching enabled. Unfortunately the first core file is corrupt, but the latter two show different signs of corruption (accessing an invalid ceph::log::SubsystemMap in one and accessing PerfCounters with an invalid CephContext in another).</p>
rbd - Bug #4827: librbd: use after free of ceph context or something in it
https://tracker.ceph.com/issues/4827?journal_id=21122
2013-04-29T10:18:49Z
Josh Durgin
<ul></ul><p>It didn't reproduce with log_max_recent = 1, but without that setting it happened after just 3 tries.</p>
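<p>For reference, a minimal ceph.conf sketch of the kind of settings implied above (the section and exact values are an assumption, based on the option just named and the debug subsystems visible in the log below):</p>
<pre>
[client]
    # keep only one entry in the in-memory log ring buffer
    log_max_recent = 1
    # verbose logging for the subsystems involved in this crash
    debug_rbd = 20
    debug_objectcacher = 20
</pre>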
<p>Unfortunately, since the corruption is in CephContext, there's no crash dump at the end of the log.</p>
<p>It looks like there may be reads still being processed by the ObjectCacher after the image is closed:</p>
<pre>
2013-04-29 08:29:06.611159 7fcbd5d2f780 10 objectcacher flush_set on 0x13e9b00 dne
2013-04-29 08:29:06.611161 7fcbd5d2f780 20 librbd::ImageCtx: finished flushing cache
2013-04-29 08:29:06.611162 7fcbd5d2f780 10 objectcacher release_set on 0x13e9b00 dne
2013-04-29 08:29:06.611179 7fcb56ffc700 10 objectcacher flusher finish
2013-04-29 08:29:06.611224 7fcbd5d2f780 20 librbd: close_image 0x1417610
2013-04-29 08:29:06.611563 7fcbcaefc700 7 objectcacher bh_read_finish rbd_data.101649202a25.000000000000023c/e padding 0~32768 with 32768 bytes of zeroes
2013-04-29 08:29:06.611572 7fcbcaefc700 20 objectcacher finishing waiters
2013-04-29 08:29:06.611578 7fcbcaefc700 20 librbdwriteback: aio_cb finished
2013-04-29 08:29:06.611582 7fcbcaefc700 20 librbdwriteback: aio_cb completing
2013-04-29 08:29:06.611584 7fcbcaefc700 7 objectcacher bh_read_finish rbd_data.101649202a25.000000000000023d/e 360448~32768 (bl is 0) returned -2
2013-04-29 08:29:06.611611 7fcbcaefc700 7 objectcacher bh_read_finish rbd_data.101649202a25.000000000000023d/e padding 360448~32768 with 32768 bytes of zeroes
2013-04-29 08:29:06.611617 7fcbcaefc700 20 objectcacher finishing waiters
2013-04-29 08:29:06.611620 7fcbcaefc700 20 librbdwriteback: aio_cb finished
2013-04-29 08:29:06.611623 7fcbcaefc700 20 librbdwriteback: aio_cb completing
2013-04-29 08:29:06.611624 7fcbcaefc700 7 objectcacher bh_read_finish rbd_data.101649202a25.000000000000023c/e 360448~32768 (bl is 0) returned -2
2013-04-29 08:29:06.611631 7fcbcaefc700 7 objectcacher bh_read_finish rbd_data.101649202a25.000000000000023c/e padding 360448~32768 with 32768 bytes of zeroes
2013-04-29 08:29:06.611635 7fcbcaefc700 20 objectcacher finishing waiters
2013-04-29 08:29:06.611643 7fcbcaefc700 20 librbdwriteback: aio_cb finished
2013-04-29 08:29:06.611646 7fcbcaefc700 20 librbdwriteback: aio_cb completing
2013-04-29 08:29:06.611648 7fcbd5d2f780 10 objectcacher release_set 0x13fdad0
2013-04-29 08:29:06.611672 7fcbd5d2f780 10 objectcacher release trimming object[rbd_data.101649202a25.0000000000000025/e oset 0x13fdad0 wr 0/0]
2013-04-29 08:29:06.611675 7fcbd5d2f780 10 objectcacher close_object object[rbd_data.101649202a25.0000000000000025/e oset 0x13fdad0 wr 0/0]
2013-04-29 08:29:06.611728 7fcbd5d2f780 10 objectcacher release trimming object[rbd_data.101649202a25.0000000000000041/e oset 0x13fdad0 wr 0/0]
2013-04-29 08:29:06.611732 7fcbd5d2f780 10 objectcacher close_object object[rbd_data.101649202a25.0000000000000041/e oset 0x13fdad0 wr 0/0]
... a lot more close_object() calls until the end of the log
</pre>
<p>The first case of the corruption we saw was without caching enabled, so it may be a race at a higher level.</p>
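<p>To make the suspected race concrete, here is a minimal, self-contained C++ sketch of the lifetime problem described above (a simplification, not the actual librbd/ObjectCacher code): an asynchronous read completion keeps a raw pointer to cache state, and the close path is free to destroy that state before the completion runs.</p>
<pre>
// Sketch only: stands in for ObjectCacher::bh_read_finish being invoked from
// the librados completion thread after close_image() has torn the cache down.
#include <chrono>
#include <iostream>
#include <thread>

struct Cache {
  const char *name = "objectcacher";
  void read_finish(int r) {
    // In the real crash this dereferences freed CephContext/SubsystemMap state.
    std::cout << name << " bh_read_finish r=" << r << std::endl;
  }
};

int main() {
  Cache *cache = new Cache;

  // Completion delivered on another thread, as librados does via its Finisher.
  std::thread finisher([cache] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    cache->read_finish(-2);   // may run after the delete below: use after free
  });

  delete cache;               // close_image() analogue: no wait for in-flight reads
  finisher.join();
  return 0;
}
</pre>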
rbd - Bug #4827: librbd: use after free of ceph context or something in it
https://tracker.ceph.com/issues/4827?journal_id=21148
2013-04-29T15:29:55Z
Josh Durgin
<ul></ul><p>The wip-rbd-close-image branch contains a potential fix. Running the test in a loop to see if it'll happen again.</p>
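<p>One plausible shape for such a fix (an assumption about what wip-rbd-close-image does, not a description of its actual contents) is to have the close path block until no in-flight completions can still reach the cache:</p>
<pre>
// Hypothetical sketch: the close path drains outstanding reads before the
// object is destroyed, so late completions never touch freed memory.
#include <condition_variable>
#include <mutex>

struct Cache {
  std::mutex lock;
  std::condition_variable cond;
  int reads_in_flight = 0;

  // Called from each async read completion.
  void read_finished() {
    std::lock_guard<std::mutex> l(lock);
    if (--reads_in_flight == 0)
      cond.notify_all();
  }

  // Called from the close path before deleting the cache.
  void wait_for_reads() {
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return reads_in_flight == 0; });
  }
};
</pre>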
rbd - Bug #4827: librbd: use after free of ceph context or something in it
https://tracker.ceph.com/issues/4827?journal_id=21153
2013-04-29T16:44:23Z
Sage Weil
sage@newdream.net
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>7</i></li></ul>
rbd - Bug #4827: librbd: use after free of ceph context or something in it
https://tracker.ceph.com/issues/4827?journal_id=21159
2013-04-29T17:01:06Z
Josh Durgin
<ul><li><strong>Status</strong> changed from <i>7</i> to <i>In Progress</i></li></ul><p>It failed on the 8th try in a similar way, although without logs this time.</p>
<p>The ObjectCacher again looks like it has already been destroyed, since its name is no longer valid.</p>
<pre>
INFO:teuthology.orchestra.run.out:1469 clone 7 order 20 su 32768 sc 11
INFO:teuthology.orchestra.run.out:checking clone #5, image image_client.0-clone5 against file /home/ubuntu/cephtest/archive/fsx-image_client.0-parent6
(gdb) bt
#0 ceph::log::SubsystemMap::should_gather (this=0x311cc90, sub=<optimized out>, level=7) at ./log/SubsystemMap.h:64
#1 0x00007f83352a18e4 in ObjectCacher::bh_read_finish (this=0x12fce30, poolid=3, oid=..., start=140288, length=1024, bl=..., r=-2, trust_enoent=true)
at osdc/ObjectCacher.cc:617
#2 0x00007f83352b008f in ObjectCacher::C_ReadFinish::finish (this=0x7f819b12bd60, r=-2) at osdc/ObjectCacher.h:475
#3 0x00007f833526680a in Context::complete (this=0x7f819b12bd60, r=<optimized out>) at ./include/Context.h:41
#4 0x00007f8335297105 in librbd::C_Request::finish (this=0x7f819b12be20, r=-2) at librbd/LibrbdWriteback.cc:55
#5 0x00007f8335295b24 in librbd::context_cb (c=<optimized out>, arg=0x7f819b12be20) at librbd/LibrbdWriteback.cc:35
#6 0x00007f83345ffe5d in librados::C_AioComplete::finish (this=0x7f8320c73b30, r=<optimized out>) at ./librados/AioCompletionImpl.h:171
#7 0x00007f833466d8e0 in Finisher::finisher_thread_entry (this=0x123f168) at common/Finisher.cc:56
#8 0x00007f8333bf7e9a in start_thread (arg=0x7f832a7fc700) at pthread_create.c:308
#9 0x00007f8333f00cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()
(gdb) up
#1 0x00007f83352a18e4 in ObjectCacher::bh_read_finish (this=0x12fce30, poolid=3, oid=..., start=140288, length=1024, bl=..., r=-2, trust_enoent=true)
at osdc/ObjectCacher.cc:617
617 osdc/ObjectCacher.cc: No such file or directory.
(gdb) p *this
$1 = {perfcounter = 0x12fd260, cct = 0x7f83341c5778, writeback_handler = @0x12423b0, name = {static npos = <optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x1313b78 "P;1\001"}}, lock = @0x13135f8,
max_dirty = 25165824, target_dirty = 16777216, max_size = 33554432, max_objects = 266, max_dirty_age = {tv = {tv_sec = 1, tv_nsec = 0}}, block_writes_upfront = false,
flush_set_callback = 0, flush_set_callback_arg = 0x0,
objects = {<std::_Vector_base<__gnu_cxx::hash_map<sobject_t, ObjectCacher::Object*, __gnu_cxx::hash<sobject_t>, std::equal_to<sobject_t>, std::allocator<ObjectCacher::Object*> >, std::allocator<__gnu_cxx::hash_map<sobject_t, ObjectCacher::Object*, __gnu_cxx::hash<sobject_t>, std::equal_to<sobject_t>, std::allocator<ObjectCacher::Object*> > > >> = {
_M_impl = {<std::allocator<__gnu_cxx::hash_map<sobject_t, ObjectCacher::Object*, __gnu_cxx::hash<sobject_t>, std::equal_to<sobject_t>, std::allocator<ObjectCacher::Object*> > >> = {<__gnu_cxx::new_allocator<__gnu_cxx::hash_map<sobject_t, ObjectCacher::Object*, __gnu_cxx::hash<sobject_t>, std::equal_to<sobject_t>, std::allocator<ObjectCacher::Object*> > >> = {<No data fields>}, <No data fields>}, _M_start = 0x7f82b004eb00, _M_finish = 0x7f82b004eba0,
_M_end_of_storage = 0x7f82b004eba0}}, <No data fields>}, dirty_bh = {_M_t = {
_M_impl = {<std::allocator<std::_Rb_tree_node<ObjectCacher::BufferHead*> >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<ObjectCacher::BufferHead*> >> = {<No data fields>}, <No data fields>},
_M_key_compare = {<std::binary_function<ObjectCacher::BufferHead*, ObjectCacher::BufferHead*, bool>> = {<No data fields>}, <No data fields>}, _M_header = {
_M_color = std::_S_red, _M_parent = 0x0, _M_left = 0x12fceb8, _M_right = 0x12fceb8}, _M_node_count = 0}}}, bh_lru_dirty = {lru_top = {head = 0x0, tail = 0x0,
len = 0}, lru_bot = {head = 0x0, tail = 0x0, len = 0}, lru_pintail = {head = 0x0, tail = 0x0, len = 0}, lru_num = 0, lru_num_pinned = 0, lru_max = 0,
lru_midpoint = 0.59999999999999998}, bh_lru_rest = {lru_top = {head = 0x0, tail = 0x0, len = 0}, lru_bot = {head = 0x0, tail = 0x0, len = 0}, lru_pintail = {
head = 0x0, tail = 0x0, len = 0}, lru_num = 0, lru_num_pinned = 0, lru_max = 0, lru_midpoint = 0.59999999999999998}, ob_lru = {lru_top = {head = 0x0, tail = 0x0,
len = 0}, lru_bot = {head = 0x0, tail = 0x0, len = 0}, lru_pintail = {head = 0x0, tail = 0x0, len = 0}, lru_num = 0, lru_num_pinned = 0, lru_max = 0,
lru_midpoint = 0.59999999999999998}, flusher_cond = {_vptr.Cond = 0x7f83354d9050, _c = {__data = {__lock = 1, __futex = 134, __total_seq = 18446744073709551615,
__wakeup_seq = 67, __woken_seq = 67, __mutex = 0x1313608, __nwaiters = 0, __broadcast_seq = 1},
__size = "\001\000\000\000\206\000\000\000\377\377\377\377\377\377\377\377C\000\000\000\000\000\000\000C\000\000\000\000\000\000\000\b61\001\000\000\000\000\000\000\000\000\001\000\000", __align = 575525617665}, waiter_mutex = 0x13135f8}, flusher_stop = true, flusher_thread = {<Thread> = {_vptr.Thread = 0x7f8334a9cdd0,
thread_id = 0}, oc = 0x12fce30}, finisher = {cct = 0x123a940, finisher_lock = {name = 0x7f83352c50b7 "Finisher::finisher_lock", id = -1, recursive = false,
lockdep = true, backtrace = false, _m = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __list = {__prev = 0x0,
__next = 0x0}}, __size = '\000' <repeats 16 times>"\377, \377\377\377", '\000' <repeats 19 times>, __align = 0}, nlock = 0, locked_by = 0, cct = 0x0,
logger = 0x0}, finisher_cond = {_vptr.Cond = 0x7f83354d9050, _c = {__data = {__lock = 1, __futex = 2, __total_seq = 18446744073709551615, __wakeup_seq = 1,
__woken_seq = 1, __mutex = 0x12fd078, __nwaiters = 0, __broadcast_seq = 1},
__size = "\001\000\000\000\002\000\000\000\377\377\377\377\377\377\377\377\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000x\320/\001\000\000\000\000\000\000\000\000\001\000\000", __align = 8589934593}, waiter_mutex = 0x12fd068}, finisher_empty_cond = {_vptr.Cond = 0x7f83354d9050, _c = {__data = {__lock = 1,
__futex = 0, __total_seq = 18446744073709551615, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
__size = "\001\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377", '\000' <repeats 31 times>, __align = 1}, waiter_mutex = 0x0}, finisher_stop = true,
finisher_running = false, finisher_queue = {<std::_Vector_base<Context*, std::allocator<Context*> >> = {
_M_impl = {<std::allocator<Context*>> = {<__gnu_cxx::new_allocator<Context*>> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0,
_M_end_of_storage = 0x0}}, <No data fields>},
finisher_queue_rval = {<std::_List_base<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > >> = {
_M_impl = {<std::allocator<std::_List_node<std::pair<Context*, int> > >> = {<__gnu_cxx::new_allocator<std::_List_node<std::pair<Context*, int> > >> = {<No data fields>}, <No data fields>}, _M_node = {_M_next = 0x12fd160, _M_prev = 0x12fd160}}}, <No data fields>}, logger = 0x0, finisher_thread = {<Thread> = {
_vptr.Thread = 0x7f8334a9cdd0, thread_id = 0}, fin = 0x12fd060}}, stat_cond = {_vptr.Cond = 0x7f83354d9050, _c = {__data = {__lock = 1, __futex = 0,
__total_seq = 18446744073709551615, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
__size = "\001\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377", '\000' <repeats 31 times>, __align = 1}, waiter_mutex = 0x0}, stat_clean = 0,
stat_zero = 0, stat_dirty = 0, stat_rx = 0, stat_tx = 0, stat_missing = 0, stat_error = 0, stat_dirty_waiting = 0}
(gdb) p this->lock
$2 = (Mutex &) @0x13135f8: {name = 0x7f83352c2761 "librbd::ImageCtx::cache_lock", id = -1, recursive = false, lockdep = true, backtrace = false, _m = {__data = {
__lock = 1, __count = 0, __owner = 649, __nusers = 1, __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\001\000\000\000\000\000\000\000\211\002\000\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 1}, nlock = 1, locked_by = 140201330460416,
cct = 0x0, logger = 0x0}
(gdb) p this->name
$3 = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x1313b78 "P;1\001"}}
</pre>
rbd - Bug #4827: librbd: use after free of ceph context or something in it
https://tracker.ceph.com/issues/4827?journal_id=21212
2013-04-30T11:22:33Z
Josh Durgin
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>7</i></li></ul><p>The fix might work after all. The test had still been running against the <i>next</i> branch because I had specified the branch in the ceph task instead of the install task. 6 passes so far against the correct branch...</p>
rbd - Bug #4827: librbd: use after free of ceph context or something in it
https://tracker.ceph.com/issues/4827?journal_id=21219
2013-04-30T13:48:15Z
Sage Weil
sage@newdream.net
<ul><li><strong>Status</strong> changed from <i>7</i> to <i>Resolved</i></li></ul><p>commit:c2bcc2a60c2c1f66c757c01ed6bcc6778821f81d</p>