Bug #3611
rbd.py: segfault with many snapshots
Status: Closed
Description
From the nightly Python API tests, test_many_snaps failed with a segfault in all runs.
Logs are in:
ubuntu@teuthology:/a/teuthology-2012-12-11_19:00:03-regression-next-testing-basic/{11886,11879,11872,11869}
Updated by Josh Durgin over 11 years ago
Without lockdep, I could not reproduce a crash.
Running with lockdep enabled results in this backtrace:
common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f9226f70710 time 2012-12-12 14:47:24.851828
common/lockdep.cc: 181: FAILED assert(0)
 ceph version 0.55-232-g54618af (54618afab2d1634e1fac218a021a07c3905e5c20)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f922c228459]
 2: (lockdep_will_lock(char const*, int)+0x435) [0x7f922c3fad48]
 3: (Mutex::_will_lock()+0x24) [0x7f922c2126ae]
 4: (Mutex::Lock(bool)+0x41) [0x7f922c2124c5]
 5: (librbd::AioCompletion::get_return_value()+0x1d) [0x7f92264c51db]
 6: (librbd::rbd_ctx_cb(void*, void*)+0x3b) [0x7f92264f9665]
 7: (librbd::AioCompletion::complete()+0xd6) [0x7f92264c8246]
 8: (librbd::AioCompletion::complete_request(CephContext*, long)+0x2ad) [0x7f92264c7777]
 9: (librbd::C_AioWrite::finish(int)+0x30) [0x7f92264fe834]
 10: (Context::complete(int)+0x2b) [0x7f92264c8045]
 11: (librbd::AioRequest::complete(int)+0x6f) [0x7f92264c80d5]
 12: (librbd::rados_req_cb(void*, void*)+0x34) [0x7f92264fb075]
 13: (librados::C_AioSafe::finish(int)+0x4d) [0x7f922c1ab697]
 14: (Finisher::finisher_thread_entry()+0x389) [0x7f922c227507]
 15: (Finisher::FinisherThread::entry()+0x1c) [0x7f922c191f74]
 16: (Thread::_entry_func(void*)+0x23) [0x7f922c3e0759]
 17: (()+0x68ba) [0x7f922ee5b8ba]
 18: (clone()+0x6d) [0x7f922dd2d02d]
This lock is a recursive lock, for which lockdep generates a false positive.
Updated by Josh Durgin over 11 years ago
- Priority changed from Urgent to Normal
Downgrading priority since this isn't an actual bug.
Updated by Josh Durgin over 11 years ago
Finally got a backtrace. It seems something is overwriting a Mutex::Locker on the stack:
#0  0x00007f3bec3f46ed in Mutex::_pre_unlock (this=0x180) at ./common/Mutex.h:96
#1  0x00007f3bec3f47ba in Mutex::Unlock (this=0x180) at ./common/Mutex.h:105
#2  0x00007f3bec3fb429 in Mutex::Locker::~Locker (this=0x7fff3b6d9310, __in_chrg=<optimized out>) at ./common/Mutex.h:123
#3  0x00007f3bec4202d3 in librbd::ictx_refresh (ictx=0x30da440) at librbd/internal.cc:1677
#4  0x00007f3bec41ea0c in librbd::ictx_check (ictx=0x30da440) at librbd/internal.cc:1509
#5  0x00007f3bec412941 in librbd::snap_remove (ictx=0x30da440, snap_name=0x2fff6e4 "106") at librbd/internal.cc:467
#6  0x00007f3bec3f3076 in rbd_snap_remove (image=0x30da440, snap_name=0x2fff6e4 "106") at librbd/librbd.cc:748
#7  0x00007f3bf1137ea4 in ffi_call_unix64 () from /usr/lib/python2.7/lib-dynload/_ctypes.so
#8  0x00007f3bf11378c5 in ffi_call () from /usr/lib/python2.7/lib-dynload/_ctypes.so
#9  0x00007f3bf11282c2 in _ctypes_callproc () from /usr/lib/python2.7/lib-dynload/_ctypes.so
#10 0x00007f3bf1128aa2 in ?? () from /usr/lib/python2.7/lib-dynload/_ctypes.so
#11 0x00000000004c7c76 in PyObject_Call ()
#12 0x000000000042aa4a in PyEval_EvalFrameEx ()

but ictx->parent_lock (which frame #2 should be using) is still valid:

(gdb) p ictx->parent_lock
$12 = {name = 0x7f3bec463e99 "librbd::ImageCtx::parent_lock", id = -1, recursive = false, lockdep = true, backtrace = false, _m = {__data = {__lock = 1, __count = 0, __owner = 12069, __nusers = 1, __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001\000\000\000\000\000\000\000%/\000\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 1}, nlock = 1, locked_by = 139895470249728, cct = 0x0, logger = 0x0}
Updated by Josh Durgin over 11 years ago
It looks like the op in the objecter has been corrupted, similar to #3613. In this case, op->objver ends up pointing to the same address as a Mutex::Locker on the stack in librbd.
(gdb) bt
#0  0x00007fd44c584445 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fd44c587bab in __GI_abort () at abort.c:91
#2  0x00007fd4489f069d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fd4489ee846 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fd4489ee873 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fd4489ee96e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fd44a11cd5a in ceph::__ceph_assert_fail (assertion=0x7fd44a3192a9 "&mutex == m", file=0x7fd44a319260 "./common/Mutex.h", line=125, func=0x7fd44a319c80 "void Mutex::Locker::check(Mutex*)") at common/assert.cc:77
#7  0x00007fd44a0a32e8 in Mutex::Locker::check (this=0x7fff2c6d9460, m=0x2c2fcc8) at ./common/Mutex.h:125
#8  0x00007fd44a0b47a9 in Objecter::handle_osd_op_reply (this=0x2baab10, m=0x7fd420001650) at osdc/Objecter.cc:1397
#9  0x00007fd44a08ab1f in librados::RadosClient::_dispatch (this=0x2ba2fd0, m=0x7fd420001650) at librados/RadosClient.cc:310
#10 0x00007fd44a08a90f in librados::RadosClient::ms_dispatch (this=0x2ba2fd0, m=0x7fd420001650) at librados/RadosClient.cc:280
#11 0x00007fd44a1b85e7 in Messenger::ms_deliver_dispatch (this=0x2ba9cf0, m=0x7fd420001650) at msg/Messenger.h:549
#12 0x00007fd44a1b7cf5 in DispatchQueue::entry (this=0x2ba9dd8) at msg/DispatchQueue.cc:107
#13 0x00007fd44a252d4c in DispatchQueue::DispatchThread::entry (this=0x2ba9ef0) at msg/DispatchQueue.h:85
#14 0x00007fd44a2bd719 in Thread::_entry_func (arg=0x2ba9ef0) at common/Thread.cc:41
#15 0x00007fd44d84ee9a in start_thread (arg=0x7fd446f1c700) at pthread_create.c:308
#16 0x00007fd44c6404bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#17 0x0000000000000000 in ?? ()

(gdb) frame 8
#8  0x00007fd44a0b47a9 in Objecter::handle_osd_op_reply (this=0x2baab10, m=0x7fd420001650) at osdc/Objecter.cc:1397
1397    in osdc/Objecter.cc
(gdb) p rbd_test_locker_addr
$1 = (Mutex::Locker *) 0x7fff2c6d9460
(gdb) p rbd_test_mutex_addr
$2 = (Mutex *) 0x2c2fcc8
(gdb) p rbd_test_locker_addr->mutex
$3 = (Mutex &) @0x137: <error reading variable>
(gdb) p op
$4 = (Objecter::Op *) 0x7fd43416dc40
(gdb) p *op
$5 = {session = 0x2bb1cb0, session_item = {_item = 0x7fd43416dc40, _prev = 0x0, _next = 0x2d9deb8, _list = 0x2bb1cb0}, incarnation = 5, oid = {name = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2bdb5a8 "foo.rbd"}}}, oloc = {pool = 2, key = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7fd448c264d8 ""}}}, pgid = {m_pool = 2, m_seed = 1639484312, m_preferred = -1}, acting = {<std::_Vector_base<int, std::allocator<int> >> = {_M_impl = {<std::allocator<int>> = {<__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd434038c30, _M_finish = 0x7fd434038c38, _M_end_of_storage = 0x7fd434038c38}}, <No data fields>}, used_replica = false, con = 0x0, ops = {<std::_Vector_base<OSDOp, std::allocator<OSDOp> >> = {_M_impl = {<std::allocator<OSDOp>> = {<__gnu_cxx::new_allocator<OSDOp>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd43416db50, _M_finish = 0x7fd43416dc30, _M_end_of_storage = 0x7fd43416dc30}}, <No data fields>}, snapid = {val = 18446744073709551614}, snapc = {seq = {val = 0}, snaps = {<std::_Vector_base<snapid_t, std::allocator<snapid_t> >> = {_M_impl = {<std::allocator<snapid_t>> = {<__gnu_cxx::new_allocator<snapid_t>> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}}, <No data fields>}}, mtime = {tv = {tv_sec = 0, tv_nsec = 0}}, outbl = 0x0, out_bl = {<std::_Vector_base<ceph::buffer::list*, std::allocator<ceph::buffer::list*> >> = {_M_impl = {<std::allocator<ceph::buffer::list*>> = {<__gnu_cxx::new_allocator<ceph::buffer::list*>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd434014040, _M_finish = 0x7fd434014048, _M_end_of_storage = 0x7fd434014048}}, <No data fields>}, out_handler = {<std::_Vector_base<Context*, std::allocator<Context*> >> = {_M_impl = {<std::allocator<Context*>> = {<__gnu_cxx::new_allocator<Context*>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd434038c10, _M_finish = 0x7fd434038c18, _M_end_of_storage = 0x7fd434038c18}}, <No data fields>}, out_rval = {<std::_Vector_base<int*, std::allocator<int*> >> = {_M_impl = {<std::allocator<int*>> = {<__gnu_cxx::new_allocator<int*>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd434014060, _M_finish = 0x7fd434014068, _M_end_of_storage = 0x7fd434014068}}, <No data fields>}, flags = 16, priority = 0, onack = 0x7fd434007b10, oncommit = 0x7fd434007b30, tid = 2002, version = {version = 0, epoch = 0, __pad = 0}, attempts = 1, paused = false, objver = 0x7fff2c6d9460, reply_epoch = 0x0, stamp = {tv = {tv_sec = 1355531899, tv_nsec = 97356000}}, precalc_pgid = false, map_dne_bound = 0, budgeted = false, should_resend = true}
(gdb) p op->objver
$6 = (eversion_t *) 0x7fff2c6d9460
(gdb) p *op->objver
$7 = {version = 311, epoch = 242, __pad = 0}
(gdb) p op->reply_epoch
$8 = (epoch_t *) 0x0
I'll try reproducing without the recent objecter changes, and if the crash doesn't disappear, I'll probably try to bisect.
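Bisecting a regression like this is mechanical once there is a scripted reproducer. A generic sketch of the workflow using `git bisect run` on a throwaway toy repository (not the actual ceph.git session; the commit contents and "test" are stand-ins):

```shell
#!/bin/sh
# Toy repository: five commits c1..c5, where c3 first writes a value >= 3
# into f -- our stand-in for "the commit that introduced the crash".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5; do
    echo "$i" > f
    git add f
    git commit -qm "c$i"
done

git bisect start HEAD HEAD~4   # bad = tip, good = four commits back (c1)
# The "test" passes (exit 0) while f < 3; in a real session this would be
# the failing reproducer, e.g. the test_many_snaps run.
git bisect run sh -c 'test "$(cat f)" -lt 3' | grep "is the first bad commit"
git bisect reset >/dev/null
```

`git bisect run` interprets exit code 0 from the script as "good" and 1-127 (except 125, which means "skip") as "bad", so any reliable reproducer can drive the search unattended.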
Updated by Tamilarasi muthamizhan over 11 years ago
recent log: ubuntu@teuthology:/a/teuthology-2012-12-15_19:00:04-regression-next-testing-basic/16281
Updated by Tamilarasi muthamizhan over 11 years ago
also, ubuntu@teuthology:/a/teuthology-2012-12-15_19:00:04-regression-next-testing-basic/16289
Updated by Josh Durgin over 11 years ago
- Status changed from 12 to Fix Under Review
The wip-3611 branch contains a respin of the bad commit. It's passing test_stress_watch with failure injection and the Python tests so far. I'm leaving the Python tests running overnight.
Updated by Josh Durgin over 11 years ago
This survived overnight testing (with the Python librbd tests) with 56 passes.
Updated by Josh Durgin over 11 years ago
- Status changed from Fix Under Review to Resolved
This was caused by c3107009f66bc06b5e14c465142e14120f9a4412; reverting it fixes the problem. There is a corrected implementation of that patch in the wip-3611 branch, but after discussing it with Sam Just, I agree it's too complex for too little benefit, so keeping the original commit reverted makes sense.