Bug #3611

rbd.py: segfault with many snapshots

Added by Josh Durgin over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Source: Q/A
Description

In the nightly Python API tests, test_many_snaps failed with a segfault in all runs.
Logs are in:

ubuntu@teuthology:/a/teuthology-2012-12-11_19:00:03-regression-next-testing-basic/{11886,11879,11872,11869}

Actions #1

Updated by Josh Durgin over 11 years ago

Without lockdep, I could not reproduce the crash.

Running with lockdep enabled results in this backtrace:

common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f9226f70710 time 2012-12-12 14:47:24.851828
common/lockdep.cc: 181: FAILED assert(0)
 ceph version 0.55-232-g54618af (54618afab2d1634e1fac218a021a07c3905e5c20)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f922c228459]
 2: (lockdep_will_lock(char const*, int)+0x435) [0x7f922c3fad48]
 3: (Mutex::_will_lock()+0x24) [0x7f922c2126ae]
 4: (Mutex::Lock(bool)+0x41) [0x7f922c2124c5]
 5: (librbd::AioCompletion::get_return_value()+0x1d) [0x7f92264c51db]
 6: (librbd::rbd_ctx_cb(void*, void*)+0x3b) [0x7f92264f9665]
 7: (librbd::AioCompletion::complete()+0xd6) [0x7f92264c8246]
 8: (librbd::AioCompletion::complete_request(CephContext*, long)+0x2ad) [0x7f92264c7777]
 9: (librbd::C_AioWrite::finish(int)+0x30) [0x7f92264fe834]
 10: (Context::complete(int)+0x2b) [0x7f92264c8045]
 11: (librbd::AioRequest::complete(int)+0x6f) [0x7f92264c80d5]
 12: (librbd::rados_req_cb(void*, void*)+0x34) [0x7f92264fb075]
 13: (librados::C_AioSafe::finish(int)+0x4d) [0x7f922c1ab697]
 14: (Finisher::finisher_thread_entry()+0x389) [0x7f922c227507]
 15: (Finisher::FinisherThread::entry()+0x1c) [0x7f922c191f74]
 16: (Thread::_entry_func(void*)+0x23) [0x7f922c3e0759]
 17: (()+0x68ba) [0x7f922ee5b8ba]
 18: (clone()+0x6d) [0x7f922dd2d02d]

This lock is a recursive lock, and lockdep is generating a false positive on its re-acquisition.
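
For reference, a minimal toy sketch of that false-positive mechanism (illustrative code only, not Ceph's common/lockdep.cc): re-acquiring a PTHREAD_MUTEX_RECURSIVE lock from the same thread is legal, but a checker that only records "this thread already holds lock L" asserts on the second acquisition.

// Toy sketch, not Ceph's lockdep.
#include <cassert>
#include <pthread.h>
#include <set>
#include <string>
#include <utility>

static thread_local std::set<std::string> held_locks;  // per-thread checker state

struct ToyMutex {
  pthread_mutex_t m;
  std::string name;
  bool recursive;

  ToyMutex(std::string n, bool r) : name(std::move(n)), recursive(r) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    if (recursive)
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
  }

  void lock() {
    // A correct checker would skip or refine this test when `recursive`
    // is true; asserting unconditionally reproduces the false positive
    // seen in the lockdep_will_lock backtrace above.
    assert(!held_locks.count(name) && "lockdep: already holding this lock");
    held_locks.insert(name);
    pthread_mutex_lock(&m);
  }

  void unlock() {
    pthread_mutex_unlock(&m);
    held_locks.erase(name);  // toy simplification: no hold count
  }
};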

Actions #2

Updated by Josh Durgin over 11 years ago

  • Priority changed from Urgent to Normal

Downgrading priority since this isn't an actual bug.

Actions #3

Updated by Josh Durgin over 11 years ago

  • Priority changed from Normal to Urgent

Actions #4

Updated by Josh Durgin over 11 years ago

Finally got a backtrace. It seems something is overwriting a Mutex::Locker on the stack:

#0  0x00007f3bec3f46ed in Mutex::_pre_unlock (this=0x180) at ./common/Mutex.h:96
#1  0x00007f3bec3f47ba in Mutex::Unlock (this=0x180) at ./common/Mutex.h:105
#2  0x00007f3bec3fb429 in Mutex::Locker::~Locker (this=0x7fff3b6d9310, __in_chrg=<optimized out>) at ./common/Mutex.h:123
#3  0x00007f3bec4202d3 in librbd::ictx_refresh (ictx=0x30da440) at librbd/internal.cc:1677
#4  0x00007f3bec41ea0c in librbd::ictx_check (ictx=0x30da440) at librbd/internal.cc:1509
#5  0x00007f3bec412941 in librbd::snap_remove (ictx=0x30da440, snap_name=0x2fff6e4 "106") at librbd/internal.cc:467
#6  0x00007f3bec3f3076 in rbd_snap_remove (image=0x30da440, snap_name=0x2fff6e4 "106") at librbd/librbd.cc:748
#7  0x00007f3bf1137ea4 in ffi_call_unix64 () from /usr/lib/python2.7/lib-dynload/_ctypes.so
#8  0x00007f3bf11378c5 in ffi_call () from /usr/lib/python2.7/lib-dynload/_ctypes.so
#9  0x00007f3bf11282c2 in _ctypes_callproc () from /usr/lib/python2.7/lib-dynload/_ctypes.so
#10 0x00007f3bf1128aa2 in ?? () from /usr/lib/python2.7/lib-dynload/_ctypes.so
#11 0x00000000004c7c76 in PyObject_Call ()
#12 0x000000000042aa4a in PyEval_EvalFrameEx ()

but ictx->parent_lock (the mutex that the Locker l2 in that frame should be referencing) is still valid:

(gdb) p ictx->parent_lock
$12 = {name = 0x7f3bec463e99 "librbd::ImageCtx::parent_lock", id = -1, recursive = false, lockdep = true, backtrace = false, _m = {__data = {__lock = 1, __count = 0, 
      __owner = 12069, __nusers = 1, __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\001\000\000\000\000\000\000\000%/\000\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 1}, nlock = 1, locked_by = 139895470249728, 
  cct = 0x0, logger = 0x0}
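
To illustrate frames #0-#2 (a simplified sketch with assumed layout, not the actual common/Mutex.h classes): the Locker scope guard keeps a pointer to its Mutex in the caller's stack frame, so a single stray store that aliases that slot makes the destructor unlock through garbage, producing addresses like this=0x180 above.

// Simplified sketch of the Mutex/Locker shape involved in frames #0-#2.
#include <cstdint>
#include <pthread.h>

struct Mutex {
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  void Lock()   { pthread_mutex_lock(&m); }
  void Unlock() { pthread_mutex_unlock(&m); }

  struct Locker {
    Mutex *mutex;                           // lives in the caller's stack frame
    explicit Locker(Mutex &mu) : mutex(&mu) { mutex->Lock(); }
    ~Locker() { mutex->Unlock(); }          // faults if `mutex` was clobbered
  };
};

// If a stale pointer saved elsewhere happens to alias the stack slot
// holding l.mutex, one store through it corrupts the guard:
void victim(Mutex &parent_lock, uintptr_t *dangling_alias) {
  Mutex::Locker l(parent_lock);
  *dangling_alias = 0x180;  // imagine this aliases &l.mutex (undefined behavior)
}                           // ~Locker now calls Unlock(this=0x180) and crashes
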
Actions #5

Updated by Josh Durgin over 11 years ago

It looks like the op in the objecter has been corrupted, similar to #3613. In this case, op->objver ends up pointing to the same address as a Mutex::Locker on the stack in librbd.

(gdb) bt
#0  0x00007fd44c584445 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fd44c587bab in __GI_abort () at abort.c:91
#2  0x00007fd4489f069d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fd4489ee846 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fd4489ee873 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fd4489ee96e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fd44a11cd5a in ceph::__ceph_assert_fail (assertion=0x7fd44a3192a9 "&mutex == m", file=0x7fd44a319260 "./common/Mutex.h", line=125, 
    func=0x7fd44a319c80 "void Mutex::Locker::check(Mutex*)") at common/assert.cc:77
#7  0x00007fd44a0a32e8 in Mutex::Locker::check (this=0x7fff2c6d9460, m=0x2c2fcc8) at ./common/Mutex.h:125
#8  0x00007fd44a0b47a9 in Objecter::handle_osd_op_reply (this=0x2baab10, m=0x7fd420001650) at osdc/Objecter.cc:1397
#9  0x00007fd44a08ab1f in librados::RadosClient::_dispatch (this=0x2ba2fd0, m=0x7fd420001650) at librados/RadosClient.cc:310
#10 0x00007fd44a08a90f in librados::RadosClient::ms_dispatch (this=0x2ba2fd0, m=0x7fd420001650) at librados/RadosClient.cc:280
#11 0x00007fd44a1b85e7 in Messenger::ms_deliver_dispatch (this=0x2ba9cf0, m=0x7fd420001650) at msg/Messenger.h:549
#12 0x00007fd44a1b7cf5 in DispatchQueue::entry (this=0x2ba9dd8) at msg/DispatchQueue.cc:107
#13 0x00007fd44a252d4c in DispatchQueue::DispatchThread::entry (this=0x2ba9ef0) at msg/DispatchQueue.h:85
#14 0x00007fd44a2bd719 in Thread::_entry_func (arg=0x2ba9ef0) at common/Thread.cc:41
#15 0x00007fd44d84ee9a in start_thread (arg=0x7fd446f1c700) at pthread_create.c:308
#16 0x00007fd44c6404bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#17 0x0000000000000000 in ?? ()
(gdb) frame 8
#8  0x00007fd44a0b47a9 in Objecter::handle_osd_op_reply (this=0x2baab10, m=0x7fd420001650) at osdc/Objecter.cc:1397
1397    in osdc/Objecter.cc
(gdb) p rbd_test_locker_addr
$1 = (Mutex::Locker *) 0x7fff2c6d9460
(gdb) p rbd_test_mutex_addr
$2 = (Mutex *) 0x2c2fcc8
(gdb) p rbd_test_locker_addr->mutex
$3 = (Mutex &) @0x137: <error reading variable>
(gdb) p op
$4 = (Objecter::Op *) 0x7fd43416dc40
(gdb) p *op
$5 = {session = 0x2bb1cb0, session_item = {_item = 0x7fd43416dc40, _prev = 0x0, _next = 0x2d9deb8, _list = 0x2bb1cb0}, incarnation = 5, oid = {name = {
      static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
        _M_p = 0x2bdb5a8 "foo.rbd"}}}, oloc = {pool = 2, key = {static npos = <optimized out>, 
      _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7fd448c264d8 ""}}}, pgid = {
    m_pool = 2, m_seed = 1639484312, m_preferred = -1}, acting = {<std::_Vector_base<int, std::allocator<int> >> = {
      _M_impl = {<std::allocator<int>> = {<__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd434038c30, 
        _M_finish = 0x7fd434038c38, _M_end_of_storage = 0x7fd434038c38}}, <No data fields>}, used_replica = false, con = 0x0, 
  ops = {<std::_Vector_base<OSDOp, std::allocator<OSDOp> >> = {
      _M_impl = {<std::allocator<OSDOp>> = {<__gnu_cxx::new_allocator<OSDOp>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd43416db50, 
        _M_finish = 0x7fd43416dc30, _M_end_of_storage = 0x7fd43416dc30}}, <No data fields>}, snapid = {val = 18446744073709551614}, snapc = {seq = {val = 0}, 
    snaps = {<std::_Vector_base<snapid_t, std::allocator<snapid_t> >> = {
        _M_impl = {<std::allocator<snapid_t>> = {<__gnu_cxx::new_allocator<snapid_t>> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0, 
          _M_end_of_storage = 0x0}}, <No data fields>}}, mtime = {tv = {tv_sec = 0, tv_nsec = 0}}, outbl = 0x0, 
  out_bl = {<std::_Vector_base<ceph::buffer::list*, std::allocator<ceph::buffer::list*> >> = {
      _M_impl = {<std::allocator<ceph::buffer::list*>> = {<__gnu_cxx::new_allocator<ceph::buffer::list*>> = {<No data fields>}, <No data fields>}, 
        _M_start = 0x7fd434014040, _M_finish = 0x7fd434014048, _M_end_of_storage = 0x7fd434014048}}, <No data fields>}, 
  out_handler = {<std::_Vector_base<Context*, std::allocator<Context*> >> = {
      _M_impl = {<std::allocator<Context*>> = {<__gnu_cxx::new_allocator<Context*>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd434038c10, 
        _M_finish = 0x7fd434038c18, _M_end_of_storage = 0x7fd434038c18}}, <No data fields>}, out_rval = {<std::_Vector_base<int*, std::allocator<int*> >> = {
      _M_impl = {<std::allocator<int*>> = {<__gnu_cxx::new_allocator<int*>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fd434014060, 
        _M_finish = 0x7fd434014068, _M_end_of_storage = 0x7fd434014068}}, <No data fields>}, flags = 16, priority = 0, onack = 0x7fd434007b10, 
  oncommit = 0x7fd434007b30, tid = 2002, version = {version = 0, epoch = 0, __pad = 0}, attempts = 1, paused = false, objver = 0x7fff2c6d9460, reply_epoch = 0x0, 
  stamp = {tv = {tv_sec = 1355531899, tv_nsec = 97356000}}, precalc_pgid = false, map_dne_bound = 0, budgeted = false, should_resend = true}
(gdb) p op->objver
$6 = (eversion_t *) 0x7fff2c6d9460
(gdb) p *op->objver
$7 = {version = 311, epoch = 242, __pad = 0}
(gdb) p op->reply_epoch
$8 = (epoch_t *) 0x0

I'll try reproducing without the recent objecter changes, and if the crash doesn't disappear, I'll probably try to bisect.
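
To make the suspected failure mode concrete, here is a self-contained reduction (submit_op, handle_reply, and the globals are made-up names for illustration, not the real Objecter interfaces): the caller hands the dispatcher the address of a stack-local out-parameter and returns before the reply arrives, so the late write through op->objver lands on whatever has since reused that stack slot, here a Mutex::Locker.

// Hypothetical reduction of the suspected bug; names are illustrative.
#include <cstdint>

struct eversion_t { uint64_t version = 0; uint32_t epoch = 0; };

struct Op {
  eversion_t *objver;   // borrowed pointer; nothing checks its lifetime
};

static Op *pending;     // stands in for the Objecter's in-flight op map

void submit_op(eversion_t *out) {
  pending = new Op{out};          // retains the caller's stack address
}

void caller() {
  eversion_t ver;                 // stack-local out-parameter
  submit_op(&ver);
}                                 // returns: &ver inside `pending` now dangles

void handle_reply(uint64_t v, uint32_t e) {
  // Runs later on the dispatch thread. By now caller()'s frame may have
  // been reused by another call holding a Mutex::Locker, so this store
  // corrupts that object -- matching op->objver above being equal to the
  // Locker's stack address (0x7fff2c6d9460).
  if (pending && pending->objver) {
    pending->objver->version = v;
    pending->objver->epoch = e;
  }
}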

Actions #6

Updated by Tamilarasi muthamizhan over 11 years ago

recent log: ubuntu@teuthology:/a/teuthology-2012-12-15_19:00:04-regression-next-testing-basic/16281

Actions #7

Updated by Tamilarasi muthamizhan over 11 years ago

also, ubuntu@teuthology:/a/teuthology-2012-12-15_19:00:04-regression-next-testing-basic/16289

Actions #8

Updated by Josh Durgin over 11 years ago

  • Status changed from 12 to Fix Under Review

wip-3611 contains a respin of the bad commit. It's passing test_stress_watch with failure injection and the Python tests so far. I'm leaving the Python tests running overnight.

Actions #9

Updated by Josh Durgin over 11 years ago

This survived overnight testing with the Python librbd tests: 56 passes.

Actions #10

Updated by Josh Durgin over 11 years ago

  • Status changed from Fix Under Review to Resolved

This was caused by c3107009f66bc06b5e14c465142e14120f9a4412; reverting it fixes the problem. There is a corrected implementation of that patch in the wip-3611 branch, but after discussing it with Sam Just, I agree it's too complex for too little benefit, so keeping the commit reverted makes sense.
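
The wip-3611 respin itself isn't shown here, but as a general sketch of the safer ownership pattern (assuming the root cause was the borrowed out-pointer diagnosed above): the op stores the replied version by value, and the caller copies it out only after completion, so no pointer into a stack frame ever outlives that frame.

// Conservative ownership sketch, not the actual revert or wip-3611 code.
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct eversion_t { uint64_t version = 0; uint32_t epoch = 0; };

struct Op {
  std::mutex m;
  std::condition_variable cv;
  bool done = false;
  eversion_t replied_ver;         // owned by the op, not borrowed

  void complete(eversion_t v) {   // dispatch thread
    std::lock_guard<std::mutex> l(m);
    replied_ver = v;              // write into op-owned storage only
    done = true;
    cv.notify_all();
  }

  eversion_t wait() {             // caller thread
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return done; });
    return replied_ver;           // copy out while the frame is alive
  }
};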
