Bug #5426

librbd: mutex assert in perfcounters::tinc in librbd::AioCompletion::complete()

Added by Sage Weil over 10 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Josh Durgin
Target version:
-
% Done:

100%

Source:
Q/A
Tags:
Backport:
dumpling, emperor
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-06-22T12:24:53.336 INFO:teuthology.orchestra.run.err:common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f32aaffd700 time 2013-06-22 12:24:55.286391
2013-06-22T12:24:53.336 INFO:teuthology.orchestra.run.err:common/Mutex.cc: 93: FAILED assert(r == 0)
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: ceph version 0.61.4-5-gd572cf6 (d572cf6f77418f217a5a8e37f1124dc566e24d0b)
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 1: (Mutex::Lock(bool)+0x1c3) [0x7f32b5362593]
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 2: (PerfCounters::tinc(int, utime_t)+0x5e) [0x7f32b535fdbe]
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 3: (librbd::AioCompletion::complete()+0x210) [0x7f32b5f72ac0]
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 4: (librbd::AioCompletion::complete_request(CephContext*, long)+0x1c7) [0x7f32b5f71ed7]
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 5: (librbd::C_AioRead::finish(int)+0x1d6) [0x7f32b5f72336]
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 6: (Context::complete(int)+0xa) [0x7f32b5f7281a]
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 7: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f32b5f86517]
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 8: (librados::C_AioComplete::finish(int)+0x1d) [0x7f32b530b12d]
2013-06-22T12:24:53.337 INFO:teuthology.orchestra.run.err: 9: (Finisher::finisher_thread_entry()+0x1c0) [0x7f32b5378bb0]
2013-06-22T12:24:53.338 INFO:teuthology.orchestra.run.err: 10: (()+0x7e9a) [0x7f32b4902e9a]
2013-06-22T12:24:53.338 INFO:teuthology.orchestra.run.err: 11: (clone()+0x6d) [0x7f32b4c0bccd]
2013-06-22T12:24:53.338 INFO:teuthology.orchestra.run.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

run was
ubuntu@teuthology:/a/teuthology-2013-06-21_20:00:30-rbd-cuttlefish-testing-basic/42512$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 2dd322b42d608a37f3e5beed57a8fbc673da6e32
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: cuttlefish
  ceph:
    conf:
      client:
        rbd cache: false
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: d572cf6f77418f217a5a8e37f1124dc566e24d0b
  install:
    ceph:
      sha1: d572cf6f77418f217a5a8e37f1124dc566e24d0b
  s3tests:
    branch: cuttlefish
  workunit:
    sha1: d572cf6f77418f217a5a8e37f1124dc566e24d0b
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000

Subtasks

Bug #8409: "Segmentation fault" in rbd in upgrade:dumpling-dumpling-testing-basic-vps (Duplicate)

Associated revisions

Revision 4cea7895 (diff)
Added by Josh Durgin almost 10 years ago

librbd: call user completion after incrementing perfcounters

The perfcounters (and the ictx) are only valid while the image is
still open. If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.

The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.

Fixes: #5426
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <>
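
For illustration, here is a minimal sketch of the reordering this commit describes, simplified to the AIO read case. It is not the verbatim librbd source; the member names (ictx, perfcounter, complete_cb, rbd_comp) and the counter id are assumptions based on the commit message and the backtrace above.

// Hedged sketch of the fixed ordering, not the actual librbd code.
void AioCompletion::complete() {
  assert(lock.is_locked());
  utime_t elapsed = ceph_clock_now(ictx->cct) - start_time;

  // 1. Update the ImageCtx-owned perfcounters first, while the image is
  //    guaranteed to still be open: the user callback has not fired yet,
  //    so the user cannot have closed the image.
  ictx->perfcounter->tinc(l_librbd_aio_rd_latency, elapsed);

  // 2. Only then hand control back to the user. The callback may close
  //    the image, but complete() never dereferences ictx again; the
  //    AioCompletion itself is cleaned up later by complete_request(),
  //    independent of the ImageCtx.
  if (complete_cb)
    complete_cb(rbd_comp, complete_arg);

  done = true;
  cond.Signal();
}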

Revision 8f3ad4e3
Added by Josh Durgin almost 10 years ago

Merge pull request #1000 from ceph/wip-rbd-tinc-5426

fix #5426 race in librbd

Reviewed-by: Sage Weil <>

Revision 1abb1698 (diff)
Added by Josh Durgin over 9 years ago

librbd: call user completion after incrementing perfcounters

The perfcounters (and the ictx) are only valid while the image is
still open. If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.

The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.

Fixes: #5426
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <>
(cherry picked from commit 4cea7895da7331b84d8c6079851fdc0ff2f4afb1)

Revision 2f25bfea (diff)
Added by Josh Durgin over 9 years ago

librbd: call user completion after incrementing perfcounters

The perfcounters (and the ictx) are only valid while the image is
still open. If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object is has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.

The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.

Fixes: #5426
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <>
(cherry picked from commit 4cea7895da7331b84d8c6079851fdc0ff2f4afb1)

History

#1 Updated by Ian Colle over 10 years ago

  • Assignee set to Josh Durgin

#2 Updated by Sage Weil about 10 years ago

  • Status changed from New to 12

2013-07-09T20:46:21.561 INFO:teuthology.orchestra.run.err:common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fb3b1ffb700 time 2013-07-09 20:46:41.372806
2013-07-09T20:46:21.561 INFO:teuthology.orchestra.run.err:common/Mutex.cc: 93: FAILED assert(r == 0)
2013-07-09T20:46:21.567 INFO:teuthology.orchestra.run.err: ceph version 0.61.4-48-g24f90b8 (24f90b832c695ef13021db66a178c18369ac356d)
2013-07-09T20:46:21.567 INFO:teuthology.orchestra.run.err: 1: (Mutex::Lock(bool)+0x1c3) [0x7fb3bc4410d3]
2013-07-09T20:46:21.567 INFO:teuthology.orchestra.run.err: 2: (PerfCounters::tinc(int, utime_t)+0x5e) [0x7fb3bc43e8fe]
2013-07-09T20:46:21.567 INFO:teuthology.orchestra.run.err: 3: (librbd::AioCompletion::complete()+0x210) [0x7fb3bd053ac0]
2013-07-09T20:46:21.567 INFO:teuthology.orchestra.run.err: 4: (librbd::AioCompletion::complete_request(CephContext*, long)+0x1c7) [0x7fb3bd052ed7]
2013-07-09T20:46:21.567 INFO:teuthology.orchestra.run.err: 5: (librbd::C_AioRead::finish(int)+0x1d6) [0x7fb3bd053336]
2013-07-09T20:46:21.568 INFO:teuthology.orchestra.run.err: 6: (Context::complete(int)+0xa) [0x7fb3bd05381a]
2013-07-09T20:46:21.568 INFO:teuthology.orchestra.run.err: 7: (librbd::rados_req_cb(void*, void*)+0x47) [0x7fb3bd067517]
2013-07-09T20:46:21.568 INFO:teuthology.orchestra.run.err: 8: (librados::C_AioComplete::finish(int)+0x1d) [0x7fb3bc3e9c6d]
2013-07-09T20:46:21.568 INFO:teuthology.orchestra.run.err: 9: (Finisher::finisher_thread_entry()+0x1c0) [0x7fb3bc4576f0]
2013-07-09T20:46:21.568 INFO:teuthology.orchestra.run.err: 10: (()+0x7e9a) [0x7fb3bb9e0e9a]
2013-07-09T20:46:21.568 INFO:teuthology.orchestra.run.err: 11: (clone()+0x6d) [0x7fb3bbce9ccd]
2013-07-09T20:46:21.568 INFO:teuthology.orchestra.run.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-07-09T20:46:21.569 INFO:teuthology.orchestra.run.err:terminate called after throwing an instance of 'ceph::FailedAssertion'

job was
ubuntu@teuthology:/a/teuthology-2013-07-09_20:00:36-rbd-cuttlefish-testing-basic/61034$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 365b57b1317524bb0cdd15859a224ba1ab58d1d7
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: cuttlefish
  ceph:
    conf:
      client:
        rbd cache: false
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 24f90b832c695ef13021db66a178c18369ac356d
  install:
    ceph:
      sha1: 24f90b832c695ef13021db66a178c18369ac356d
  s3tests:
    branch: cuttlefish
  workunit:
    sha1: 24f90b832c695ef13021db66a178c18369ac356d
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000

#3 Updated by Sage Weil about 10 years ago

Both of these are cuttlefish... maybe something that hasn't been backported?

#4 Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/a/teuthology-2013-07-13_20:00:45-rbd-cuttlefish-testing-basic/66483$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 365b57b1317524bb0cdd15859a224ba1ab58d1d7
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: cuttlefish
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 658240710baaf9c661b8fbf856322907a0d394ee
  install:
    ceph:
      sha1: 658240710baaf9c661b8fbf856322907a0d394ee
  s3tests:
    branch: cuttlefish
  workunit:
    sha1: 658240710baaf9c661b8fbf856322907a0d394ee
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    conf:
      client:
        rbd cache: false
- rbd_fsx:
    clients:
    - client.0
    ops: 5000

cuttlefish again!

#5 Updated by Sage Weil about 10 years ago

okay, not cuttlefish.

ubuntu@teuthology:/a/teuthology-2013-07-17_01:00:47-rbd-next-testing-basic/70375

2013-07-17T10:14:13.999 INFO:teuthology.task.workunit.client.0.err:     0> 2013-07-17 10:14:41.818619 7f00ae7fc700 -1 *** Caught signal (Segmentation fault) **
2013-07-17T10:14:13.999 INFO:teuthology.task.workunit.client.0.err: in thread 7f00ae7fc700
2013-07-17T10:14:13.999 INFO:teuthology.task.workunit.client.0.err:
2013-07-17T10:14:13.999 INFO:teuthology.task.workunit.client.0.err: ceph version 0.66-669-g884fa2f (884fa2fcb6d707b23317bab1da909586ddc27608)
2013-07-17T10:14:13.999 INFO:teuthology.task.workunit.client.0.err: 1: rbd() [0x41f2aa]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 2: (()+0xfcb0) [0x7f00b8569cb0]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 3: (PerfCounters::tinc(int, utime_t)+0x28) [0x7f00b895f6b8]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 4: (librbd::AioCompletion::complete()+0x210) [0x7f00b95e95d0]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 5: (librbd::AioCompletion::complete_request(CephContext*, long)+0x1c7) [0x7f00b95e8a37]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 6: (librbd::C_AioRead::finish(int)+0x15d) [0x7f00b95e8e1d]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 7: (Context::complete(int)+0xa) [0x7f00b95e932a]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 8: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f00b95fd007]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 9: (librados::C_AioComplete::finish(int)+0x1d) [0x7f00b88fef7d]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 10: (Finisher::finisher_thread_entry()+0x1c0) [0x7f00b89826b0]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 11: (()+0x7e9a) [0x7f00b8561e9a]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: 12: (clone()+0x6d) [0x7f00b7b74ccd]
2013-07-17T10:14:14.000 INFO:teuthology.task.workunit.client.0.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#6 Updated by Josh Durgin about 10 years ago

  • Status changed from 12 to In Progress

#7 Updated by Sage Weil about 10 years ago

  • Priority changed from Urgent to High

#8 Updated by Josh Durgin about 10 years ago

Reproduced after 22 tries with no debugging on. Couldn't reproduce with debugging enabled, but this is a start.

The PerfCounters object hasn't been destroyed yet, but the _conf and other fields that its m_cct references are invalid, so m_cct is probably already destroyed at this point.

p *(this->m_cct)
$8 = {nref = {val = 23004272}, _conf = 0x671, _log = 0x7facf203ebf8, _module_type = 4060343288, _service_thread = 0x15e8960, _log_obs = 0x15e8960, _admin_socket = 0x30, 
  _service_thread_lock = 769, _perf_counters_collection = 0x15b7570, _perf_counters_conf_obs = 0x16af240, _admin_hook = 0x10000fffffffe, 
  _heartbeat_map = 0x722032393935363a, _crypto_none = 0x2e617461645f6462, _crypto_aes = 0x3166636463303031}

#9 Updated by Josh Durgin about 10 years ago

Never mind, the PerfCounters object is invalid too, with m_lower_bound = 22777856, m_upper_bound = 0

#10 Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/a/teuthology-2013-08-22_01:01:01-rbd-next-testing-basic-plana/888

#11 Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/a/teuthology-2013-08-23_00:30:06-ceph-deploy-master---basic-saya/1087

slightly different, though:

...

Thread 10 (Thread 0x7f9d1c19e700 (LWP 4064)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f9d1df100ab in std::vector<ceph::log::Subsystem, std::allocator<ceph::log::Subsystem> >::_M_fill_insert(__gnu_cxx::__normal_iterator<ceph::log::Subsystem*, std::vector<ceph::log::Subsystem, std::allocator<ceph::log::Subsystem> > >, unsigned long, ceph::log::Subsystem const&) ()
   from /usr/lib/librados.so.2
#2  0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f9d1b19c700 (LWP 4066)):
#0  0x00007f9d1d6cb313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f9d1de9561b in std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> >::~pair() () from /usr/lib/librados.so.2
#2  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f9d19999700 (LWP 4069)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1  0x00007f9d1dea9ebb in Finisher::wait_for_empty() () from /usr/lib/librados.so.2
#2  0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f9d0acf3700 (LWP 8025)):
#0  0x00007f9d1d6cb313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f9d1df8ed72 in Pipe::read_message(Message**) () from /usr/lib/librados.so.2
#2  0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f9d0adf4700 (LWP 8024)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f9d1df9b3db in Pipe::writer() () from /usr/lib/librados.so.2
#2  0x00007f9d1dfa66bd in ?? () from /usr/lib/librados.so.2
#3  0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f9d0abf2700 (LWP 8041)):
#0  0x00007f9d1d6cb313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f9d1df8ed72 in Pipe::read_message(Message**) () from /usr/lib/librados.so.2
#2  0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f9d1ef9a780 (LWP 4063)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f9d1de173b3 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, Objecter::PoolOp*>, std::_Select1st<std::pair<unsigned long const, Objecter::PoolOp*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Objecter::PoolOp*> > >::_M_erase(std::_Rb_tree_node<std::pair<unsigned long const, Objecter::PoolOp*> >*) () from /usr/lib/librados.so.2
#2  0x00007f9d1de2285b in librados::IoCtxImpl::aio_sparse_read(object_t, librados::AioCompletionImpl*, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > >*, ceph::buffer::list*, unsigned long, unsigned long, unsigned long) () from /usr/lib/librados.so.2
#3  0x00007fff1216c820 in ?? ()
#4  0x00007fff1216c820 in ?? ()
#5  0x0000000100000000 in ?? ()
#6  0x00007fff1216c820 in ?? ()
#7  0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f9d0b1f8700 (LWP 8080)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f9d1df9b3db in Pipe::writer() () from /usr/lib/librados.so.2
#2  0x00007f9d1dfa66bd in ?? () from /usr/lib/librados.so.2
#3  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f9d0aef5700 (LWP 8039)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f9d1df9b3db in Pipe::writer() () from /usr/lib/librados.so.2
#2  0x00007f9d1dfa66bd in ?? () from /usr/lib/librados.so.2
#3  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f9d0bfff700 (LWP 4074)):
#0  0x00007f9d1de87188 in PerfCountersCollection::remove(PerfCounters*) () from /usr/lib/librados.so.2
#1  0x0000003000000003 in ?? ()
#2  0x0000000000020000 in ?? ()
#3  0x00007f9d0bffe778 in ?? ()
#4  0x00007f9d0bffe778 in ?? ()
#5  0x00007f9c00000000 in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) q

#12 Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/a/teuthology-2013-09-05_01:00:58-rbd-next-testing-basic-plana/22153

#13 Updated by Tamilarasi muthamizhan about 10 years ago

ubuntu@teuthology:/a/teuthology-2013-09-09_01:36:24-upgrade-small-next-testing-basic-vps/27433

#15 Updated by Sage Weil almost 10 years ago

  • Priority changed from High to Normal

#16 Updated by Tamilarasi muthamizhan almost 10 years ago

latest logs: ubuntu@teuthology:/a/teuthology-2013-11-08_05:35:02-upgrade-small-next-testing-basic-vps/90358

#17 Updated by Ian Colle almost 10 years ago

  • Priority changed from Normal to Urgent

Upgraded to Urgent since it is causing failures in the nightlies.

#18 Updated by Tamilarasi muthamizhan almost 10 years ago

recent logs : ubuntu@teuthology:/a/teuthology-2013-11-21_23:01:11-rbd-next-testing-basic-plana/113257

2013-11-22T03:56:35.652 INFO:teuthology.orchestra.run.out:[10.214.132.22]: 3523 flatten
2013-11-22T03:56:38.000 INFO:teuthology.orchestra.run.out:[10.214.132.22]: 3525 trunc   from 0x62e39fe to 0x1b9c7ec
2013-11-22T03:56:38.135 INFO:teuthology.orchestra.run.out:[10.214.132.22]: 3529 punch   from 0x1b7c8d9 to 0x1b8ae76, (0xe59d bytes)
2013-11-22T03:56:38.178 INFO:teuthology.orchestra.run.out:[10.214.132.22]: 3530 trunc   from 0x1b9c7ec to 0x74f792a
2013-11-22T03:56:38.210 INFO:teuthology.orchestra.run.out:[10.214.132.22]: 3531 clone   20 order 18 su 131072 sc 12
2013-11-22T03:56:39.281 INFO:teuthology.orchestra.run.out:[10.214.132.22]: checking clone #18, image image_client.0-clone18 against file /home/ubuntu/cephtest/archive/fsx-image_client.0-parent19
2013-11-22T03:56:41.435 INFO:teuthology.orchestra.run.err:[10.214.132.22]: common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f291d4f1700 time 2013-11-22 03:56:41.432336
2013-11-22T03:56:41.435 INFO:teuthology.orchestra.run.err:[10.214.132.22]: common/Mutex.cc: 93: FAILED assert(r == 0)
2013-11-22T03:56:41.435 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  ceph version 0.72-214-g203065c (203065c076daf9aaa9ade3d7cdd1b10e6700bfe6)
2013-11-22T03:56:41.436 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  1: (Mutex::Lock(bool)+0x1c3) [0x7f29232fc0e3]
2013-11-22T03:56:41.436 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  2: (PerfCounters::tinc(int, utime_t)+0x5e) [0x7f29232f983e]
2013-11-22T03:56:41.436 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  3: (librbd::AioCompletion::complete()+0x210) [0x7f2923fecb10]
2013-11-22T03:56:41.436 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  4: (librbd::AioCompletion::complete_request(CephContext*, long)+0x177) [0x7f2923febfd7]
2013-11-22T03:56:41.437 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  5: (librbd::C_AioRead::finish(int)+0x104) [0x7f2923fec2d4]
2013-11-22T03:56:41.437 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  6: (Context::complete(int)+0x9) [0x7f2923fec769]
2013-11-22T03:56:41.437 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  7: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f2923ff6f57]
2013-11-22T03:56:41.437 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  8: (librados::C_AioComplete::finish(int)+0x1d) [0x7f292329024d]
2013-11-22T03:56:41.438 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  9: (Context::complete(int)+0x9) [0x7f2923270e69]
2013-11-22T03:56:41.438 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  10: (Finisher::finisher_thread_entry()+0x1c0) [0x7f292331d7d0]
2013-11-22T03:56:41.438 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  11: (()+0x7e9a) [0x7f2922825e9a]
2013-11-22T03:56:41.439 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  12: (clone()+0x6d) [0x7f2922b2eccd]
2013-11-22T03:56:41.439 INFO:teuthology.orchestra.run.err:[10.214.132.22]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-11-22T03:56:41.439 INFO:teuthology.orchestra.run.err:[10.214.132.22]: terminate called after throwing an instance of 'ceph::FailedAssertion'
2013-11-22T03:56:52.443 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-next/teuthology/run_tasks.py", line 33, in run_tasks
    manager.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/teuthology-next/teuthology/task/rbd_fsx.py", line 40, in task
    p.spawn(_run_one_client, ctx, config, role)
  File "/home/teuthworker/teuthology-next/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/teuthworker/teuthology-next/teuthology/parallel.py", line 100, in next
    resurrect_traceback(result)
  File "/home/teuthworker/teuthology-next/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/teuthology-next/teuthology/task/rbd_fsx.py", line 62, in _run_one_client
    'image_{image}'.format(image=role),
  File "/home/teuthworker/teuthology-next/teuthology/orchestra/remote.py", line 47, in run
    r = self._runner(client=self.ssh, **kwargs)
  File "/home/teuthworker/teuthology-next/teuthology/orchestra/run.py", line 271, in run
    r.exitstatus = _check_status(r.exitstatus)
  File "/home/teuthworker/teuthology-next/teuthology/orchestra/run.py", line 264, in _check_status
    raise CommandCrashedError(command=r.command)
CommandCrashedError: Command crashed: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -t 1 -l 250000000 -S 0 -N 5000 pool_client.0 image_client.0'
2013-11-22T03:56:52.650 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/inktank/teuthology/search?q=c31f734e23384b0396a3d735e27ad1f7
CommandCrashedError: Command crashed: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -t 1 -l 250000000 -S 0 -N 5000 pool_client.0 image_client.0'

#19 Updated by Dmitry Borodaenko almost 10 years ago

This problem is easily reproducible with Mirantis OpenStack 4.0:
https://bugs.launchpad.net/fuel/+bug/1260911

After converting and uploading a RAW CirrOS image, all you have to do is launch a VM from that image, then run `rbd -p compute export <instance-id>_disk - > /dev/null`.
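
For context, the librbd-level pattern behind reproducers like this is an application that closes the image as soon as its last AIO callback has fired, while librbd's finisher thread may still be inside complete(). The following is a hedged sketch against the public librbd C++ API; the pool and image names and the read size are made up, actually hitting the race depends on timing, and it only affects versions without the fix above.

#include <semaphore.h>
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>

static sem_t io_done;

// Runs on librbd's finisher thread; on affected versions librbd still
// touches the image's perfcounters after this callback returns.
static void read_cb(librbd::completion_t, void *)
{
  sem_post(&io_done);
}

int main()
{
  sem_init(&io_done, 0, 0);

  librados::Rados rados;
  rados.init("admin");
  rados.conf_read_file(NULL);
  rados.connect();

  librados::IoCtx io_ctx;
  rados.ioctx_create("rbd", io_ctx);

  {
    librbd::RBD rbd;
    librbd::Image image;               // closed when it goes out of scope
    rbd.open(io_ctx, image, "testimg");

    ceph::bufferlist bl;
    librbd::RBD::AioCompletion *c =
        new librbd::RBD::AioCompletion(NULL, read_cb);
    image.aio_read(0, 4096, bl, c);

    // Wait only for the user callback, then immediately let the image
    // close. If the finisher thread has not yet reached the perfcounter
    // update inside complete(), it dereferences a freed ImageCtx and
    // hits the assert/segfault shown in this ticket.
    sem_wait(&io_done);
    c->release();
  }

  rados.shutdown();
  return 0;
}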

#20 Updated by Josh Durgin almost 10 years ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to dumpling, emperor

#21 Updated by Josh Durgin almost 10 years ago

  • Status changed from Fix Under Review to Pending Backport

I cherry-picked the fix to the dumpling-5426 and emperor-5426 branches, based on dumpling and emperor, respectively. The fix will go into the actual dumpling and emperor branches once it has passed nightly tests for a while.

#22 Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved

#23 Updated by Dmitry Borodaenko over 9 years ago

Fix confirmed with the packages from http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/dumpling-5426/ on Mirantis OpenStack 4.0; the SEGV no longer occurs.
