Bug #3524
closedtest_librbd_fsx: crash after flatten
0%
Description
#0 0x00007f5d6bd66d67 in librbd::aio_read (ictx=0x0, image_extents=..., buf=0x0, pbl=0x7f5c40006100, c=0x7f5c510ce370) at librbd/internal.cc:2589 2589 librbd/internal.cc: No such file or directory. (gdb) bt #0 0x00007f5d6bd66d67 in librbd::aio_read (ictx=0x0, image_extents=..., buf=0x0, pbl=0x7f5c40006100, c=0x7f5c510ce370) at librbd/internal.cc:2589 #1 0x00007f5d6bd395c9 in librbd::AioRequest::read_from_parent (this=0x7f5c400060b0, image_extents=...) at librbd/AioRequest.cc:60 #2 0x00007f5d6bd3a596 in librbd::AbstractWrite::should_complete (this=0x7f5c400060b0, r=-2) at librbd/AioRequest.cc:168 #3 0x00007f5d6bd3773b in librbd::AioRequest::complete (this=0x7f5c400060b0, r=-2) at ./librbd/AioRequest.h:37 #4 0x00007f5d6bd6524a in librbd::rados_req_cb (c=0x7f5c40009f50, arg=0x7f5c400060b0) at librbd/internal.cc:2345 #5 0x00007f5d6aecc0d1 in librados::C_AioSafe::finish (this=0x7f5d50269520, r=0) at ./librados/AioCompletionImpl.h:190 #6 0x00007f5d6af40da6 in Finisher::finisher_thread_entry (this=0x1692df8) at common/Finisher.cc:56 #7 0x00007f5d6aeb3dba in Finisher::FinisherThread::entry (this=0x1692f10) at ./common/Finisher.h:46 #8 0x00007f5d6b0dfce9 in Thread::_entry_func (arg=0x1692f10) at common/Thread.cc:41 #9 0x00007f5d69eb1e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #10 0x00007f5d6a1b94bd in klogctl () from /lib/x86_64-linux-gnu/libc.so.6 #11 0x0000000000000000 in ?? () (gdb) f 0 #0 0x00007f5d6bd66d67 in librbd::aio_read (ictx=0x0, image_extents=..., buf=0x0, pbl=0x7f5c40006100, c=0x7f5c510ce370) at librbd/internal.cc:2589 2589 in librbd/internal.cc (gdb) up #1 0x00007f5d6bd395c9 in librbd::AioRequest::read_from_parent (this=0x7f5c400060b0, image_extents=...) at librbd/AioRequest.cc:60 60 librbd/AioRequest.cc: No such file or directory.
m_ictx->parent == NULL?
ubuntu@teuthology:/a/sage-2012-11-24_15:06:57-regression-next-testing-basic/3533
last few lines of output:
2012-11-24T19:42:25.427 INFO:teuthology.orchestra.run.out:1143 punch from 0x3cd4251c to 0x3cd47a44, (0x5528 bytes) 2012-11-24T19:42:25.740 INFO:teuthology.orchestra.run.out:1144 write 0x8610a1a thru 0x8616c38 (0x621f bytes) 2012-11-24T19:42:25.741 INFO:teuthology.orchestra.run.out:1145 write 0x1be69c10 thru 0x1be7108f (0x7480 bytes) 2012-11-24T19:42:25.743 INFO:teuthology.orchestra.run.out:1146 flatten
CommandCrashedError: Command crashed: 'CEPH_CONF=/tmp/cephtest/ceph.conf LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/test_librbd_fsx -d -W -R -p 100 -P /tmp/cephtest/archive -t 1 -l 1073741824 -S 0 -N 2000 pool_client.0 image_client.0'
Updated by Tamilarasi muthamizhan over 11 years ago
recent logs: ubuntu@teuthology:/a/sage-2012-11-25_20:49:20-regression-next-master-basic/4332
Updated by Dan Mick over 11 years ago
I've reproduced with logs, and am examining the ops. Working theory is that the flatten happened between the first phase of write ('check for child object with guard') and the second ('copyup'), and that the second phase just needs to validate
parent is still present, and restart if not.
Updated by Dan Mick over 11 years ago
I think I found the smoking gun. Two completions are racing to update partial
in add_partial_sparse_result(); there's no locking protecting partial. This leads
to one of them being lost. Testing now just acquiring and releasing the
AioCompletion lock around the add_partial_sparse_result call to see if the theory is correct.
Updated by Dan Mick over 11 years ago
- Status changed from In Progress to Resolved
commit:41e16a3b40efb80a5ed7a5587438569ca86c85a3
Updated by Tamilarasi muthamizhan over 11 years ago
- Status changed from Resolved to In Progress
recent log :
ubuntu@teuthology:/a/teuthology-2012-12-09_19:00:03-regression-master-testing-gcov/10977
Updated by Josh Durgin over 11 years ago
- Status changed from In Progress to Resolved
That's a different bug. Created #3600 to track it.
Updated by Josh Durgin over 11 years ago
- Status changed from Resolved to 12
Sam saw this come up again in: ubuntu@teutholog:/a/sam-ooo3/19022
It's a different cause of the same symptom. In this case, there are no writes in flight to the clone. The last few operations (with a bunch of osd thrashing omitted) are:
2012-12-20T13:07:12.811 INFO:teuthology.orchestra.run.out:702 clone 3 order 20 su 4096 sc 4 2012-12-20T13:07:23.733 INFO:teuthology.orchestra.run.out:checking clone #1, image image_client.0-clone1 against file /tmp/cephtest/archive/fsx-image_client.0-parent2 2012-12-20T13:08:22.578 INFO:teuthology.orchestra.run.out:703 write 0x16bb8b48 thru 0x16bbec61 (0x611a bytes) 2012-12-20T13:08:22.619 INFO:teuthology.orchestra.run.out:705 trunc from 0x1dc360f4 to 0x7c00952 2012-12-20T13:11:37.601 INFO:teuthology.orchestra.run.out:706 read 0x4447088 thru 0x444edfc (0x7d75 bytes) 2012-12-20T13:11:51.626 INFO:teuthology.orchestra.run.out:709 clone 4 order 23 su 262144 sc 12 2012-12-20T13:12:08.125 INFO:teuthology.orchestra.run.out:checking clone #2, image image_client.0-clone2 against file /tmp/cephtest/archive/fsx-image_client.0-parent3 2012-12-20T13:12:15.747 INFO:teuthology.orchestra.run.out:712 flatten
Note that the trunc operation causes a cache invalidation, and this returned an error in the client log, indicating that there were some non-clean buffers that could not be released:
2012-12-20 13:25:01.308682 7f1a67748780 -1 librbd::ImageCtx: could not release all objects from cache
Updated by Dan Mick about 11 years ago
- Assignee changed from Dan Mick to Josh Durgin