Project

General

Profile

Actions

Bug #3524

closed

test_librbd_fsx: crash after flatten

Added by Sage Weil over 11 years ago. Updated about 11 years ago.

Status:
Duplicate
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

#0  0x00007f5d6bd66d67 in librbd::aio_read (ictx=0x0, image_extents=..., buf=0x0, pbl=0x7f5c40006100, c=0x7f5c510ce370) at librbd/internal.cc:2589
2589    librbd/internal.cc: No such file or directory.
(gdb) bt
#0  0x00007f5d6bd66d67 in librbd::aio_read (ictx=0x0, image_extents=..., buf=0x0, pbl=0x7f5c40006100, c=0x7f5c510ce370) at librbd/internal.cc:2589
#1  0x00007f5d6bd395c9 in librbd::AioRequest::read_from_parent (this=0x7f5c400060b0, image_extents=...) at librbd/AioRequest.cc:60
#2  0x00007f5d6bd3a596 in librbd::AbstractWrite::should_complete (this=0x7f5c400060b0, r=-2) at librbd/AioRequest.cc:168
#3  0x00007f5d6bd3773b in librbd::AioRequest::complete (this=0x7f5c400060b0, r=-2) at ./librbd/AioRequest.h:37
#4  0x00007f5d6bd6524a in librbd::rados_req_cb (c=0x7f5c40009f50, arg=0x7f5c400060b0) at librbd/internal.cc:2345
#5  0x00007f5d6aecc0d1 in librados::C_AioSafe::finish (this=0x7f5d50269520, r=0) at ./librados/AioCompletionImpl.h:190
#6  0x00007f5d6af40da6 in Finisher::finisher_thread_entry (this=0x1692df8) at common/Finisher.cc:56
#7  0x00007f5d6aeb3dba in Finisher::FinisherThread::entry (this=0x1692f10) at ./common/Finisher.h:46
#8  0x00007f5d6b0dfce9 in Thread::_entry_func (arg=0x1692f10) at common/Thread.cc:41
#9  0x00007f5d69eb1e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007f5d6a1b94bd in klogctl () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) f 0
#0  0x00007f5d6bd66d67 in librbd::aio_read (ictx=0x0, image_extents=..., buf=0x0, pbl=0x7f5c40006100, c=0x7f5c510ce370) at librbd/internal.cc:2589
2589    in librbd/internal.cc
(gdb) up
#1  0x00007f5d6bd395c9 in librbd::AioRequest::read_from_parent (this=0x7f5c400060b0, image_extents=...) at librbd/AioRequest.cc:60
60      librbd/AioRequest.cc: No such file or directory.

m_ictx->parent == NULL?

ubuntu@teuthology:/a/sage-2012-11-24_15:06:57-regression-next-testing-basic/3533

last few lines of output:

2012-11-24T19:42:25.427 INFO:teuthology.orchestra.run.out:1143 punch    from 0x3cd4251c to 0x3cd47a44, (0x5528 bytes)
2012-11-24T19:42:25.740 INFO:teuthology.orchestra.run.out:1144 write    0x8610a1a thru  0x8616c38       (0x621f bytes)
2012-11-24T19:42:25.741 INFO:teuthology.orchestra.run.out:1145 write    0x1be69c10 thru 0x1be7108f      (0x7480 bytes)
2012-11-24T19:42:25.743 INFO:teuthology.orchestra.run.out:1146 flatten

CommandCrashedError: Command crashed: 'CEPH_CONF=/tmp/cephtest/ceph.conf LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/test_librbd_fsx -d -W -R -p 100 -P /tmp/cephtest/archive -t 1 -l 1073741824 -S 0 -N 2000 pool_client.0 image_client.0'

Actions #1

Updated by Tamilarasi muthamizhan over 11 years ago

recent logs: ubuntu@teuthology:/a/sage-2012-11-25_20:49:20-regression-next-master-basic/4332

Actions #2

Updated by Dan Mick over 11 years ago

  • Assignee set to Dan Mick
Actions #3

Updated by Dan Mick over 11 years ago

I've reproduced with logs, and am examining the ops. Working theory is that the flatten happened between the first phase of write ('check for child object with guard') and the second ('copyup'), and that the second phase just needs to validate
parent is still present, and restart if not.

Actions #4

Updated by Dan Mick over 11 years ago

I think I found the smoking gun. Two completions are racing to update partial
in add_partial_sparse_result(); there's no locking protecting partial. This leads
to one of them being lost. Testing now just acquiring and releasing the
AioCompletion lock around the add_partial_sparse_result call to see if the theory is correct.

Actions #5

Updated by Dan Mick over 11 years ago

  • Status changed from New to In Progress
Actions #6

Updated by Dan Mick over 11 years ago

  • Status changed from In Progress to Resolved

commit:41e16a3b40efb80a5ed7a5587438569ca86c85a3

Actions #7

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from Resolved to In Progress

recent log :
ubuntu@teuthology:/a/teuthology-2012-12-09_19:00:03-regression-master-testing-gcov/10977

Actions #8

Updated by Josh Durgin over 11 years ago

  • Status changed from In Progress to Resolved

That's a different bug. Created #3600 to track it.

Actions #9

Updated by Josh Durgin over 11 years ago

  • Status changed from Resolved to 12

Sam saw this come up again in: ubuntu@teutholog:/a/sam-ooo3/19022

It's a different cause of the same symptom. In this case, there are no writes in flight to the clone. The last few operations (with a bunch of osd thrashing omitted) are:

2012-12-20T13:07:12.811 INFO:teuthology.orchestra.run.out:702 clone     3 order 20 su 4096 sc 4
2012-12-20T13:07:23.733 INFO:teuthology.orchestra.run.out:checking clone #1, image image_client.0-clone1 against file /tmp/cephtest/archive/fsx-image_client.0-parent2
2012-12-20T13:08:22.578 INFO:teuthology.orchestra.run.out:703 write     0x16bb8b48 thru 0x16bbec61      (0x611a bytes)
2012-12-20T13:08:22.619 INFO:teuthology.orchestra.run.out:705 trunc     from 0x1dc360f4 to 0x7c00952
2012-12-20T13:11:37.601 INFO:teuthology.orchestra.run.out:706 read      0x4447088 thru  0x444edfc       (0x7d75 bytes)
2012-12-20T13:11:51.626 INFO:teuthology.orchestra.run.out:709 clone     4 order 23 su 262144 sc 12
2012-12-20T13:12:08.125 INFO:teuthology.orchestra.run.out:checking clone #2, image image_client.0-clone2 against file /tmp/cephtest/archive/fsx-image_client.0-parent3
2012-12-20T13:12:15.747 INFO:teuthology.orchestra.run.out:712 flatten

Note that the trunc operation causes a cache invalidation, and this returned an error in the client log, indicating that there were some non-clean buffers that could not be released:

2012-12-20 13:25:01.308682 7f1a67748780 -1 librbd::ImageCtx: could not release all objects from cache

Actions #10

Updated by Sage Weil over 11 years ago

  • Priority changed from Urgent to High
Actions #11

Updated by Dan Mick about 11 years ago

  • Assignee changed from Dan Mick to Josh Durgin
Actions #12

Updated by Josh Durgin about 11 years ago

  • Status changed from 12 to Duplicate

Same as #3665

Actions

Also available in: Atom PDF