krbd: running simple fsstress produces corrupt XFS file system
This does not occur with the current ceph-client/master branch:
35152979 rbd: activate v2 image support
However, when a fix is applied for this:
this problem arises. Initially, Guangliang Zhao provided a
few iterations of fixes. The last one or two I rejected because
running xfstests over rbd fails test 013.
Now I've written my own fix for bug 2852 and my test fails
as well, so I'm a bit concerned that there is another problem
that's unmasked when bug 2852 is fixed.
I have reproduced the problem and narrowed it down to running
this: (I'm trying to get even fewer operations, but this is
rbd create image1 --size=1000
rbd map image1 # assume we get /dev/rbd1
mkdir -p /m
mount /dev/rbd1 /m
mkdir -p /m/out
fsstress -r -s 1351278607 -v -m 8 -n 5 -d /m/out
xfs_repair -n /dev/rbd1
This does 5 random operations (reproducible with the given seed)
and when xfs_repair is then run it reports errors.
Here are the five operations:
0/0: rmdir - no directory
0/1: symlink l0XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0
0/1: symlink add id=0,parent=-1
0/2: mknod c1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0
0/2: mknod add id=1,parent=-1
0/3: fiemap - no filename
0/4: link l0XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX l2XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0
0/4: link add id=2,parent=-1
I'm going to try to recreate a handful of commands that produce
the same result to see if I can remove fsstress from the picture.
#1 Updated by Alex Elder almost 11 years ago
OK, I've simplified it further. The dependence on name sizes
makes me think it has to do with log blocks. We'll see.
rbd create image1 --size=1000 rbd map image1 # assume we get /dev/rbd1 mkfs.xfs /dev/rbd1 mount /dev/rbd1 /m pushd /m # The following commands cause the failure. One less # character in any name makes the failure go away. # ln -s <157-char name> <87-char name> # ln <that 87-char name> <50-char name> ln -s \ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \ yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy ln \ yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy \ zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz popd umount /dev/rbd1 xfs_repair -n /dev/rbd1 # This will show corruption # rbd unmap /dev/rbd1 # rbd rm image1
#2 Updated by Alex Elder almost 11 years ago
I think I may know what's going on, at least with my particular
fix for the bio_chain_clone(). If I'm right the problem has
to do with what I think is an invalid assumption I made in my
fix. Namely, my fix assumes all bio's in a chain are contiguous
on disk but that's not (always) true. I haven't worked backward
to see if Zhao's most recent fix has the same assumption (thus
exhibiting the same xfstests failure).
The original bio_chain_clone() function does seem to have the
same assumption, however. It splits a bio if the total bytes
transferred so far exceeds the object boundary defined by the
"len" parameter, but in fact what's important is whether the
current bio, transferring from its sector offset for its length,
would cross that boundary. If a bio chain covers a discontiguous
range of blocks that's no good.
I know how I want to fix it but will start on it when I'm
refreshed in the morning.
#4 Updated by Alex Elder almost 11 years ago
I wanted to add this detail but forgot... These are the
debug statements that I thought told the story.
[10758.806428] rbd: rbd_do_request object_name=rb.0.1011.3f1ff551.000000000000 ofs=4096 len=12288
[10758.900400] rbd: rbd_do_request object_name=rb.0.1011.3f1ff551.000000000000 ofs=4186112 len=8192
[10758.994729] rbd: rbd_do_request object_name=rb.0.1011.3f1ff551.000000000001 ofs=0 len=8192
[10759.114659] rbd: rbd_coll_end_req_index ffff88020cd3f200 index 1 ret 0 len 8192
[10759.207414] rbd: rbd_coll_end_req_index ffff88020e0ebce0 index 0 ret 0 len 12288
[10759.300011] rbd: rbd_coll_end_req_index ffff88020cd3f200 index 0 ret 0 len 8192
#5 Updated by Alex Elder almost 11 years ago
That's surprising to me that a bio might represent a discontiguous IO...
is there someone we can bounce that theory off of?
Sorry, I probably misspoke. A single bio necessarily represents
contiguous I/O. It has a single start sector.
I had been quietly assuming that a chain of bio's would be contiguous
but that's not necessarily true. Each one could start at a different
bio_chain_clone() is cloning a chain of bio's. And I think (but I
need to look at this after a good night's sleep) that it is assuming
the bio's in the chain represent contiguous regions on disk.
The thing I just added shows completions for a collection (of two)
and their sizes don't look right for the requests that should have
#6 Updated by Alex Elder almost 11 years ago
I updated my code to reflect that bio's in a chain might
not be contiguous. It wasn't quite working, though, and
I took another look at the test case and it looks like
that may not have been an issue in the set of I/O's after
So I've backed out those changes and am instead approaching
it by looking at individual I/O operations (writes in
particular), and am reproducing them using simple commands
rather than XFS. I'm hoping that I can see some write, or
combination of writes, which result in different data coming
back on a read.
#7 Updated by Alex Elder almost 11 years ago
This whole episode started after multiple attempts to
fix bio_chain_clone() so it wouldn't leak bio_pair
structures produced failures when running xfstests #13.
One or two attempts that were done by Guangliang Zhao.
The last attempt was made by me, and the failure of the
same test led to the suspicion that maybe the code that
used bio_pair was obscuring some other problem.
So I have been pursuing that problem in earnest. But
ultimately, this problem is arising with a purported
bio_pair fix in place, and that fix could well be buggy.
Now I'm stepping back further, back to the code that
has no bio_pair "fix", and am going to try to confirm
that there is in fact a problem that needs to be solved.
#8 Updated by Alex Elder almost 11 years ago
I think that the bio_pair leak won't be reproduced in
the current environment. The reason is that bio_split()
and the bio_pair it produces only comes into play if we
have a bio containing a bio_vec that spans the boundary
between two rbd objects.
But rbd objects are limited by the CLI to 4KB minimum,
4MB default. And at that minimum size, the boundary
between objects will also be a boundary between pages,
hence there will be no bio_iov that spans two objects
(because it would also have to span two pages).
On a system with a larger page size than 4KB we might
bump into this, but not with the default x86 page size.
Josh and I discussed it and agreed I would descend
once more into it, verifying and fixing the bio_pair
leak issue so it won't bite us with large page sizes.
#9 Updated by Alex Elder almost 11 years ago
- Status changed from 12 to Resolved
The failure this was opened for was due to a bug in the
implementation of bio_chain_clone_range(), which was written
to resolve bug http://tracker.newdream.net/issues/2933.
I have posted a patch for review that implements a fix
for bug 2933, and this includes a correct implementation
So I'm marking this bug resolved.