Bug #2535

closed

rbd: random data corruption in vm

Added by Sage Weil almost 12 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From ML:

Date: Thu, 07 Jun 2012 20:04:09 +0200
From: Guido Winkelmann <guido-ceph@thisisnotatest.de>
To: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Random data corruption in VM, possibly caused by rbd
----------------------------------------

Hi,

I'm using Ceph with RBD to provide network-transparent disk images for
KVM-based virtual servers. For the last two days, I've been hunting a weird,
elusive bug where data in the virtual machines gets corrupted. It usually
manifests as files having some random data - usually zeroes - at the start,
before the actual contents begin.

To track this down, I wrote a simple io tester. It does the following:

- Create 1 Megabyte of random data
- Calculate the SHA256 hash of that data
- Write the data to a file on the hard disk, in a given directory, using the
hash as the filename
- Repeat until the disk is full
- Delete the last file (because it is very likely to be incompletely written)
- Read and delete all the files just written while checking that their sha256 
sums are equal to their filenames
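
The iotester source itself is only linked further down in this report; purely to illustrate the procedure described above, a minimal Python sketch (not the original tool) could look like this:

    #!/usr/bin/env python
    # Minimal sketch of the iotester procedure described above (not the original tool).
    import hashlib
    import os
    import sys

    def fill(directory):
        """Write 1 MiB random files, named by their SHA-256, until the disk is full."""
        written = []
        while True:
            data = os.urandom(1024 * 1024)
            name = hashlib.sha256(data).hexdigest()
            path = os.path.join(directory, name)
            try:
                with open(path, "wb") as f:
                    f.write(data)
                    f.flush()
                    os.fsync(f.fileno())
            except OSError:                    # disk full: drop the (likely partial) last file
                if os.path.exists(path):
                    os.unlink(path)
                return written
            written.append(path)

    def verify(paths):
        """Re-read and delete each file, checking that its SHA-256 matches its name."""
        for path in paths:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            os.unlink(path)
            if digest != os.path.basename(path):
                print('Digest wrong for file "%s"' % path)

    if __name__ == "__main__":
        verify(fill(sys.argv[1]))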

When running this io tester in a VM that uses a qcow2 file on a local hard
disk for its virtual disk, no errors are found. When the same VM is run using
rbd, the io tester finds on average about one corruption every 200 megabytes,
reproducibly.

(As an interesting aside, the io tester also prints how long it took to
read or write 100 MB, and it turns out reading the data back in again is about
three times slower than writing it in the first place...)

Ceph is version 0.47.2. Qemu KVM is 1.0, compiled with the spec file from 
http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=summary
(And compiled after ceph 0.47.2 was installed on that machine, so it would use 
the correct headers...)
Both the Ceph cluster and the KVM host machines are running on Fedora 16, with 
a fairly recent 3.3.x kernel.
The ceph cluster uses btrfs for the osds' data dirs. The journal is on a tmpfs.
(This is not a production setup - luckily.)
The virtual machine is using ext4 as its filesystem.
There were no obvious other problems with either the ceph cluster or the KVM 
host machines.

I have attached a copy of the ceph.conf in use, in case it might be helpful.

This is a huge problem, and any help in tracking it down would be much 
appreciated.

Regards,

Happens with rbd cache off, but not with it on.

Oliver sees this too when he turns off the cache.

http://ceph.com/qa/iotester.tgz
(precise compiled version at http://ceph.com/qa/iotester)
(precise compiled fiemap at http://ceph.com/qa/fiemap)

I was unable to reproduce this with:
- precise 3.2.0-23 ceph, btrfs w/ 1 replica, or ext4 w/ 2 replicas
- precise 3.2.0-23 host
- precise 3.2.0-23 8GB guest, ext4
(but it was my laptop, ssd, all one machine, so timing may be different)


Files

rb.0.13.00000000045a.block.bz2 (421 KB) Guido Winkelmann, 06/12/2012 05:32 AM
osd.2.log.bz2 (16.4 MB) Guido Winkelmann, 06/12/2012 05:32 AM
Actions #1

Updated by Sage Weil almost 12 years ago

The information that should let us fully diagnose this:

  • set

    debug osd = 20
    debug filestore = 20
    debug ms = 1

and restart osds (and maybe rotate the logs).

  • compile/prepare fiemap tool inside vm
  • run the iotester tool to failure
  • run fiemap on the bad file
  • provide fiemap output and osd log files
Actions #2

Updated by Guido Winkelmann almost 12 years ago

The bug also does not seem to occur with the setting "filestore fiemap = false" in ceph.conf.

Actions #3

Updated by Guido Winkelmann almost 12 years ago

On Monday, 11 June 2012, 09:30:42, Sage Weil wrote:

If you can reproduce it with 'debug filestore = 20' too, that will be
better, as it will tell us what the FIEMAP ioctl is returning.

I ran another test run with 'debug filestore = 20'.

Also, if
you can attach/post the contents of the object itself (rados -p rbd get
rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right
data (and the sparse-read operation that librbd is doing is the culprit).

I tried that, with the block name that the steps further below gave me:

rados -p rbd get rb.0.13.00000000045a block

When I looked into the block, it looked like a bunch of temp files from the portage system with padding in between, although it should be random data... I think I got the wrong block after all...

Here's what I did:
Run the iotester again:
testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
Tue Jun 12 13:51:58 CEST 2012
Wrote 100 MiB of data in 2004 milliseconds
[snip lots of irrelevant lines]
Wrote 100 MiB of data in 2537 milliseconds
Read 100 MiB of data in 3794 milliseconds
Read 100 MiB of data in 10150 milliseconds
Digest wrong for file "/var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e"
Tue Jun 12 13:55:00 CEST 2012

Run the fiemap tool on that file:

testserver-rbd11 ~ # ./fiemap /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e
File /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e has 1 extents:
  1. Logical Physical Length Flags
    0: 0000000000000000 0000000116900000 0000000000100000 0001

As for the log:

First, map the offset to an rbd block. For example, taking the 'Physical'
value of 00000000a8200000 from above:

$ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) ))
0000000002a0

That gave me

$ printf "%012x\n" $((0x0000000116900000 / (4096*1024) ))
00000000045a

Then figure out what the object name prefix is:

$ rbd info <imagename> | grep prefix
block_name_prefix: rb.0.1

Result: block_name_prefix: rb.0.13

Then add the block number, 0000000002a0 to that, e.g. rb.0.1.0000000002a0.

Result: rb.0.13.00000000045a

Then map that back to an osd with

$ ceph osd map rbd rb.0.1.0000000002a0
osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65
(2.5) -> up [0,2] acting [0,2]

That gives me
[root@storage1 ~]# ceph osd map rbd rb.0.13.00000000045a 2> /dev/null
osdmap e101 pool 'rbd' (2) object 'rb.0.13.00000000045a' -> pg 2.80b039fb (2.7b) -> up [2,1] acting [2,1]

You'll see the osd ids listed in brackets after 'acting'. We want the
first one, 0 in my example. The log from that OSD is what we need.

Okay, I'm attaching the compressed log for osd.2 and the compressed block to this issue report in Redmine.
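
For reference, the mapping quoted above boils down to a few lines. A rough Python sketch, assuming the default 4 MiB rbd object size and the block_name_prefix reported by 'rbd info':

    # Sketch of the mapping discussed above: fiemap 'Physical' offset -> rbd object name.
    # Assumes the default 4 MiB rbd object size; the prefix comes from 'rbd info <image>'.
    OBJECT_SIZE = 4096 * 1024

    def rbd_object_name(prefix, physical_offset):
        block = physical_offset // OBJECT_SIZE
        return "%s.%012x" % (prefix, block)

    print(rbd_object_name("rb.0.13", 0x0000000116900000))   # -> rb.0.13.00000000045a
    # The result is then fed to: ceph osd map rbd <object name>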

Actions #4

Updated by Sage Weil almost 12 years ago

Are there multiple partitions or is LVM on the disk, or is the file system on the raw device?

Actions #5

Updated by Guido Winkelmann almost 12 years ago

Sage Weil wrote:

Are there multiple partitions or is LVM on the disk, or is the file system on the raw device?

There are multiple partitions, no LVM.

Actions #6

Updated by Sage Weil almost 12 years ago

Guido Winkelmann wrote:

Sage Weil wrote:

Are there multiple partitions or is LVM on the disk, or is the file system on the raw device?

There are multiple partitions, no LVM.

Ah. In that case, you need to add the byte offset of the beginning of the file system's partition to that physical offset from fiemap, e.g. take the 'fdisk /dev/foo -l' Start sector value * 512, then recalculate the rbd block number and object name...
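
As a concrete sketch of that correction (the start sector below is a made-up placeholder, not a value from Guido's system):

    # Sketch of the corrected calculation. START_SECTOR is hypothetical; replace it
    # with the 'Start' value that fdisk reports for the partition holding the filesystem.
    OBJECT_SIZE = 4096 * 1024
    SECTOR_SIZE = 512
    START_SECTOR = 12345678                     # hypothetical placeholder

    fiemap_physical = 0x0000000116900000        # 'Physical' value from the fiemap tool
    disk_offset = START_SECTOR * SECTOR_SIZE + fiemap_physical
    print("%012x" % (disk_offset // OBJECT_SIZE))   # rbd block number to append to the prefix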

Actions #7

Updated by Guido Winkelmann almost 12 years ago

Sage Weil wrote:

Guido Winkelmann wrote:

Sage Weil wrote:

Are there multiple partitions or is LVM on the disk, or is the file system on the raw device?

There are multiple partitions, no LVM.

Ah. In that case, you need to add the byte offset of the beginning of the file system's partition to that physical offset from fiemap, e.g. take the 'fdisk /dev/foo -l' Start sector value * 512, then recalculate the rbd block number and object name...

That gives me rbd block 000000000a5b instead. This still does not seem to contain the correct data.

Actions #8

Updated by Guido Winkelmann almost 12 years ago

Status update:

I tried modifying the iotester so that it would work directly on the block device, in the hope of sidestepping all those problems with filesystems and partitions in between.

This setup has so far not been able to reproduce the bug. Maybe it's because I'm filling the volume strictly sequentially and completely now, before I verify anything. I suppose I will modify it even more to do some random reads and writes.
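
A rough sketch of such a random-read/write variant against the raw block device might look like the following (the device path and sizes are hypothetical placeholders, and running it destroys whatever is on that device):

    # Sketch of a random-offset variant working directly on a block device.
    # DEVICE, DEVICE_SIZE and the write count are placeholders; this overwrites the device.
    import hashlib
    import os
    import random

    DEVICE = "/dev/vdb"                 # hypothetical test volume inside the guest
    CHUNK = 1024 * 1024                 # 1 MiB chunks, chunk-aligned offsets
    DEVICE_SIZE = 8 * 1024 ** 3         # hypothetical 8 GiB volume

    expected = {}                       # offset -> sha256 of the data written there
    fd = os.open(DEVICE, os.O_RDWR)
    for _ in range(1000):               # random writes
        offset = random.randrange(0, DEVICE_SIZE // CHUNK) * CHUNK
        data = os.urandom(CHUNK)
        os.pwrite(fd, data, offset)
        expected[offset] = hashlib.sha256(data).hexdigest()
    os.fsync(fd)

    for offset, digest in expected.items():     # read back and verify
        data = os.pread(fd, CHUNK, offset)
        if hashlib.sha256(data).hexdigest() != digest:
            print("Digest wrong at offset %d" % offset)
    os.close(fd)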

Actions #9

Updated by Sage Weil almost 12 years ago

Let's try a different tack: I pushed an osd-verify-sparse-read-holes branch to ceph.git (based on 0.47.2) that reads the fiemap-reported holes and verifies they are filled with zeros. If they are not, it puts a warning in the cluster log ($mon_data/log.err and 'ceph -w' output). Enable this in your ceph.conf:

osd verify sparse read holes = true

Restart the osds, and reproduce the corruption, then check for errors like:

2012-06-14 12:42:02.081373 osd.0 [ERR] 2.5_head 9b62f76d/rb.0.1.000000000000/head//2 sparse-read found data in hole 109839~135168

in your log.
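
For context, the check that branch performs can be sketched roughly like this (illustration only, not the actual OSD code; the extent list stands in for what the FIEMAP ioctl reports):

    # Sketch of the idea behind 'osd verify sparse read holes': read the ranges that
    # FIEMAP did NOT report as data and check that they really are all zeros.
    def verify_holes(path, extents, object_size):
        """extents: list of (logical_offset, length) pairs reported by FIEMAP."""
        errors = []
        with open(path, "rb") as f:
            pos = 0
            for offset, length in sorted(extents):
                if offset > pos:                       # gap before this extent = a "hole"
                    f.seek(pos)
                    if f.read(offset - pos).strip(b"\0"):
                        errors.append((pos, offset - pos))
                pos = max(pos, offset + length)
            if object_size > pos:                      # trailing hole
                f.seek(pos)
                if f.read(object_size - pos).strip(b"\0"):
                    errors.append((pos, object_size - pos))
        return errors                                  # non-empty => data found in a hole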

Actions #10

Updated by Sage Weil almost 12 years ago

Just a bit of context: rbd without caching does a 'sparse-read' operation, which uses FIEMAP to determine which parts of the object are allocated on disk, and then only reads and returns those portions to the client. I suspect that there is a FIEMAP bug in the kernel/file system you are using (we've seen such bugs in ext4, but not recently), or in our code that calls fiemap...
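
To illustrate why a bad extent map shows up as zero-filled data in the guest, the client-side reassembly of a sparse-read result can be sketched like this (illustration only, not librbd code):

    # A sparse-read result is essentially (extent map, data); the reader rebuilds the
    # object and everything outside the reported extents comes back as zeros.
    def assemble_sparse_read(extent_map, data, object_size):
        """extent_map: list of (offset, length); data: the concatenated extent bytes."""
        buf = bytearray(object_size)              # starts out as all zeros
        pos = 0
        for offset, length in extent_map:
            buf[offset:offset + length] = data[pos:pos + length]
            pos += length
        return bytes(buf)

    # If FIEMAP fails to report, say, the first 128 KiB as allocated, those bytes are
    # never sent, and the guest sees zeros there even though the on-disk object is intact.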

Actions #11

Updated by Sage Weil almost 12 years ago

Sage Weil wrote:

Just a bit of context: rbd without caching does a 'sparse-read' operation, which uses FIEMAP to determine which parts of the object are allocated on disk, and then only reads and returns those portions to the client. I suspect that there is a FIEMAP bug in the kernel/file system you are using (we've seen such bugs in ext4, but not recently), or in our code that calls fiemap...

I would expect test_librbd_fsx to do a pretty good job uncovering those problems.

Oh, which leads to another idea: can you build test_librbd_fsx, and run that against your cluster? Something like

./test_librbd_fsx poolname imagename

(where imagename doesn't exist). It would be interesting to see if that runs into problems!

Actions #12

Updated by Sage Weil almost 12 years ago

  • Priority changed from Urgent to High

We've disabled fiemap, which appears to be the culprit. Josh is still tracking down which kernel releases are affected, since we have clear evidence of two fiemap calls giving different results after an fsync() with no intervening I/O.

Actions #13

Updated by Sage Weil almost 12 years ago

  • Project changed from Ceph to rbd
  • Category deleted (librbd)
Actions #14

Updated by Florian Haas over 11 years ago

Just so this doesn't fall off the table: can this be considered resolved with the changed filestore fiemap default? Or are there open questions on this that RBD users should still be concerned about?

Actions #15

Updated by Josh Durgin over 11 years ago

  • Status changed from Need More Info to Resolved

Thanks for reminding us, Florian. There are no open questions that I know of; the only problem here was fiemap, so this can be marked resolved.
