Bug #3585

closed

Image import via QEMU-IMG results in a corrupt rbd

Added by Matt Anderson over 11 years ago. Updated over 11 years ago.

Status:
Closed
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Community (user)

Description

This is a follow-on from the mailing list topic "VM Corruption on 0.54 when 'client cache = false'". After upgrading to 0.55 and testing on a single server, the problem appeared to be resolved. I've now set up a third server and the corruption issue is back again. I've narrowed the issue down to what I think is a qemu-img problem. Different client cache settings don't appear to have any influence on the problem, contrary to what I first thought.

I'm importing an existing image using the command:

qemu-img convert -f raw -O rbd /mnt/gluster1-norep/templates/testImage.raw rbd:nearline/iHost-Test-TS8

This completes successfully in a normal amount of time, but when I boot the VM it fails to start, citing a corrupt or missing kernel. I've also tried importing a raw image via qemu-img, which displays the same problem and suggests it isn't specific to the QED image format. When importing the same raw image via the rbd tool, the image boots perfectly. Also, if I reinstall Windows from an ISO over the top of the corrupt image, the VM works perfectly thereafter with no ongoing corruption, which suggests that QEMU is working fine and it's just qemu-img that is causing the problem.

QEMU-IMG version is 1.1.2
Ceph version is 0.55 (690f8175606edf37a3177c27a3949c78fd37099f)
Kernel version is 3.6.8-1.el6.elrepo.x86_64

My ceph.conf and a client log of the import are attached. The cluster was in a healthy state at the time.

Thanks
-Matt


Files

ceph.conf (4.75 KB) Matt Anderson, 12/07/2012 12:21 AM
ceph.client.log (28.5 MB) Ceph client log of the command 'qemu-img convert -f raw -O rbd /mnt/gluster1-norep/templates/testImage.raw rbd:nearline/iHost-Test-TS8' Matt Anderson, 12/07/2012 12:21 AM
client.admin.log (22.6 MB) Matt Anderson, 12/13/2012 11:48 PM
compare.txt (15 MB) Matt Anderson, 12/13/2012 11:48 PM
Actions #1

Updated by Josh Durgin over 11 years ago

  • Status changed from New to In Progress
  • Assignee set to Josh Durgin
  • Priority changed from Normal to Urgent
  • Source changed from Development to Community (user)

As a workaround you can use 'rbd import file pool/image' on a raw file. Does the corrupted image show the correct size compared to the raw image? If you 'rbd export' the image, do the size and md5sum match the raw image? I'll see if I can reproduce this tomorrow.
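
For anyone who prefers to script the size and md5sum check, here is a minimal Python sketch. The function names and the chunk size are illustrative assumptions and not part of any Ceph or QEMU tooling; the second path is assumed to be a file already produced by 'rbd export'.

```python
import hashlib
import os
from typing import Tuple


def md5_of_file(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_size_and_md5(raw_path: str, exported_path: str) -> Tuple[bool, bool]:
    """Compare the original raw file with the exported copy.

    Returns (sizes_match, md5_match).
    """
    sizes_match = os.path.getsize(raw_path) == os.path.getsize(exported_path)
    md5_match = md5_of_file(raw_path) == md5_of_file(exported_path)
    return sizes_match, md5_match
```

A result of (True, False) would mean the export has the right length but different contents, which points at data corruption rather than truncation.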

Actions #2

Updated by Matt Anderson over 11 years ago

All of the imported images are showing the exact same size. TS7 is a qemu-img import and TS6 is an rbd import. qemu-img info also shows the correct size.

[root@KVM04 ~]# rbd info iHost-Test-TS7 -p nearline
rbd image 'iHost-Test-TS7':
    size 20480 MB in 5120 objects
    order 22 (4096 KB objects)
    block_name_prefix: rb.0.117f.6b8b4567
    format: 1
[root@KVM04 ~]# rbd info iHost-Test-TS6 -p nearline
rbd image 'iHost-Test-TS6':
    size 20480 MB in 5120 objects
    order 22 (4096 KB objects)
    block_name_prefix: rb.0.117d.6b8b4567
    format: 1

I'll let you know the results of the md5sum soon when I get a little more free time.
Actions #3

Updated by Josh Durgin over 11 years ago

Just reproduced a bad size (504 bytes less) when using qemu-img 1.1.2 to convert a 1024119288 byte file. It seems to be ignoring anything less than sector size (512 bytes) at the end. Current qemu.git also does this. Do the raw versions of your corrupted images have a file size that is not evenly divisible by 512 bytes?
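
The arithmetic behind the 504 missing bytes can be sketched in Python. This only illustrates rounding a size down to whole 512-byte sectors; it is not the actual qemu-img code, and the function name is made up for the example.

```python
SECTOR_SIZE = 512  # bytes, the sector size mentioned above


def truncate_to_sectors(size_bytes: int, sector_size: int = SECTOR_SIZE) -> int:
    """Round a byte count down to a whole number of sectors."""
    return (size_bytes // sector_size) * sector_size


original = 1024119288                      # size of the test file in bytes
converted = truncate_to_sectors(original)  # size after dropping the partial sector
lost = original - converted                # bytes ignored at the end of the file

print(original, converted, lost)  # 1024119288 1024118784 504
```

A file whose size is already a multiple of 512 comes through this rounding unchanged, so it would not lose any data this way.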

Actions #4

Updated by Josh Durgin over 11 years ago

It looks like qemu-img info is also reporting the size after using integer division and multiplication by 512, so it won't show the actual size if it's not 512-byte aligned. Unfortunately the rbd tool is always rounding the size in its output to be human readable, so to get the actual size in bytes you need to export the image and look at the resulting file size.

Actions #5

Updated by Matt Anderson over 11 years ago

I ran

qemu-img convert -f raw -O rbd testImage.raw rbd:nearline/testImage
rbd export testImage exportImage.raw -p nearline

and the resulting images are the same size (21474836480 B), but that size is divisible by 512.
The MD5 sums:
[root@KVM04 templates]# md5sum -b testImage.raw 
309bf66c9a26cb50d81d53bcf38d89bb *testImage.raw
[root@KVM04 templates]# md5sum -b exportImage.raw 
ccfc0700831fb5241e435a3848167d4e *exportImage.raw

If I can run any more tests to help out just let me know.

Actions #6

Updated by Josh Durgin over 11 years ago

Since the size isn't an issue, it'd be great if you could:

1) generate a log of qemu-img convert with 'rbd cache = false', 'debug rbd = 20', and 'debug ms = 1'
2) before trying to use the newly-converted image, export it to a file via rbd export
3) run 'cmp -l' on the original file and the exported version, and attach the output here (if it's very large, at least the first few hundred lines of differences would tell us something)

This should tell us what kind of corruption this is, and where it's happening. Turning off the cache for the conversion just simplifies things a bit by removing an extra layer.
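
If the 'cmp -l' output is too large to handle, a summary of the differing byte ranges carries much of the same information. The following Python sketch is one possible way to produce it; it is not a replacement for the requested 'cmp -l' output, and the function name and chunk size are assumptions made for the example.

```python
from typing import Iterator, Tuple


def differing_ranges(path_a: str, path_b: str,
                     chunk_size: int = 1024 * 1024) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte offsets, end exclusive, where the two files differ.

    If one file is longer, the extra tail is reported as a difference.
    """
    start = None  # start offset of the difference run currently open, if any
    offset = 0    # file offset of the chunk being compared
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a == b:
                # Identical chunk: close any open run and move on quickly.
                if start is not None:
                    yield (start, offset)
                    start = None
                offset += len(a)
                continue
            common = min(len(a), len(b))
            for i in range(common):
                if a[i] != b[i]:
                    if start is None:
                        start = offset + i
                elif start is not None:
                    yield (start, offset + i)
                    start = None
            if len(a) != len(b) and start is None:
                start = offset + common
            offset += max(len(a), len(b))
    if start is not None:
        yield (start, offset)
```

The byte-by-byte loop only runs on chunks that actually differ, so mostly identical images are scanned at close to plain read speed.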

Actions #7

Updated by Sage Weil over 11 years ago

  • Priority changed from Urgent to High

Actions #8

Updated by Matt Anderson over 11 years ago

Attached files as requested.
Compare was stopped early to save on file size.

Actions #9

Updated by Josh Durgin over 11 years ago

Thanks for the logs. All the differences there are zeroes where actual data should be, but the librbd debug log shows nothing being written to missing sections (i.e. librbd never even sees the missing data). This suggests that qemu-img or the rbd qemu driver is causing the problem. Does this still occur with a later version of qemu-img? What about when converting to other formats, e.g. raw -> qcow2 -> raw?
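
The "zeroes where actual data should be" pattern can also be checked directly against the two files. This Python sketch counts the differing bytes and how many of them are zero in the exported copy. It assumes both files have the same length, as established earlier in this thread, and the function name is hypothetical.

```python
from typing import Tuple


def classify_differences(original_path: str, exported_path: str,
                         chunk_size: int = 1024 * 1024) -> Tuple[int, int]:
    """Return (differing_bytes, differing_bytes_that_are_zero_in_export).

    Assumes both files are the same length; comparison stops at the
    end of the shorter file otherwise.
    """
    differing = 0
    zero_in_export = 0
    with open(original_path, "rb") as fo, open(exported_path, "rb") as fe:
        while True:
            o = fo.read(chunk_size)
            e = fe.read(chunk_size)
            if not o or not e:
                break
            if o == e:
                continue  # identical chunk, nothing to count
            for x, y in zip(o, e):
                if x != y:
                    differing += 1
                    if y == 0:
                        zero_in_export += 1
    return differing, zero_in_export
```

If the two counts are equal, every difference is a zero byte in the exported image, which is consistent with regions that were never written.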

Actions #10

Updated by Matt Anderson over 11 years ago

This seems to be fixed in QEMU 1.3.0 and Ceph 0.56.1.
I've tried QED -> Raw -> Ceph -> Raw, then QED -> Ceph -> Raw, and both have the same MD5 checksum. I've only imported a single VM at the moment, but it appears to be running flawlessly. I'll report back if I run across the issue again, but it should be all sorted now.

Thanks again for all the assistance Josh.

Actions #11

Updated by Josh Durgin over 11 years ago

  • Status changed from In Progress to Closed

Great, glad to hear it's fixed.
