Bug #3585
closed
Image import via QEMU-IMG results in a corrupt rbd
Added by Matt Anderson over 11 years ago.
Updated over 11 years ago.
Description
This is a follow-up to the mailing list topic VM Corruption on "0.54 when 'client cache = false'". After upgrading to 0.55 and testing on a single server, the issue appeared to be resolved. I've now set up a third server and the corruption is back. I've narrowed it down to what I think is a qemu-img problem. Contrary to what I first thought, different client cache settings don't appear to have any influence on it.
I'm importing an existing image using the command: qemu-img convert -f raw -O rbd /mnt/gluster1-norep/templates/testImage.raw rbd:nearline/iHost-Test-TS8
This completes successfully in a normal amount of time, but when I boot the VM the image appears corrupt and the VM can't boot, citing a corrupt or missing kernel. I've also tried importing a raw image via qemu-img, which shows the same problem, so it isn't specific to the QED image format. When I import the same raw image via the rbd tool, the image boots perfectly. Also, if I reinstall Windows from an ISO over the top of the corrupt image, the VM works perfectly thereafter with no ongoing corruption, which suggests that QEMU itself is working fine and only qemu-img is causing the problem.
QEMU-IMG version is 1.1.2
Ceph version is 0.55 (690f8175606edf37a3177c27a3949c78fd37099f)
Kernel version is 3.6.8-1.el6.elrepo.x86_64
My ceph.conf and a client log of the import are attached. The cluster was in a healthy state at the time.
Thanks
-Matt
- Status changed from New to In Progress
- Assignee set to Josh Durgin
- Priority changed from Normal to Urgent
- Source changed from Development to Community (user)
As a workaround you can use 'rbd import file pool/image' on a raw file. Does the corrupted image show the correct size compared to the raw image? If you 'rbd export' the image, do the size and md5sum match the raw image? I'll see if I can reproduce this tomorrow.
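The size and md5sum check asked for above can also be scripted. This is only a minimal sketch, assuming the original raw file and the 'rbd export' output are both ordinary local files; the function names file_md5 and same_size_and_md5 are made up for illustration and are not part of any Ceph or QEMU tool.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_size_and_md5(original_path, exported_path):
    """Return (sizes_match, md5s_match) for the two files (hypothetical helper)."""
    sizes_match = os.path.getsize(original_path) == os.path.getsize(exported_path)
    md5s_match = file_md5(original_path) == file_md5(exported_path)
    return sizes_match, md5s_match
```

A result of (True, False) would mean the export has the right length but different contents, which points at the data rather than the size.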
All of the imported images are showing the exact same size. TS7 is a qemu-img import and TS6 is an rbd import. qemu-img info also shows the correct size.
[root@KVM04 ~]# rbd info iHost-Test-TS7 -p nearline
rbd image 'iHost-Test-TS7':
size 20480 MB in 5120 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.117f.6b8b4567
format: 1
[root@KVM04 ~]# rbd info iHost-Test-TS6 -p nearline
rbd image 'iHost-Test-TS6':
size 20480 MB in 5120 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.117d.6b8b4567
format: 1
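For reference, the numbers in the 'rbd info' output are consistent with each other. The sketch below assumes that "MB" in that output means 2^20 bytes and that the object count is the image size divided by the object size, rounded up; rbd_object_count is an illustrative name, not an rbd function.

```python
def rbd_object_count(size_bytes, order):
    """Number of objects needed for an image, assuming objects of 2**order bytes."""
    object_size = 1 << order
    return (size_bytes + object_size - 1) // object_size  # round up


size_bytes = 20480 * 1024 * 1024       # 20480 MB, assuming MB = 2**20 bytes
print(size_bytes)                      # 21474836480
print((1 << 22) // 1024)               # 4096 (KB per object at order 22)
print(rbd_object_count(size_bytes, 22))  # 5120
```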
I'll let you know the results of the md5sum soon when I get a little more free time.
Just reproduced a bad size (504 bytes less) when using qemu-img 1.1.2 to convert a 1024119288 byte file. It seems to be ignoring anything less than sector size (512 bytes) at the end. Current qemu.git also does this. Do the raw versions of your corrupted images have a file size that is not evenly divisible by 512 bytes?
It looks like qemu-img info is also reporting the size after using integer division and multiplication by 512, so it won't show the actual size if it's not 512-byte aligned. Unfortunately the rbd tool is always rounding the size in its output to be human readable, so to get the actual size in bytes you need to export the image and look at the resulting file size.
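The truncation described above is just integer division by the sector size followed by multiplication. A small sketch of that arithmetic, using the 1024119288-byte file from the reproduction (the function names are illustrative, not taken from qemu):

```python
SECTOR_SIZE = 512


def truncated_to_sectors(size_bytes, sector_size=SECTOR_SIZE):
    """Size after dropping any partial sector at the end of the file."""
    return (size_bytes // sector_size) * sector_size


def bytes_lost(size_bytes, sector_size=SECTOR_SIZE):
    """Number of trailing bytes that do not fill a whole sector."""
    return size_bytes % sector_size


print(truncated_to_sectors(1024119288))  # 1024118784
print(bytes_lost(1024119288))            # 504
```

A file whose size is a multiple of 512 bytes loses nothing under this rounding, so a quick `bytes_lost(size)` check on the raw file tells you whether this particular problem can apply to it.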
I ran
qemu-img convert -f raw -O rbd testImage.raw rbd:nearline/testImage
rbd export testImage exportImage.raw -p nearline
and the resulting images are the same size (21474836480 B), which is divisible by 512.
The MD5 checksums differ, though:
[root@KVM04 templates]# md5sum -b testImage.raw
309bf66c9a26cb50d81d53bcf38d89bb *testImage.raw
[root@KVM04 templates]# md5sum -b exportImage.raw
ccfc0700831fb5241e435a3848167d4e *exportImage.raw
If I can run any more tests to help out just let me know.
Since the size isn't an issue, it'd be great if you could:
1) generate a log of qemu-img convert with 'rbd cache = false', 'debug rbd = 20', and 'debug ms = 1'
2) before trying to use the newly-converted image, export it to a file via rbd export
3) run 'cmp -l' on the original file and the exported version, and attach the output here (if it's very large, at least the first few hundred lines of differences would tell us something)
This should tell us what kind of corruption this is, and where it's happening. Turning off the cache for the conversion just simplifies things a bit by removing an extra layer.
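If 'cmp' isn't handy, step 3 can be approximated in a few lines of Python. This is a rough sketch rather than a replacement for the real tool: diff_bytes and format_like_cmp are made-up names, the limit of 300 is an arbitrary stand-in for "the first few hundred lines", and only the length common to both files is compared.

```python
def diff_bytes(original_path, exported_path, limit=300, chunk_size=1 << 20):
    """Return up to `limit` tuples (byte_number, original_byte, exported_byte).

    Byte numbers are 1-based, as in 'cmp -l' output.
    """
    differences = []
    offset = 0
    with open(original_path, "rb") as a, open(exported_path, "rb") as b:
        while len(differences) < limit:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if not chunk_a or not chunk_b:
                break  # reached the end of the shorter file
            if chunk_a != chunk_b:
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        differences.append((offset + i + 1, x, y))
                        if len(differences) >= limit:
                            break
            offset += min(len(chunk_a), len(chunk_b))
    return differences


def format_like_cmp(differences):
    """Render differences as 'byte_number octal octal' lines, similar to 'cmp -l'."""
    return ["%d %o %o" % (n, x, y) for n, x, y in differences]
```

Stopping after a fixed number of differences keeps the output small enough to attach, much like cutting the 'cmp -l' run short.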
- Priority changed from Urgent to High
Attached files as requested.
Compare was stopped early to save on file size.
Thanks for the logs. All the differences there are zeroes where actual data should be, but the librbd debug log shows nothing being written to missing sections (i.e. librbd never even sees the missing data). This suggests that qemu-img or the rbd qemu driver is causing the problem. Does this still occur with a later version of qemu-img? What about when converting to other formats, e.g. raw -> qcow2 -> raw?
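To make "zeroes where actual data should be" easier to see at a glance, the differing bytes can be grouped into offset ranges. The sketch below is one way to do that on two in-memory byte strings (for example, matching slices read from the original and exported files); zeroed_ranges is a hypothetical helper, not something from the attached logs or from librbd.

```python
def zeroed_ranges(original, exported):
    """Return [start, end) offset ranges where `exported` holds zero bytes
    but `original` holds non-zero data.

    Note: a zero byte in the original inside a lost region splits the run,
    since that byte is identical in both inputs.
    """
    ranges = []
    start = None
    for offset, (x, y) in enumerate(zip(original, exported)):
        lost = (y == 0 and x != 0)
        if lost and start is None:
            start = offset                 # a new zeroed run begins here
        elif not lost and start is not None:
            ranges.append((start, offset))  # the run ended just before offset
            start = None
    if start is not None:
        ranges.append((start, min(len(original), len(exported))))
    return ranges


print(zeroed_ranges(b"abcdefgh", b"ab\x00\x00ef\x00h"))  # [(2, 4), (6, 7)]
```

Comparing the start offsets and lengths of these ranges against the write offsets in the librbd debug log is one way to confirm that the missing sections were never submitted as writes.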
This seems to be fixed in QEMU 1.3.0 and Ceph 0.56.1.
I've tried QED -> Raw -> Ceph -> Raw and then QED -> Ceph -> Raw, and both give the same MD5 checksum. I've only imported a single VM so far, but it appears to be running flawlessly. I'll report back if I run across the issue again, but it should be all sorted now.
Thanks again for all the assistance Josh.
- Status changed from In Progress to Closed
Great, glad to hear it's fixed.