Bug #376
closed
File corruption after cluster crashes
Description
A few days ago my cluster lost 5 of its 12 OSDs due to #371 (one OSD) and #367 (four OSDs).
After #367 was fixed (which took about 36 hours) I was able to get my cluster back to 11 OSDs, but it then wouldn't fully recover because of bug #374.
In total the cluster was degraded for about 72 hours. During this time I did not perform any I/O (in particular no writes) on the cluster.
When checking file integrity, I found that the checksums of various (large) files no longer matched.
A few comparisons:
root@logger:/srv/ceph/iso# md5sum ubuntu-6.06.2-server-i386.iso /mnt/ceph/static/ubuntu/.pool/ubuntu-6.06.2-server-i386.iso
1bf5938af8a61b8de749c43ba8071181 ubuntu-6.06.2-server-i386.iso
f43706f26a22050a8e91962cf44dcbbd /mnt/ceph/static/ubuntu/.pool/ubuntu-6.06.2-server-i386.iso
root@logger:/srv/ceph/iso#

root@logger:/srv/ceph/iso# md5sum ubuntu-9.04-alternate-i386.iso /mnt/ceph/static/ubuntu/.pool/ubuntu-9.04-alternate-i386.iso
c564ae16dffb51a922aef74a07250473 ubuntu-9.04-alternate-i386.iso
66ee645951c47698aea1a0a7a5be3815 /mnt/ceph/static/ubuntu/.pool/ubuntu-9.04-alternate-i386.iso
root@logger:/srv/ceph/iso# cat /mnt/ceph/static/ubuntu/9.04/MD5SUMS|grep ubuntu-9.04-alternate-i386.iso
c564ae16dffb51a922aef74a07250473 *ubuntu-9.04-alternate-i386.iso
root@logger:/srv/ceph/iso#
Here you can see that both of these ISOs have been corrupted.
It seems that only large files were affected; as far as I could see, small files (like HTML and TXT) were not corrupted.
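Comparisons like the ones above can be scripted for a whole directory. The following is a minimal sketch in Python, not the procedure used for this report. The function names md5_of and find_mismatches are made up for illustration, and the sketch assumes that the reference copies and the copies on the Ceph mount share the same file names.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(reference_dir, ceph_dir):
    """Compare each regular file in reference_dir with its same-named copy in ceph_dir.

    Returns a list of (name, reference_md5, ceph_md5) tuples, sorted by name,
    for files whose digests differ. A copy that is missing on the Ceph side is
    reported with ceph_md5 set to None.
    """
    mismatches = []
    for ref in sorted(Path(reference_dir).iterdir()):
        if not ref.is_file():
            continue
        copy = Path(ceph_dir) / ref.name
        ref_sum = md5_of(ref)
        copy_sum = md5_of(copy) if copy.is_file() else None
        if ref_sum != copy_sum:
            mismatches.append((ref.name, ref_sum, copy_sum))
    return mismatches
```

With the paths from the transcript, the call would be find_mismatches("/srv/ceph/iso", "/mnt/ceph/static/ubuntu/.pool"). An empty list means every reference file has an identical copy on the Ceph mount.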
I then checked a few objects added through the RADOS Gateway:
Bucket: thesimpsons
- The.Simpsons.0101.Simpsons.Roasting.on.an.Open.Fire.avi: OK
- The.Simpsons.0105.Bart.the.General.avi: OK
- The.Simpsons.2119.The.Squirt.And.The.Whale.avi: OK
- The.Simpsons.2016.Eeny.Teeny.Maya.Moe.avi: OK
- South.Park.0102.Weight.Gain.4000.avi: OK
- South.Park.0207.City.on.the.Edge.of.Forever.avi: OK
- South.Park.1407.Crippled.Summer.avi: OK
As you can see, I wasn't able to find any corrupted objects among those added through the RADOS Gateway.
I also have three RBD devices:
- alpha
- beta
- charlie
Each of them holds one virtual machine (Ubuntu 10.04). After the cluster recovery both alpha and charlie would start, but beta got stuck at GRUB:
Booting from Hard Disk... error: file not found.
I haven't checked the RBD devices further, but I am sure beta worked before this 'crash'.
The clients on which I ran the file comparisons were all running the unstable branch, commit d7cb963f7b4e0a0ccc49673b7c3ef595e81122a4,
with kernel 2.6.35 from Ubuntu (2.6.35-15-server).
Updated by Sage Weil over 13 years ago
Okay, I've determined that it's not so much corruption as missing objects (the corrupt bits are holes/zeros).
And we don't have detailed logging for that period. :(
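One way to check a damaged file for this hole pattern is to compare it block by block against a known-good copy and note which differing blocks are entirely zero. The following Python sketch is illustrative only and is not the analysis performed for this ticket. The function name classify_blocks is hypothetical, and the 4 MiB default block size is an assumption about the object size in use; adjust it to the actual file layout.

```python
def classify_blocks(good_path, bad_path, block_size=4 * 1024 * 1024):
    """Compare two files block by block and classify each differing block.

    Returns a list of (offset, kind) tuples, where kind is:
      "zeros"     - the block in bad_path consists only of zero bytes
      "truncated" - bad_path has no data at this offset
      "other"     - the block differs in some other way
    block_size defaults to 4 MiB, which is an assumed object size.
    """
    results = []
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        offset = 0
        while True:
            g = good.read(block_size)
            b = bad.read(block_size)
            if not g and not b:
                break  # both files are exhausted
            if g != b:
                if not b:
                    kind = "truncated"
                elif b.count(0) == len(b):
                    kind = "zeros"
                else:
                    kind = "other"
                results.append((offset, kind))
            offset += block_size
    return results
```

If every reported block is of kind "zeros" and the offsets fall on object boundaries, that is consistent with whole objects missing rather than with data being overwritten.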
Updated by Sage Weil over 13 years ago
- Status changed from New to Can't reproduce