Bug #376

closed

File corruption after cluster crashes

Added by Wido den Hollander over 13 years ago. Updated about 13 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%


Description

A few days ago my cluster lost 5 of its 12 OSDs due to #371 (one OSD) and #367 (four OSDs).

After #367 was fixed (which took about 36 hours) I was able to get my cluster back to 11 OSDs, but it then would not fully recover because of bug #374.

In total the cluster was degraded for about 72 hours. During this time I did not perform any I/O (in particular, no writes) against the cluster.

When checking file integrity, I found that the checksums of various (large) files no longer matched.

A few comparisons:

root@logger:/srv/ceph/iso# md5sum ubuntu-6.06.2-server-i386.iso /mnt/ceph/static/ubuntu/.pool/ubuntu-6.06.2-server-i386.iso
1bf5938af8a61b8de749c43ba8071181  ubuntu-6.06.2-server-i386.iso
f43706f26a22050a8e91962cf44dcbbd  /mnt/ceph/static/ubuntu/.pool/ubuntu-6.06.2-server-i386.iso
root@logger:/srv/ceph/iso#
root@logger:/srv/ceph/iso# md5sum ubuntu-9.04-alternate-i386.iso /mnt/ceph/static/ubuntu/.pool/ubuntu-9.04-alternate-i386.iso
c564ae16dffb51a922aef74a07250473  ubuntu-9.04-alternate-i386.iso
66ee645951c47698aea1a0a7a5be3815  /mnt/ceph/static/ubuntu/.pool/ubuntu-9.04-alternate-i386.iso
root@logger:/srv/ceph/iso# cat /mnt/ceph/static/ubuntu/9.04/MD5SUMS|grep ubuntu-9.04-alternate-i386.iso
c564ae16dffb51a922aef74a07250473 *ubuntu-9.04-alternate-i386.iso
root@logger:/srv/ceph/iso# 

Here you can see that both of these ISOs have been corrupted.

It seems that only large files were affected: as far as I could see, small files (such as HTML and TXT) were not corrupted.
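For anyone who wants to repeat this check over more than a couple of files, the manual md5sum comparison above could be scripted roughly as follows. This is only a sketch: the helper names `md5_of` and `find_mismatches` are made up for illustration, and it assumes a known-good local copy of each file exists under the same name.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(reference_dir, ceph_dir, names):
    """Compare each named file in reference_dir with its copy in ceph_dir.

    Returns a list of (name, reference_md5, ceph_md5) tuples for files whose
    digests differ. Both directories are assumed to use the same file names.
    """
    mismatches = []
    for name in names:
        ref_sum = md5_of(Path(reference_dir) / name)
        ceph_sum = md5_of(Path(ceph_dir) / name)
        if ref_sum != ceph_sum:
            mismatches.append((name, ref_sum, ceph_sum))
    return mismatches
```

With the paths from the transcript, a call would look like `find_mismatches("/srv/ceph/iso", "/mnt/ceph/static/ubuntu/.pool", ["ubuntu-6.06.2-server-i386.iso", "ubuntu-9.04-alternate-i386.iso"])`; an empty list would mean every copy matched.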

I then checked a few objects added through the RADOS Gateway:

Bucket: thesimpsons
  • The.Simpsons.0101.Simpsons.Roasting.on.an.Open.Fire.avi: OK
  • The.Simpsons.0105.Bart.the.General.avi: OK
  • The.Simpsons.2119.The.Squirt.And.The.Whale.avi: OK
  • The.Simpsons.2016.Eeny.Teeny.Maya.Moe.avi: OK
Bucket: southpark
  • South.Park.0102.Weight.Gain.4000.avi: OK
  • South.Park.0207.City.on.the.Edge.of.Forever.avi: OK
  • South.Park.1407.Crippled.Summer.avi: OK

As you can see, I wasn't able to find any corrupted objects among those added through the RADOS Gateway.

I also have three RBD devices:
  • alpha
  • beta
  • charlie

Each of these holds one virtual machine (Ubuntu 10.04). After the cluster recovered, both alpha and charlie would start, but beta got stuck at GRUB:

Booting from Hard Disk...
error: file not found.

I haven't checked the RBD devices further, but I am sure beta worked before this 'crash'.

The clients on which I ran these file comparisons were all running the unstable branch at commit d7cb963f7b4e0a0ccc49673b7c3ef595e81122a4, with kernel 2.6.35 from Ubuntu (2.6.35-15-server).

Actions #1

Updated by Sage Weil over 13 years ago

Okay, I've determined that it's not so much corruption as missing objects (the corrupt bits are holes/zeros).

And we don't have detailed logging for that period. :(
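One way to see the holes for yourself, given a known-good copy of a file, is to walk both copies block by block and report the blocks that are entirely zero in the suspect copy but not in the reference. The sketch below is illustrative only: the function name `find_zeroed_blocks` is made up, and the 4 MiB default block size is an assumption about the object size in use, so adjust it to match the actual file layout.

```python
def find_zeroed_blocks(reference_path, suspect_path, block_size=4 * 1024 * 1024):
    """Return indices of blocks that are all zeros in the suspect file only.

    A block is reported when it differs from the reference and every byte
    of the suspect block is zero. Blocks that are zero in both copies, and
    blocks missing because the suspect file is shorter, are not reported.
    block_size is an assumed object size, not read from the cluster.
    """
    zeroed = []
    with open(reference_path, "rb") as ref, open(suspect_path, "rb") as sus:
        index = 0
        while True:
            ref_block = ref.read(block_size)
            sus_block = sus.read(block_size)
            if not ref_block and not sus_block:
                break  # both files exhausted
            if (
                sus_block
                and sus_block != ref_block
                and sus_block.count(0) == len(sus_block)
            ):
                zeroed.append(index)
            index += 1
    return zeroed
```

Multiplying each returned index by `block_size` gives the byte offset of the hole, which may help in mapping it back to a specific object.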

Actions #2

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce

Actions #3

Updated by longguang yue about 13 years ago

Has this bug been resolved?
