Bug #3810: btrfs corrupts file size on 3.7 - Ceph - Ceph

Actions

Copy link

Bug #3810

closed

btrfs corrupts file size on 3.7

Added by Mike Lowe over 11 years ago. Updated about 11 years ago.

Status:

Resolved

Priority:

High

Assignee:

Category:

Target version:

% Done:

Source:

Community (user)

Tags:

Backport:

Regression:

Severity:

1 - critical

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

After creating a new ceph cluster pg's become inconsistent after using the qemu client. Logs indicate that the primary osd may be truncating objects. Object sizes on the primary and secondary do not match in the file store, the primary being smaller. Debugging logs will be uploaded via scp to ceph.com as jmlowelogs1.tgz and jmlowelogs2.tgz.

Files

object_log.txt (138 KB) object_log.txt

Sage Weil, 01/28/2013 03:41 PM

Actions

Copy link

Updated by Dan Mick over 11 years ago

Assignee set to Samuel Just

Actions

Copy link

Updated by Samuel Just over 11 years ago

Status changed from New to Need More Info

I need a dump of the xattrs on the d0c18e1d/605.00000000/head//1 object in pg 1.1d on osd 7 and osd 0

Actions

Copy link

Updated by Sage Weil about 11 years ago

Assignee deleted (~~Samuel Just~~)

Actions

Copy link

Updated by Mike Lowe about 11 years ago

Additional info, btrfs scrubs were done while the osd's were active which may or may not have had a negative effect.
Kernel from here http://kernel.ubuntu.com/~kernel-ppa/mainline/

Linux version 3.7.0-030700-generic (root@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201212102335 SMP Tue Dec 11 04:36:24 UTC 2012

Actions

Copy link

Updated by Mike Lowe about 11 years ago

Samuel Just wrote:

I need a dump of the xattrs on the d0c18e1d/605.00000000/head//1 object in pg 1.1d on osd 7 and osd 0

The pg seems empty, are you sure that isn't from the earlier incarnation of the cluster? There is some overlap in the logs so you can trace the full life cycle of all of the objects.

ls -la /data/osd.7/current/1.1d_head
total 0
drwxr-xr-x 1 root root 0 Jan 13 15:09 .
drwxr-xr-x 1 root root 6620 Jan 23 13:57 ..

ls -la /data/osd.0/current/1.1d_head
total 0
drwxr-xr-x 1 root root 0 Jan 13 15:09 .
drwxr-xr-x 1 root root 12270 Jan 15 20:12 ..

Actions

Copy link

Updated by Sage Weil about 11 years ago

the going theory is that this is triggered by btrfs scrub. can we confirm this somehow?

Actions

Copy link

Updated by Mike Lowe about 11 years ago

How about this object instead:
2013-01-23 18:41:31.336722 osd.7 149.165.228.11:6800/28046 159 : [ERR] 2.202 osd.0: soid f0fb3a02/rbd_data.10852ae8944a.0000000000000e83/head//2 size 4194304 != known size 4141056, digest 2695173195 != known digest 3906316306
2013-01-23 18:41:31.347478 osd.7 149.165.228.11:6800/28046 160 : [ERR] deep-scrub 2.202 f0fb3a02/rbd_data.10852ae8944a.0000000000000e83/head//2 on disk size (4141056) does not match object info size (4194304)

/data/osd.0/current/2.202_head/DIR_2/DIR_0/DIR_A# ~/listxattrs rbd\\udata.10852ae8944a.0000000000000e83__head_F0FB3A02__2
ceph._
0000000: 0a08 e700 0000 0403 4700 0000 0000 0000 ........G.......
0000010: 2600 0000 7262 645f 6461 7461 2e31 3038 &...rbd_data.108
0000020: 3532 6165 3839 3434 612e 3030 3030 3030 52ae8944a.000000
0000030: 3030 3030 3030 3065 3833 feff ffff ffff 0000000e83......
0000040: ffff 023a fbf0 0000 0000 0002 0000 0000 ...:............
0000050: 0000 0004 0310 0000 0002 0000 0000 0000 ................
0000060: 00ff ffff ff00 0000 0000 0000 007c 0200 .............|..
0000070: 0000 0000 0008 0000 0077 0200 0000 0000 .........w......
0000080: 0008 0000 0002 0215 0000 0008 6c10 0000 ............l...
0000090: 0000 0000 6731 0100 0000 0000 0000 0000 ....g1..........
00000a0: 0000 4000 0000 0000 a5c0 f450 18bc 800f ..@........P....
00000b0: 0202 1500 0000 0000 0000 0000 0000 0000 ................
00000c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000e0: 7c02 0000 0000 0000 0800 0000 00 |............
ceph.snapset
0000000: 0202 1900 0000 0000 0000 0000 0000 0100 ................
0000010: 0000 0000 0000 0000 0000 0000 0000 00 ...............

/data/osd.7/current/2.202_head/DIR_2/DIR_0/DIR_A# ~/listxattrs rbd\\udata.10852ae8944a.0000000000000e83__head_F0FB3A02__2
ceph._
0000000: 0a08 e700 0000 0403 4700 0000 0000 0000 ........G.......
0000010: 2600 0000 7262 645f 6461 7461 2e31 3038 &...rbd_data.108
0000020: 3532 6165 3839 3434 612e 3030 3030 3030 52ae8944a.000000
0000030: 3030 3030 3030 3065 3833 feff ffff ffff 0000000e83......
0000040: ffff 023a fbf0 0000 0000 0002 0000 0000 ...:............
0000050: 0000 0004 0310 0000 0002 0000 0000 0000 ................
0000060: 00ff ffff ff00 0000 0000 0000 007c 0200 .............|..
0000070: 0000 0000 0008 0000 0077 0200 0000 0000 .........w......
0000080: 0008 0000 0002 0215 0000 0008 6c10 0000 ............l...
0000090: 0000 0000 6731 0100 0000 0000 0000 0000 ....g1..........
00000a0: 0000 4000 0000 0000 a5c0 f450 18bc 800f ..@........P....
00000b0: 0202 1500 0000 0000 0000 0000 0000 0000 ................
00000c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000e0: 7c02 0000 0000 0000 0800 0000 00 |............
ceph.snapset
0000000: 0202 1900 0000 0000 0000 0000 0000 0100 ................
0000010: 0000 0000 0000 0000 0000 0000 0000 00 ...............

Actions

Copy link

Updated by Bill Kenworthy about 11 years ago

I have been hit by the same thing ... is there any information you need before I try and fix it further.

Ive tried repair and deep-scrub so far with no fix ... is there a suggested way or just kill the osd and recreate it?

2013-01-25 10:17:20.287622 osd.3 [INF] 0.10 repair ok, 0 fixed
2013-01-25 10:17:20.315800 osd.3 [INF] 0.64 repair ok, 0 fixed
2013-01-25 10:17:22.148984 osd.3 [ERR] repair 3.1 188da1b1/rb.0.1862.6b8b4567.0000000006a7/head//3 on disk size (4128768) does not match object info size (4194304)
2013-01-25 10:17:28.484629 mon.0 [INF] pgmap v32355: 776 pgs: 776 active+clean; 49128 MB data, 116 GB used, 1153 GB / 1280 GB avail
2013-01-25 10:17:24.807691 osd.3 [ERR] repair 3.2 293764aa/rb.0.1862.6b8b4567.000000000222/head//3 on disk size (4182016) does not match object info size (4194304)
2013-01-25 10:17:25.481110 osd.3 [INF] 3.3 repair ok, 0 fixed
2013-01-25 10:17:25.550800 osd.3 [ERR] 3.1 repair 1 errors, 0 fixed
2013-01-25 10:17:25.595865 osd.3 [ERR] 3.2 repair 1 errors, 0 fixed
2013-01-25 10:17:25.636671 osd.3 [INF] 3.4 repair ok, 0 fixed

Actions

Copy link

Updated by Sage Weil about 11 years ago

Bill Kenworthy wrote:

I have been hit by the same thing ... is there any information you need before I try and fix it further.

Ive tried repair and deep-scrub so far with no fix ... is there a suggested way or just kill the osd and recreate it?

2013-01-25 10:17:20.287622 osd.3 [INF] 0.10 repair ok, 0 fixed
2013-01-25 10:17:20.315800 osd.3 [INF] 0.64 repair ok, 0 fixed
2013-01-25 10:17:22.148984 osd.3 [ERR] repair 3.1 188da1b1/rb.0.1862.6b8b4567.0000000006a7/head//3 on disk size (4128768) does not match object info size (4194304)
2013-01-25 10:17:28.484629 mon.0 [INF] pgmap v32355: 776 pgs: 776 active+clean; 49128 MB data, 116 GB used, 1153 GB / 1280 GB avail
2013-01-25 10:17:24.807691 osd.3 [ERR] repair 3.2 293764aa/rb.0.1862.6b8b4567.000000000222/head//3 on disk size (4182016) does not match object info size (4194304)
2013-01-25 10:17:25.481110 osd.3 [INF] 3.3 repair ok, 0 fixed
2013-01-25 10:17:25.550800 osd.3 [ERR] 3.1 repair 1 errors, 0 fixed
2013-01-25 10:17:25.595865 osd.3 [ERR] 3.2 repair 1 errors, 0 fixed
2013-01-25 10:17:25.636671 osd.3 [INF] 3.4 repair ok, 0 fixed

Bill- Can you tell us what fs you were using (xfs, btrfs) and what the rbd workload looked like? Did anything else happen to the cluster (node crashes etc.)? What version are you running? Was it always that version, or did you upgrade from something older?

To repair, find the object in $osd_data/current/3.1_head/* and 'truncate --size 4194304 $filename'. You should do both replicas, or just the one on the primary osd, and then repair to push that copy to the replica.

Actions

Copy link

#10

Updated by Bill Kenworthy about 11 years ago

Version was 55.1 when created and the error occurred, now updated to 56.1 (on gentoo) after error
Its organised as 5 osd's, all as 256G btrfs partitions on a single 2T disk, with 3 mon's and 3 msd's ... idea is to set up a mini data centre managed by libvirsh and expand as I create/move vm's across from my old system and shift disks (and therefore reorganise the osd's) as the disks become available. Result is it is on a dual quad core xeon with 16G of ram but quite resource constrained.

The error first showed when the cluster was scrubbing (as in was printed by ceph -w) As for when, the processors get very unresponsive on deletes via rbd so that may have been a cause (as I removed some images shortly before that time).

I'll have a go at the repair ... I presume that means the data in that pg is corrupted?

BillK

Actions

Copy link

#11

Updated by Sage Weil about 11 years ago

Bill Kenworthy wrote:

Version was 55.1 when created and the error occurred, now updated to 56.1 (on gentoo) after error
Its organised as 5 osd's, all as 256G btrfs partitions on a single 2T disk, with 3 mon's and 3 msd's ... idea is to set up a mini data centre managed by libvirsh and expand as I create/move vm's across from my old system and shift disks (and therefore reorganise the osd's) as the disks become available. Result is it is on a dual quad core xeon with 16G of ram but quite resource constrained.

The error first showed when the cluster was scrubbing (as in was printed by ceph -w) As for when, the processors get very unresponsive on deletes via rbd so that may have been a cause (as I removed some images shortly before that time).

I'll have a go at the repair ... I presume that means the data in that pg is corrupted?

BillK

Did you ever do any btrfs scrubbing? Which kernel version?

That one object was truncated, so at least a few bytes are missing (now zeroes) in that RBD image. Chances are it is some file written around that point in time; you should probably track down which image it is and fsck it to be safe. You can figure out which by doing 'rbd info' on each image and seeing which one has prefix rb.0.1862.6b8b4567.0000000006a7.

Actions

Copy link

#12

Updated by Bill Kenworthy about 11 years ago

Kernel was 3.7.1
Ran btrfsck on the partitions when the error first occurred with nothing found.

Tried your fix of truncating the files - didnt work because as soon as I restarted ceph the files were recreated with the wrong size (even if I deleted the secondary).

Was able to locate the image from your info above, created a copy and deleted the original. The next repair fixed an inconsistency error and now everything looks ok.

Thanks,
BillK

Actions

Copy link

#13

Updated by Mike Lowe about 11 years ago

I deleted the rbd's with inconsistent pg's, recreated the rbd's, ran rsync with the same data set, made sure no btrfs scrub ran, and I have 44 inconsistent pg's again. I no longer believe it is related to btrfs scrub.

Actions

Copy link

#14

Updated by Sage Weil about 11 years ago

Mike, would it be possible to reproduce this with debug file store = 20? That will tell us if what Ceph thinks it did explains the later inconsistency, or if its btrfs getting confused... And if so, will point us to what code may be responsible.

Actions

Copy link

#15

Updated by Mike Lowe about 11 years ago

I can do that, it will take somewhere between 12 and 24 hours to run.

Actions

Copy link

#16

Updated by Mike Lowe about 11 years ago

I ran part of the workload and found an inconsistent pg. I've uploaded ceph.log and logs from the primary and secondary osd's as jmlowelogs-3.tgz.

Actions

Copy link

#17

Updated by Sage Weil about 11 years ago

File object_log.txt object_log.txt added

Ok, this looks like a btrfs bug to me. On osd.3, the write extends the file size to 4194304, but the later stat sees 4186112.

2013-01-28 11:55:53.170640 7f46fb667700 15 filestore(/data/osd.3) write 2.1fe_head/1a941ffe/rb.0.1329.238e1f29.0000000017bf/head//2 4185600~8704
2013-01-28 11:55:53.170689 7f46fb667700 10 filestore(/data/osd.3) write 2.1fe_head/1a941ffe/rb.0.1329.238e1f29.0000000017bf/head//2 4185600~8704 = 8704
... more writes, only writes ...
2013-01-28 13:25:54.095103 7f46f2655700 10 filestore(/data/osd.3) stat 2.1fe_head/1a941ffe/rb.0.1329.238e1f29.0000000017bf/head//2 = 0 (size 4186112)

Note that osd.2 sees the same thing:

2013-01-28 11:55:53.059862 7f944b8a9700 15 filestore(/data/osd.2) write 2.1fe_head/1a941ffe/rb.0.1329.238e1f29.0000000017bf/head//2 4185600~8704
2013-01-28 11:55:53.059910 7f944b8a9700 10 filestore(/data/osd.2) write 2.1fe_head/1a941ffe/rb.0.1329.238e1f29.0000000017bf/head//2 4185600~8704 = 8704
...
2013-01-28 13:25:54.033343 7f9442897700 10 filestore(/data/osd.2) stat 2.1fe_head/1a941ffe/rb.0.1329.238e1f29.0000000017bf/head//2 = 0 (size 4194304)

and gets the correct file size.

My guess is that the write is written inline in the btree because it is not adjacent to other data extends, and the conversion later does not get the file size correct somehow.

I've attached the log for just IO to that object (osd.2 and osd.3 combined), sorted by time.

Actions

Copy link

#18