Bug #1376

errant scrub stat mismatch logs after upgrade

Added by John Leach over 12 years ago. Updated over 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
% Done:

0%


Description

Upgraded from git commit #394537092d to git commit #68cbbf42c42, and after restarting the cluster I immediately saw many "scrub stat mismatch" errors:

2011-08-08 21:45:00.590610   log 2011-08-08 21:44:53.557073 osd1 10.42.77.90:6800/10375 1 : [ERR] 0.2 scrub stat mismatch, got 63/63 objects, 0/0 clones, 542633/18446744073701705641 bytes, 566/18446744073709543990 kb.
2011-08-08 21:45:01.646482   log 2011-08-08 21:44:56.610022 osd2 10.219.16.42:6801/16231 1 : [ERR] 0.6 scrub stat mismatch, got 76/76 objects, 0/0 clones, 3822237/18446744073709179549 bytes, 3772/18446744073709551292 kb.
2011-08-08 21:45:03.779548   log 2011-08-08 21:44:55.965413 osd3 10.200.35.118:6800/1574 10 : [ERR] 0.15 scrub stat mismatch, got 83/83 objects, 0/0 clones, 2329158/18446744073707686470 bytes, 2321/18446744073709549841 kb.

As these came in from multiple OSDs, and just after an upgrade (I've never seen them before and have done many upgrades), it looks more like a bug than real data corruption.
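
As an aside, the huge "expected" numbers in those messages look like small negative values printed as unsigned 64-bit integers: 2^64 - 324 = 18446744073709551292, which is exactly the "kb" figure osd2 reported. That would fit a stats-accounting underflow rather than real on-disk corruption. A minimal C++ sketch of that interpretation (not Ceph code, just an illustration with a made-up value):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Assumption: the PG's accumulated kb stat has drifted slightly negative
        // (here -324); printing it with an unsigned 64-bit format then yields
        // 2^64 - 324 = 18446744073709551292, matching the scrub log above.
        int64_t num_kb = -324;
        std::printf("%llu kb\n", (unsigned long long)num_kb);
        return 0;
    }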

Cluster layout is:

2011-08-08 22:45:29.712106    pg v35386: 800 pgs: 800 active+clean; 2679 MB data, 13032 MB used, 2979 GB / 3152 GB avail
2011-08-08 22:45:29.713446   mds e91021: 2/2/2 up {0=1=up:active,1=0=up:active}
2011-08-08 22:45:29.713474   osd e102: 4 osds: 4 up, 4 in
2011-08-08 22:45:29.713520   log 2011-08-08 22:38:41.736291 osd3 10.200.35.118:6800/2525 297 : [ERR] 0.15 scrub 1 errors
2011-08-08 22:45:29.713583   mon e1: 3 mons at {0=10.126.174.94:6789/0,1=10.82.103.194:6789/0,2=10.115.202.218:6789/0}

The cluster is just a test cluster with no real data, and at the time of the upgrade it had no clients accessing it.

Attached is a log from one of the OSDs after manually requesting a scrub (debug level 20).


Files

osd.3.log.gz (773 KB), John Leach, 08/08/2011 03:50 PM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #1453: osd: warn on object_info_t::size != st_size when building scrub_map (Resolved, Josh Durgin, 08/28/2011)

Actions #1

Updated by Sage Weil over 12 years ago

  • Target version set to v0.35
Actions #2

Updated by Sage Weil over 12 years ago

  • Position set to 25
Actions #3

Updated by John Leach over 12 years ago

Just tried writing some data to the ceph filesystem on this cluster and got this message:

2011-08-20 19:26:24.661143   log 2011-08-20 19:16:14.568350 mds1 10.234.213.118:6800/3918 2 : [ERR] dir 20000000441.20000000441 object missing on disk; some files may be lost

Not sure if it's related in any way; I've never seen a message like that before.

Actions #4

Updated by Greg Farnum over 12 years ago

Missing objects on disk sure make it look like data corruption. Your cluster's pretty old, right? Is it still in this state?

Actions #5

Updated by John Leach over 12 years ago

It's a few weeks old, yes, but there was no other evidence of corruption (such as filesystem corruption).

I just deleted the OSD data directory on osd1, re-added it to the cluster, let it rebuild, then ran a scrub, and the errors came up again.

e.g:

2011-08-26 20:59:12.580635   log 2011-08-26 20:59:05.796145 osd1 10.42.77.90:6800/10085 20 : [ERR] 0.7 scrub stat mismatch, got 1769/1769 objects, 0/0 clones, 5458110067/5449721459 bytes, 5330431/5322239 kb.
2011-08-26 20:59:12.580635   log 2011-08-26 20:59:05.796161 osd1 10.42.77.90:6800/10085 21 : [ERR] 0.7 scrub 1 errors

This is with git commit 9538e87e0 now.

Actions #6

Updated by Sage Weil over 12 years ago

I think this is caused by an old bug. Scrub needs to be fixed to properly detect (and ideally repair) it. See #1453.
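
For reference, #1453 is about warning when an object's recorded object_info_t::size disagrees with the on-disk st_size while building the scrub map. A rough, simplified sketch of that kind of check (hypothetical stand-in types and names, not the actual Ceph code):

    #include <cstdint>
    #include <cstdio>

    // Simplified stand-in for Ceph's per-object metadata (object_info_t).
    struct object_info_t {
        uint64_t size;
    };

    // Hypothetical check while building the scrub map: compare the size recorded
    // in the object metadata against the size stat() reports for the on-disk file.
    void scrub_check_size(const char* oid, const object_info_t& oi, uint64_t st_size) {
        if (oi.size != st_size)
            std::fprintf(stderr, "[ERR] %s: object_info size %llu != on-disk size %llu\n",
                         oid, (unsigned long long)oi.size, (unsigned long long)st_size);
    }

    int main() {
        object_info_t oi{4096};                         // metadata claims 4096 bytes
        scrub_check_size("0.7/example_obj", oi, 8192);  // hypothetical object; stat() says 8192
        return 0;
    }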

Actions #7

Updated by Josh Durgin over 12 years ago

  • Status changed from New to 4

If you still have this cluster around, could you try applying 8293dfabb554883a30af549447995390fafa1f62 to see whether the problem is the old bug?

Actions #8

Updated by John Leach over 12 years ago

I upgraded to get that patch, but also got the on-disk filestore update patch, which was buggy and broke all my OSDs, so I can't test this any more, sorry.

Actions #9

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.35 to v0.36
Actions #10

Updated by Sage Weil over 12 years ago

  • Status changed from 4 to Resolved

Ok. Well, we're pretty sure what the inconsistency was, and we now complain about it (though we don't repair it just yet). Making repair work is tracked in another bug, #1474.

Actions #11

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.36 to v0.35
  • Position deleted (69)
  • Position set to 1
  • Position changed from 1 to 898