Bug #1376: errant scrub stat mismatch logs after upgrade - Ceph - Ceph

Bug #1376

closed

errant scrub stat mismatch logs after upgrade

Added by John Leach over 12 years ago. Updated over 12 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: OSD
Target version: v0.35

Description

upgraded from git commit #394537092d to git commit #68cbbf42c42, and after restarting the cluster I immediately saw many "scrub stat mismatch" errors:

2011-08-08 21:45:00.590610   log 2011-08-08 21:44:53.557073 osd1 10.42.77.90:6800/10375 1 : [ERR] 0.2 scrub stat mismatch, got 63/63 objects, 0/0 clones, 542633/18446744073701705641 bytes, 566/18446744073709543990 kb.
2011-08-08 21:45:01.646482   log 2011-08-08 21:44:56.610022 osd2 10.219.16.42:6801/16231 1 : [ERR] 0.6 scrub stat mismatch, got 76/76 objects, 0/0 clones, 3822237/18446744073709179549 bytes, 3772/18446744073709551292 kb.
2011-08-08 21:45:03.779548   log 2011-08-08 21:44:55.965413 osd3 10.200.35.118:6800/1574 10 : [ERR] 0.15 scrub stat mismatch, got 83/83 objects, 0/0 clones, 2329158/18446744073707686470 bytes, 2321/18446744073709549841 kb.

As these came in from multiple OSDs, and just after an upgrade (I have never seen them before and have done many upgrades), it looks more like a bug than real data corruption.
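The second value in each bytes and kb pair above sits just below 2^64, which is what a small negative number looks like when printed as an unsigned 64-bit integer. The sketch below is not Ceph code; the helper names `as_signed64` and `decode_mismatch` are made up for illustration, and it assumes the counters are 64-bit. It only reinterprets the numbers from the first log line:

```python
import re

# Matches the "a/b <field>" pairs in a scrub stat mismatch message.
PAIR = re.compile(r"(\d+)/(\d+) (objects|clones|bytes|kb)")


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    return value - (1 << 64) if value >= (1 << 63) else value


def decode_mismatch(line: str) -> dict:
    """Return {field: (first, second)} with both values read as signed 64-bit."""
    return {
        name: (as_signed64(int(a)), as_signed64(int(b)))
        for a, b, name in PAIR.findall(line)
    }


# Message text copied from the osd1 log line in this report.
line = ("[ERR] 0.2 scrub stat mismatch, got 63/63 objects, 0/0 clones, "
        "542633/18446744073701705641 bytes, 566/18446744073709543990 kb.")
print(decode_mismatch(line))
# {'objects': (63, 63), 'clones': (0, 0), 'bytes': (542633, -7845975), 'kb': (566, -7626)}
```

Read this way, the second bytes and kb figures for pg 0.2 are -7845975 and -7626, i.e. one side of the comparison appears to have gone negative rather than holding a huge real value.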

Cluster layout is:

2011-08-08 22:45:29.712106    pg v35386: 800 pgs: 800 active+clean; 2679 MB data, 13032 MB used, 2979 GB / 3152 GB avail
2011-08-08 22:45:29.713446   mds e91021: 2/2/2 up {0=1=up:active,1=0=up:active}
2011-08-08 22:45:29.713474   osd e102: 4 osds: 4 up, 4 in
2011-08-08 22:45:29.713520   log 2011-08-08 22:38:41.736291 osd3 10.200.35.118:6800/2525 297 : [ERR] 0.15 scrub 1 errors
2011-08-08 22:45:29.713583   mon e1: 3 mons at {0=10.126.174.94:6789/0,1=10.82.103.194:6789/0,2=10.115.202.218:6789/0}

The cluster is just a test cluster with no real data, and at the time of the upgrade no clients were accessing it.

attached log of one of the osds after manually requesting a scrub (debug level 20)

Files

osd.3.log.gz (773 KB)

John Leach, 08/08/2011 03:50 PM

Related issues 1 (0 open — 1 closed)

Updated by Sage Weil over 12 years ago

Target version set to v0.35

Updated by John Leach over 12 years ago

Just tried writing some data to the ceph filesystem on this cluster and got this message:

2011-08-20 19:26:24.661143   log 2011-08-20 19:16:14.568350 mds1 10.234.213.118:6800/3918 2 : [ERR] dir 20000000441.20000000441 object missing on disk; some files may be lost

not sure if it's related in any way - never seen a message like that before.

Updated by Greg Farnum over 12 years ago

Missing objects on disk sure make it look like data corruption. Your cluster's pretty old, right? Is it still in this state?

Updated by John Leach over 12 years ago

It's a few weeks old, yes, but there was no other evidence of corruption (such as filesystem corruption).

I just deleted the OSD data directory on osd1, re-added it to the cluster, and let it rebuild. I then ran a scrub and the errors came up again.

e.g:

2011-08-26 20:59:12.580635   log 2011-08-26 20:59:05.796145 osd1 10.42.77.90:6800/10085 20 : [ERR] 0.7 scrub stat mismatch, got 1769/1769 objects, 0/0 clones, 5458110067/5449721459 bytes, 5330431/5322239 kb.
2011-08-26 20:59:12.580635   log 2011-08-26 20:59:05.796161 osd1 10.42.77.90:6800/10085 21 : [ERR] 0.7 scrub 1 errors

This is with git commit 9538e87e0 now.
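For what it's worth, the two sides of this later mismatch differ by a round number. The snippet below just does the subtraction on the values from the 0.7 log line; it makes no claim about which side is the scrubbed total and which is the recorded one, and it says nothing about the cause:

```python
# Values copied from the pg 0.7 "scrub stat mismatch" line above.
first_bytes, second_bytes = 5458110067, 5449721459
first_kb, second_kb = 5330431, 5322239

byte_delta = first_bytes - second_bytes
kb_delta = first_kb - second_kb

print(byte_delta, byte_delta == 8 * 1024 * 1024)  # 8388608 True
print(kb_delta, kb_delta * 1024 == byte_delta)    # 8192 True
```

So the bytes and kb counters disagree by exactly 8 MiB, and the two deltas are consistent with each other.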

Updated by Sage Weil over 12 years ago

I think this is caused by an old bug. scrub needs to be fixed to properly detect (and ideally repair) it. See #1453.

Updated by Josh Durgin over 12 years ago

Status changed from New to 4

If you still have this cluster around, could you try applying 8293dfabb554883a30af549447995390fafa1f62 to see whether the problem is the old bug?

Updated by John Leach over 12 years ago

I upgraded to get that patch, but also got the on-disk filestore update patch, which was buggy and broke all my OSDs, so I can't test this any more, sorry.

Updated by Sage Weil over 12 years ago

Target version changed from v0.35 to v0.36

Updated by Sage Weil over 12 years ago

Status changed from 4 to Resolved

Ok. Well we're pretty sure what the inconsistency was, and we now complain about it (tho we don't repair it just yet). Making repair work is another bug, #1474.

Updated by Sage Weil over 12 years ago

Target version changed from v0.36 to v0.35
