Bug #16691

closed

sepia LRC lost directories

Added by Greg Farnum almost 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: fsck/damage handling
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If you log in to the sepia long-running cluster, it has 37 directories whose objects it lost.

I spot-checked one of them, and the object exists but has zero size and no omap keys. All the directories are old (early 2014 at the newest) and they appear to be sequentially numbered. Taken together, this makes me think that we ran the tmap2omap upgrade tool on the cluster and something went wrong with these older directories (perhaps older encodings weren't handled correctly).

Actions #1

Updated by Zheng Yan almost 8 years ago

What do you mean by "they are old"? What does 'rados stat xxxx' show?

Actions #2

Updated by John Spray almost 8 years ago

  • Assignee set to Greg Farnum
Actions #3

Updated by John Spray almost 8 years ago

The plan is for Greg to look into the TMAP2OMAP OSD code for what might have caused that.

Afterwards, John and Doug will try to clean up the cluster with our repair tools.

Actions #4

Updated by Greg Farnum almost 8 years ago

  • Subject changed from sepia LRC lost directories (tmap2omap went bad?) to sepia LRC lost directories
  • Category changed from Correctness/Safety to fsck/damage handling
  • Assignee changed from Greg Farnum to John Spray

Well, I checked the code again and the tmap2omap path looks appropriately durable.

I did notice one thing that helps explain it a little: we pass a "nullok" flag when invoking tmap2omap, which makes the operation succeed even if there is no data present in the object. This is required, since if the directory is already an omap, there is no tmap data. But it means a previously-broken object won't get detected during this upgrade. :(
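
To illustrate the nullok behaviour described above, here is a minimal Python sketch. It is not the OSD implementation: `FakeObject` and `tmap_to_omap` are hypothetical names, and the in-memory dicts are only assumed stand-ins for an object's tmap data and omap keys.

```python
class FakeObject:
    """Hypothetical in-memory stand-in for a RADOS object (not a Ceph API)."""

    def __init__(self, tmap=None, omap=None):
        self.tmap = dict(tmap or {})  # legacy tmap entries, if any
        self.omap = dict(omap or {})  # omap key/value pairs, if any


def tmap_to_omap(obj, nullok=False):
    """Move tmap entries into omap.

    With nullok=True, an object that has no tmap data is treated as
    success and left untouched. With nullok=False it is an error.
    """
    if not obj.tmap:
        if not nullok:
            raise ValueError("no tmap data to convert")
        return  # nothing to do; reported as success
    obj.omap.update(obj.tmap)
    obj.tmap.clear()


if __name__ == "__main__":
    legacy = FakeObject(tmap={"file_head": "dentry"})      # never upgraded
    converted = FakeObject(omap={"file_head": "dentry"})   # already omap
    broken = FakeObject()                                  # zero size, no keys

    for name, obj in [("legacy", legacy), ("converted", converted), ("broken", broken)]:
        tmap_to_omap(obj, nullok=True)
        print(name, "omap keys:", sorted(obj.omap))
```

In this sketch, calling with `nullok=False` would reject the already-converted object as well as the broken one, which is why the flag is needed. The cost is that the broken object passes through the upgrade looking the same as a success.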

Anyway, this cluster has been damaged in various ways in the past. I think these directories simply got broken in the depths of time and are only now being noticed (so, hurray damage detection!).

Assigning to John for him and Doug to clean up.

Actions #5

Updated by John Spray almost 8 years ago

(Mainly for my reference) The etherpad from the repair is here: http://etherpad.corp.redhat.com/efev9SA7rn

Actions #6

Updated by John Spray almost 8 years ago

  • Priority changed from Urgent to High

The offending dentries that point to damaged dirfrags have been removed (by removing the omap keys). The objects themselves are still in the system but not in a way that will make it unhappy.

Actions #7

Updated by John Spray over 7 years ago

  • Status changed from New to Resolved