Bug #16691: sepia LRC lost directories - CephFS - Ceph

Bug #16691

closed

sepia LRC lost directories

Added by Greg Farnum almost 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee: John Spray
Category: fsck/damage handling
Source: Q/A
Severity: 3 - minor

Description

If you log in to the sepia long-running cluster (LRC), you will find 37 directories whose objects have been lost.

I spot-checked one of them, and the object exists but has zero size and no omap keys. All the directories are old (early 2014 at the newest) and they appear to be sequentially-numbered. This sum of evidence makes me think that we ran the tmap2omap upgrade tool on it and something went wrong with these older directories (perhaps older encodings weren't handled correctly).
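A minimal sketch of the spot check described above, in Python. It only encodes the symptom reported here (object present, zero size, no omap keys) and the usual naming of dirfrag objects in the metadata pool (inode number in hex, a dot, then the fragment id as 8 hex digits). The `ObjectInfo` container and the function names are hypothetical; the size and key count are assumed to have been gathered separately, for example with `rados stat` and `rados listomapkeys`. A legitimately empty directory may look much the same, so this flags candidates rather than proving damage.

```python
from dataclasses import dataclass


@dataclass
class ObjectInfo:
    """Facts gathered about one RADOS object (hypothetical container)."""
    size: int       # object size in bytes, e.g. from `rados stat`
    omap_keys: int  # number of omap keys, e.g. from `rados listomapkeys`


def dirfrag_object_name(ino: int, frag: int = 0) -> str:
    """Metadata-pool object name for a directory fragment.

    Assumes the conventional "<ino hex>.<frag as 8 hex digits>" form;
    frag 0 is the single fragment of an unfragmented directory.
    """
    return f"{ino:x}.{frag:08x}"


def looks_lost(info: ObjectInfo) -> bool:
    """True if the object matches the symptom in this report:
    it exists, but holds no data and no omap keys."""
    return info.size == 0 and info.omap_keys == 0


if __name__ == "__main__":
    # Hypothetical inode number, for illustration only.
    print(dirfrag_object_name(0x10000000000))
    print(looks_lost(ObjectInfo(size=0, omap_keys=0)))
```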

Updated by Zheng Yan almost 8 years ago

What do you mean by "old"? What does 'rados stat xxxx' show?

Updated by John Spray almost 8 years ago

Assignee set to Greg Farnum

Updated by John Spray almost 8 years ago

Plan is for Greg to look into the TMAP2OMAP OSD code for what might have caused that.

Afterwards, John and Doug will try to clean up the cluster with our repair tools.

Updated by Greg Farnum almost 8 years ago

Subject changed from sepia LRC lost directories (tmap2omap went bad?) to sepia LRC lost directories
Category changed from Correctness/Safety to fsck/damage handling
Assignee changed from Greg Farnum to John Spray

Well, I checked the code again and the tmap2omap path looks appropriately durable.

I did notice one thing that helps explain it a little: we pass a "nullok" flag when invoking tmap2omap, which makes the operation succeed even if there is no data present in the object. This is required, since if the directory is already an omap, there is no tmap data. But it means a previously-broken object won't get detected during this upgrade. :(
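To illustrate why "nullok" hides a previously-broken object, here is a toy model in Python. It is not the OSD code: the function name, the use of `-ENODATA` as the failure code, and the omitted conversion step are all assumptions made for the sketch. The point is only that an already-converted directory and an empty, damaged object both present "no tmap data", so with the flag set they return the same result.

```python
import errno


def tmap2omap_model(tmap_data: bytes, nullok: bool) -> int:
    """Toy model of the conversion: returns 0 on success, negative errno on failure.

    Hypothetical stand-in for the real operation; the error code is an assumption.
    """
    if not tmap_data:
        # No tmap payload: either the directory is already an omap,
        # or the object was emptied by earlier damage. The model cannot
        # tell these apart, which is the problem described above.
        return 0 if nullok else -errno.ENODATA
    # A real conversion would decode the tmap and write omap keys here.
    return 0


if __name__ == "__main__":
    already_omap = b""   # healthy directory, converted earlier
    broken_empty = b""   # damaged object with nothing in it
    print(tmap2omap_model(already_omap, nullok=True))
    print(tmap2omap_model(broken_empty, nullok=True))
```

Without the flag, the model would reject both cases, which is why the flag is needed for directories that are already omaps.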

Anyway, this cluster has been damaged in various ways in the past. I think these directories simply got broken in the depths of time and are only now being noticed (so, hurray damage detection!).

Assigning to John for him and Doug to clean up.

Updated by John Spray almost 8 years ago

(Mainly for my reference) etherpad from repairing is here http://etherpad.corp.redhat.com/efev9SA7rn

Updated by John Spray almost 8 years ago

Priority changed from Urgent to High

The offending dentries that point to damaged dirfrags have been removed (by removing the omap keys). The objects themselves are still in the system but not in a way that will make it unhappy.
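For reference, a hedged sketch of what "removing the omap keys" could look like. The thread does not record the exact commands that were run, so everything here is an assumption: the helper names are hypothetical, the pool, object, and directory names are placeholders, and the dentry key is assumed to follow the usual `<name>_head` form for the head (non-snapshot) dentry in the parent directory's object. The helper only builds the argument list for `rados rmomapkey`; it does not run anything.

```python
from typing import List


def dentry_omap_key(name: str) -> str:
    """Omap key for a head dentry, assuming the "<name>_head" convention."""
    return f"{name}_head"


def rmomapkey_argv(pool: str, parent_dirfrag_obj: str, dentry_name: str) -> List[str]:
    """Build (but do not run) a `rados rmomapkey` command that would drop
    one dentry from its parent directory's dirfrag object."""
    return [
        "rados", "-p", pool,
        "rmomapkey", parent_dirfrag_obj, dentry_omap_key(dentry_name),
    ]


if __name__ == "__main__":
    # Placeholder pool, object, and directory names, for illustration only.
    print(" ".join(rmomapkey_argv("metadata", "10000000000.00000000", "olddir")))
```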

Updated by John Spray over 7 years ago

Status changed from New to Resolved
