Feature #4145

MDS: design and implement a backwards-scanning fsck

Added by Greg Farnum about 11 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

We've discussed this a little bit, but we eventually want a mechanism that looks through all the RADOS objects in our metadata and data pools and reconstructs a filesystem hierarchy from them. The latest description of the problem we have is reproduced below from an email thread on ceph-devel:

A reverse scan fsck will only be started at admin request, or if a forward scrub detects inconsistencies. It disables client writes on the cluster.
Very broadly:
One MDS is the scrub leader, responsible for maintaining the scrub list. It might initially contain the list of problem inodes found in a forward scrub, but it is in general populated by iterating through all the objects in the metadata (and then data) pools.
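As a rough illustration of that pool-listing pass (not how the MDS would implement it), here is a minimal sketch using the librados Python bindings; the pool names and the inode-number parse from object names of the form "<inode-in-hex>.<index>" are assumptions made for the example:

    import rados

    # Illustrative pool names -- substitute the filesystem's actual pools.
    METADATA_POOL = 'cephfs_metadata'
    DATA_POOL = 'cephfs_data'

    def seed_scrub_list(conffile='/etc/ceph/ceph.conf'):
        """Collect candidate inode numbers by listing every object in the
        metadata and data pools; object names look like '<ino-hex>.<index>',
        so the prefix before the first '.' is parsed as the inode number."""
        scrub_list = set()
        cluster = rados.Rados(conffile=conffile)
        cluster.connect()
        try:
            for pool in (METADATA_POOL, DATA_POOL):
                ioctx = cluster.open_ioctx(pool)
                try:
                    for obj in ioctx.list_objects():
                        prefix = obj.key.split('.', 1)[0]
                        try:
                            scrub_list.add(int(prefix, 16))
                        except ValueError:
                            pass  # not an inode-backed object (journal, etc.)
                finally:
                    ioctx.close()
        finally:
            cluster.shutdown()
        return scrub_list

A real pass would presumably stream this worklist rather than materialize the whole set in memory, but the shape of the scan is the same.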
For each directory or file head object, if it is not marked as already scrubbed into place, the scrub leader attempts to find that item within the already-known tree, using the (coming very shortly!) lookup-by-ino functionality. If it can't place the inode, it chooses to temporarily believe the backtrace on the inode and creates the necessary directories and links, marking them as tentative and recording the version of the backtrace they came from. It then starts a forward scrub on the dirfrag closest to the root that it was able to retrieve off disk (which might be nothing, if it can't find any). This forward scrub is also marked as based on a tentative backtrace, with the version it came from.
Any inconsistencies the forward scrub finds are marked and written to reference objects for later review (this would include things like "I'm sure the backtrace this inode has pointing to me is wrong, because I have a higher version and lack a dentry for it"). Similarly, if the forward scrub finds objects on disk with outdated data, it updates their data and marks the reference objects to note that the object was fixed (and the version it was fixed up to).
If it finds newer data on disk, it incorporates that into the current tree (with the tentative markings and the associated versions). If the newer data points to a dirfrag that isn't yet in the tree, it inserts a fake entry and puts it at the bottom of the scrub queue. It then continues the forward scrub from the node it was on.
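A purely illustrative sketch of the "believe the backtrace" step: it assumes the backtrace is stored in the 'parent' xattr of the inode's first object (which pool that object lives in depends on file vs. directory), stubs out decoding of that binary blob to a hypothetical helper, and builds the tentative links in a toy in-memory tree tagged with the backtrace version:

    import rados

    class TentativeNode(object):
        """One entry in the reconstructed tree; 'tentative' and
        'backtrace_version' record that the link was taken on faith from a
        backtrace rather than from an authoritative dirfrag."""
        def __init__(self, ino, name, tentative=False, backtrace_version=0):
            self.ino = ino
            self.name = name
            self.tentative = tentative
            self.backtrace_version = backtrace_version
            self.children = {}   # dentry name -> TentativeNode

    def read_backtrace_blob(ioctx, ino):
        """Fetch the raw backtrace stored in the 'parent' xattr of the
        inode's first object; returns None if there is nothing to believe."""
        oid = '%x.00000000' % ino
        try:
            return ioctx.get_xattr(oid, 'parent')
        except rados.Error:
            return None

    def place_by_backtrace(root, ino, ancestors, version):
        """Walk the root-first (ino, dentry-name) chain that a decoded
        backtrace claims, creating tentative directories as needed, and
        return the directory node that should hold the scanned inode.
        Decoding the blob into 'ancestors' is left to a hypothetical
        decode_backtrace() helper, not shown here."""
        cur = root
        for dir_ino, dname in ancestors:
            child = cur.children.get(dname)
            if child is None:
                child = TentativeNode(dir_ino, dname, tentative=True,
                                      backtrace_version=version)
                cur.children[dname] = child
            cur = child
        return cur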
If, during either a forward or reverse scrub, we find an on-disk version that assigns authority for a subtree we're accessing to another node, we stop any ongoing activity and ship it to the authoritative node. If we discover that we should have authority over a subtree that somebody else is currently holding, we send them a message and they stop working on it and ship it over to us.
An object which has no backtrace and no forward referents gets placed into a lost+found directory. :(
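Continuing the same toy tree from the sketch above (reusing its TentativeNode class), the lost+found fallback might look like:

    def place_lost_and_found(root, ino, version=0):
        """Fallback for the sketch above: an object with no backtrace and no
        forward referent gets a tentative entry under lost+found, named after
        its inode number so it can at least be inspected later."""
        lf = root.children.get('lost+found')
        if lf is None:
            lf = TentativeNode(0, 'lost+found', tentative=True)  # placeholder ino
            root.children['lost+found'] = lf
        name = '%x' % ino
        lf.children[name] = TentativeNode(ino, name, tentative=True,
                                          backtrace_version=version)
        return lf.children[name]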
Once we've completely traversed the CephFS pools, we take the existing tentative metadata as correct, toss out the pre-fsck versions, and clean up.

This obviously elides a lot of important details, but I think it describes an object-listing-based fsck that we can use to recover all the data the cluster has into the filesystem hierarchy in a way that scales. I believe the most difficult part not described here will be a mechanism for maintaining both the original, unchanged data and the in-progress fsck versions of the inodes, in a way that preserves our standard hierarchy migration mechanisms, journaling (or perhaps not, in this mode), and directory object management tools. Assuming we can do that (I think we can!), this won't be fast, but it will be robust and hopefully not many times slower than an optimal algorithm would be.
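One way to picture that "two versions at once" requirement, as a hypothetical bookkeeping sketch rather than anything the MDS actually stores: each inode keeps its untouched pre-fsck metadata next to the tentative fsck copy, and the final pass promotes one and discards the other.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FsckInodeState:
        """Hypothetical per-inode bookkeeping: the untouched pre-fsck
        metadata and the in-progress reconstruction coexist until the
        traversal finishes."""
        ino: int
        original: Optional[dict] = None    # metadata as found on disk, unmodified
        tentative: Optional[dict] = None   # the fsck's in-progress version
        backtrace_version: int = 0

        def finalize(self, keep_tentative):
            """At the end of the traversal, promote the tentative copy and
            drop the pre-fsck version, or fall back to the original if the
            fsck is aborted."""
            return self.tentative if keep_tentative else self.original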

History

#1 Updated by Greg Farnum almost 8 years ago

  • Status changed from New to Resolved

It looks a little different now, and we have other tickets to improve stuff, but cephfs-data-scan should qualify this as resolved.

#2 Updated by Greg Farnum over 7 years ago

  • Component(FS) MDS added
