CephFS fsck Progress/Ongoing Design

John has built up a bunch of tools for repair, and forward scrub is partly implemented. In this session we'll describe the current state and the next steps and design challenges.

Greg Farnum (Red Hat)
John Spray (Red Hat)

Interested Parties
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)

Current Status
There is a PR for backwards repair tools (depending on sharded pgls to run scalably). There are single-node scrub calls in the tree and a branch with some amount of forward scrub work. These need to be integrated and made scalable.

cephfs-data-scan (branch wip-offline-backward)

This tool (landing soon) is capable of reconstructing the metadata pool even if everything in it is deleted (which is just what the test for it does). As a disaster recovery tool it's always going to be "best effort", but it lets you get at your files. It can also dump out files to a local filesystem instead of reconstructing cephfs metadata.

Backward scrub (aka cephfs-data-scan) operates in two stages:

  • scan_extents: for all data objects, update xattrs on the inode's 0th object (i.e. the xyz.00000000 object) to record the highest object index seen for that inode, from which the size of the inode can be calculated. The same procedure is used for mtime. Also record the size of the largest object seen, to aid in guessing the object size in the layout.
  • scan_inodes: for all 0th objects, use the backtrace and the results from scan_extents to generate a path and some metadata for where we hope to see a dentry for the inode. Open or create the ancestor dirfrags in the path, and insert a dentry linking to the inode.
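The two stages above can be illustrated with a small in-memory sketch. This is a toy model, not the tool's code: the pool is a plain dict, DataObject, Dentry, the xattr key names and the nested-dict directory tree are all hypothetical, the object index in the name is assumed to be hexadecimal, and the backtrace is simplified to a root-first list of path components. The size estimate (highest index times guessed object size, plus the size of the highest object) is one plausible reading of the description.

```python
from collections import namedtuple
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-ins for RADOS objects and recovered dentries.
Dentry = namedtuple("Dentry", "ino size mtime")


@dataclass
class DataObject:
    size: int
    mtime: float
    backtrace: Optional[list] = None          # only meaningful on the 0th object
    xattrs: dict = field(default_factory=dict)


def split_name(name):
    """Split '<inode hex>.<object index hex>' into (inode, index)."""
    ino, index = name.split(".")
    return ino, int(index, 16)


def scan_extents(pool):
    """Stage 1: fold per-object facts into xattrs on each inode's 0th object."""
    for name, obj in pool.items():
        ino, index = split_name(name)
        zeroth = pool.get(ino + ".00000000")
        if zeroth is None:
            continue  # no 0th object to annotate; this sketch just skips it
        x = zeroth.xattrs
        if index > x.get("max_index", -1):
            x["max_index"] = index
            x["max_index_size"] = obj.size
        x["max_object_size"] = max(x.get("max_object_size", 0), obj.size)
        x["max_mtime"] = max(x.get("max_mtime", 0.0), obj.mtime)


def scan_inodes(pool, root):
    """Stage 2: for each 0th object, create ancestors and link a dentry."""
    for name, obj in pool.items():
        ino, index = split_name(name)
        if index != 0 or not obj.backtrace:
            continue
        x = obj.xattrs  # requires scan_extents to have run first
        size = x["max_index"] * x["max_object_size"] + x["max_index_size"]
        *dirs, leaf = obj.backtrace
        directory = root
        for component in dirs:
            directory = directory.setdefault(component, {})  # open or create
        directory[leaf] = Dentry(ino, size, x["max_mtime"])
```

Running scan_extents over the whole pool before scan_inodes matters: the second stage only reads the 0th objects and trusts the xattrs accumulated by the first.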

Tickets #12130 thru #12145 record a bunch of improvements for cephfs-data-scan that have already been thought of. Notably:
#12137 - doing a similar backward scan process for the metadata pool as well as a data pool, in order to link in orphaned dirs
#12136 - detect snapshots during backward scan

New classes for creating damage scenarios already exist in ceph-qa-suite. We should continue to extend these as we add repair functionality, and extend the mechanism to cover testing the detection of faults by the forward scrub.

The tool was originally written to support parallelism using the now-discarded first cut of the sharded pgls; it'll regain that ability once the second attempt at sharded pgls lands. Currently this relies on the user going out and starting a whole range of instances of the tool on many clients, but we can write a tool to orchestrate this (#12143).
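As a rough illustration of how an orchestrator might split the scan between instances, the sketch below assigns each object to exactly one worker using a stable hash of its name. This is an assumption made for illustration only: sharded pgls would divide the listing on the server side rather than filtering names on the client, and shard_of and objects_for_worker are hypothetical helpers, not part of cephfs-data-scan.

```python
import zlib


def shard_of(object_name, num_workers):
    """Map an object name to a worker id with a hash that is stable across runs."""
    return zlib.crc32(object_name.encode("utf-8")) % num_workers


def objects_for_worker(object_names, worker_id, num_workers):
    """Return the subset of objects that this worker instance should scan."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return [n for n in object_names if shard_of(n, num_workers) == worker_id]
```

Because every instance computes the same mapping independently, the workers need no coordination beyond agreeing on their ids and the total worker count.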

To avoid attempting to inject linkage for every single inode in the filesystem, cephfs-data-scan should soon be changed to consume a flag set by forward scrub, to indicate which inodes are potential orphans. These would be a small minority of inodes in the usual mode of operation, where the metadata pool is partially damaged rather than entirely lost.


There is a nascent wip-damage-table branch. This is for recording where damage has been found in the filesystem metadata:

  • so that we don't keep trying to load broken things (look up in damage table first)
  • so that we can see what we need to repair
  • so that we can remain online and EIO on access to damaged subtrees, rather than marking the whole rank DAMAGED on finding errors

Currently that's just hooking into the error detection in ::fetch() methods, but it would be fed by forward scrub too.
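The intended behaviour of the damage table can be sketched as follows. This is a minimal model under stated assumptions, not the wip-damage-table code: DamageTable, notify, covering_entry and the fetch wrapper are hypothetical names, metadata is addressed by path for simplicity, and a ValueError from the loader stands in for a decode failure detected during a fetch.

```python
import errno


class DamageTable:
    """Records damaged metadata locations, keyed by normalized path."""

    def __init__(self):
        self._damaged = {}  # path -> human-readable reason

    @staticmethod
    def _parts(path):
        return [p for p in path.split("/") if p]

    def notify(self, path, reason):
        """Record damage found at this path."""
        self._damaged["/" + "/".join(self._parts(path))] = reason

    def covering_entry(self, path):
        """Return the damaged path that is, or contains, this path (else None)."""
        parts = self._parts(path)
        for depth in range(len(parts), -1, -1):
            candidate = "/" + "/".join(parts[:depth])
            if candidate in self._damaged:
                return candidate
        return None

    def entries(self):
        """What needs repairing, for an operator or a repair tool."""
        return dict(self._damaged)


def fetch(table, path, loader):
    """Consult the damage table first; on a new failure, record it and EIO."""
    hit = table.covering_entry(path)
    if hit is not None:
        raise OSError(errno.EIO, "damaged metadata under " + hit)
    try:
        return loader(path)
    except ValueError as exc:  # stand-in for a decode error
        table.notify(path, str(exc))
        raise OSError(errno.EIO, "damaged metadata at " + path) from exc
```

In this model an access to anything beneath a damaged entry fails with EIO without touching the broken object again, while the rest of the tree stays available.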

Detailed Description

Work items
This section should contain a list of work tasks created by this blueprint.  Please include engineering tasks as well as related build/release and documentation work.  If this blueprint requires cleanup of deprecated features, please list those tasks as well.

Coding tasks
Task 1
Task 2
Task 3

Build / release tasks
Task 1
Task 2
Task 3

Documentation tasks
Task 1
Task 2
Task 3

Deprecation tasks
Task 1
Task 2
Task 3