CephFS - Forward Scrub


Last year, we spent a while planning and discussing how we wanted to implement fsck in CephFS. That consisted of two parts:
  1. "Forward scrub", in which we start from the root inode and look at everything we can touch in the hierarchy to make sure it is consistent
  2. "Backward scan", in which we look at every RADOS object in the filesystem pools and try to place it into the hierarchy (and do any necessary repairs).

Forward scrub is now in progress; the design session will cover its current state and any outstanding issues that have arisen during implementation. Depending on progress and time constraints, we will also discuss how to start developing the backward scan.


Owners

  • Greg Farnum (Inktank/Red Hat)
  • Sage Weil (Inktank/Red Hat)

Current Status

We have designed and created tracker tickets of reasonable granularity for this task, and work has started (although it does not exactly conform to the tickets). Some of it can be viewed on the wip-forward-scrub branch.

The wip-inode-scrub branch has been submitted for review. It contains the functionality needed to scrub a single on-disk inode.

There is some ongoing work in wip-forward-scrub to implement the ScrubStack and the CDentry/CDir/CInode state (described below), but it's currently a bit messy.

Detailed Description

See #4137 for the full written description of the algorithm we're shooting for.

Translating that into real code, I am working on the "ScrubStack" and its implementation. The ScrubStack is going to hold a stack of (pinned) CDentries. When the ScrubStack is ready to start scrubbing a new inode, it will:
  • ask the top CDentry for the next item to scrub.
    • The CDentry will resolve itself down to either an inode or a directory.
    • If it is an inode, return its dentry and pop it off the ScrubStack.
    • If it is a directory, return the next dentry in it that needs to be scrubbed.
  • If the dentry needing scrub is a directory, push it on top of the ScrubStack and query it for the next dentry to scrub (as above).
  • Invoke MDCache::scrub_dentry() on the dentry.
  • When scrub_dentry() hits our callback, check that it succeeded and then start again from the top!
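The traversal above can be sketched roughly as follows. This is only an illustration, not the code on the branch: Dentry is a hypothetical stand-in for CDentry, the scrub callable stands in for MDCache::scrub_dentry() and runs synchronously rather than through a callback, and the choice to scrub a directory itself once all of its children have been handed out is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CDentry; not the real Ceph class.
struct Dentry {
  std::string name;
  bool is_dir;
  std::vector<Dentry*> children;  // only meaningful for directories
  std::size_t next_child = 0;     // index of the next child still to scrub

  Dentry(std::string n, bool dir = false, std::vector<Dentry*> kids = {})
      : name(std::move(n)), is_dir(dir), children(std::move(kids)) {}
};

class ScrubStack {
 public:
  // Stand-in for MDCache::scrub_dentry(); returns true on success.
  using ScrubFn = std::function<bool(Dentry&)>;

  explicit ScrubStack(ScrubFn fn) : scrub_(std::move(fn)) {}

  void push(Dentry* d) {
    assert(d != nullptr);
    stack_.push_back(d);
  }

  // Pick the next dentry to scrub, or nullptr once the stack is empty.
  Dentry* next() {
    while (!stack_.empty()) {
      Dentry* top = stack_.back();
      if (!top->is_dir) {  // plain inode: return it and pop it off
        stack_.pop_back();
        return top;
      }
      if (top->next_child < top->children.size()) {
        Dentry* child = top->children[top->next_child++];
        if (child->is_dir) {  // directory: push it and query it in turn
          stack_.push_back(child);
          continue;
        }
        return child;
      }
      // Assumption: a directory is scrubbed itself after its children.
      stack_.pop_back();
      return top;
    }
    return nullptr;
  }

  // Scrub until the stack drains or a scrub fails; returns the number
  // of dentries scrubbed successfully.
  std::size_t run() {
    std::size_t scrubbed = 0;
    while (Dentry* d = next()) {
      if (!scrub_(*d)) break;
      ++scrubbed;
    }
    return scrubbed;
  }

 private:
  ScrubFn scrub_;
  std::vector<Dentry*> stack_;
};
```

Keeping the stack of dentries explicit, instead of recursing, is what would let a real implementation stop after any single scrub and resume later from the same position.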
The CDentry, CDir, and CInode each gain a scrub_info_t struct member (which is different for each of them!) which contains information on the scrub state of each of these.
CInode has:
  • last_scrub_stamp, last_scrub_version: the latest completed scrub on this inode (versions are relative to the parent directory version). These are flushed out to the inode_t whenever the inode is projected (or, eventually, on demand during the scrub).
  • scrub_start_stamp, scrub_start_version: the time and version at which an in-progress scrub began
  • and all of the above for each dirfrag it contains
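As a rough illustration of how these four fields might relate, here is a minimal sketch. The type names, the in_progress flag, the numeric frag key, and the rule that finishing a scrub copies the start values into the last_scrub_* fields are all assumptions of the sketch, not something the design above specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified scrub stamps; timestamps and versions are
// plain integers here rather than Ceph's own types.
struct ScrubStamps {
  std::uint64_t last_scrub_stamp = 0;     // latest completed scrub
  std::uint64_t last_scrub_version = 0;
  std::uint64_t scrub_start_stamp = 0;    // in-progress scrub, if any
  std::uint64_t scrub_start_version = 0;
  bool in_progress = false;               // helper flag, not in the design

  void start(std::uint64_t now, std::uint64_t version) {
    assert(!in_progress);
    scrub_start_stamp = now;
    scrub_start_version = version;
    in_progress = true;
  }

  // Assumption: a completed scrub is recorded as of the moment it began.
  void finish() {
    assert(in_progress);
    last_scrub_stamp = scrub_start_stamp;
    last_scrub_version = scrub_start_version;
    in_progress = false;
  }
};

// The inode keeps one set of stamps for itself and one per dirfrag.
struct InodeScrubInfo {
  ScrubStamps inode;
  std::map<std::uint32_t, ScrubStamps> dirfrags;  // keyed by a made-up frag id
};
```

Recording the start values rather than the completion time errs on the safe side: anything that changed while the scrub was running still looks newer than the last scrub.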
CDir has:
  • scrub_start_version, scrub_start_time: the version and time at which this CDir started scrubbing its contents
  • set<dentry_key_t> directories_[to_scrub|scrubbing|scrubbed]: the child directories that it needs to scrub, is currently scrubbing, or has scrubbed
  • set<dentry_key_t> others_[to_scrub|scrubbing|scrubbed]: the same, but for non-directory children
(We maintain separate sets for directories and non-directories because recursive scrubbing dirties each inode's scrub stamps, so we scrub subdirectories before regular files.)
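A minimal sketch of that bookkeeping, with children moving from to_scrub through scrubbing to scrubbed and directories being handed out first. DentryKey is a string stand-in for dentry_key_t, and the method names (next, finished, done) are made up for the example.

```cpp
#include <cassert>
#include <set>
#include <string>

using DentryKey = std::string;  // stand-in for dentry_key_t

struct DirScrubInfo {
  std::set<DentryKey> directories_to_scrub;
  std::set<DentryKey> directories_scrubbing;
  std::set<DentryKey> directories_scrubbed;
  std::set<DentryKey> others_to_scrub;
  std::set<DentryKey> others_scrubbing;
  std::set<DentryKey> others_scrubbed;

  // Hand out the next child to scrub, directories before everything else.
  // Returns false when nothing is left to hand out.
  bool next(DentryKey* out) {
    assert(out != nullptr);
    if (take(directories_to_scrub, directories_scrubbing, out)) return true;
    return take(others_to_scrub, others_scrubbing, out);
  }

  // Record that the scrub of a previously handed-out child completed.
  void finished(const DentryKey& key) {
    if (directories_scrubbing.erase(key) > 0) {
      directories_scrubbed.insert(key);
    } else if (others_scrubbing.erase(key) > 0) {
      others_scrubbed.insert(key);
    }
  }

  // True once nothing is waiting or in flight.
  bool done() const {
    return directories_to_scrub.empty() && directories_scrubbing.empty() &&
           others_to_scrub.empty() && others_scrubbing.empty();
  }

 private:
  // Move the first key of `from` into `to` and report it through `out`.
  static bool take(std::set<DentryKey>& from, std::set<DentryKey>& to,
                   DentryKey* out) {
    if (from.empty()) return false;
    *out = *from.begin();
    from.erase(from.begin());
    to.insert(*out);
    return true;
  }
};
```

The separate scrubbing sets matter once scrubs complete asynchronously: a directory is only done when both its to_scrub and scrubbing sets are empty.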
CDentry is less well-defined but currently has:
  • CDir *scrub_parent: the parent CDir whose scrub we are a member of
  • bool scrub_recursive: we want to scrub all recursive descendants of this dentry
  • bool scrub_children: we want to scrub all direct children of this dentry, regardless of scrub_recursive
  • Context *on_finish: the callback to activate when scrubbing of this dentry finishes
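Since the CDentry state is still loosely defined, the following is only one possible reading of it. Dir is an empty stand-in for CDir, a std::function replaces the Context callback, and the way the two flags are propagated to children (scrub_recursive is inherited, scrub_children is not) is an assumption made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <utility>

struct Dir {};  // empty stand-in for CDir

struct DentryScrubInfo {
  Dir* scrub_parent = nullptr;         // CDir whose scrub we belong to
  bool scrub_recursive = false;        // scrub all descendants
  bool scrub_children = false;         // scrub direct children regardless
  std::function<void(int)> on_finish;  // stand-in for Context*; gets a result code

  // Either flag means we have to look at the direct children.
  bool wants_children() const { return scrub_recursive || scrub_children; }

  // Assumed propagation: recursion is inherited by the child, while a
  // children-only request stops after one level.
  DentryScrubInfo for_child(Dir* parent) const {
    DentryScrubInfo child;
    child.scrub_parent = parent;
    child.scrub_recursive = scrub_recursive;
    child.scrub_children = false;
    return child;
  }

  // Fire the completion callback at most once.
  void finish(int result) {
    if (!on_finish) return;
    std::function<void(int)> cb = std::move(on_finish);
    on_finish = nullptr;
    cb(result);
  }
};
```

Under this reading, a request with only scrub_children set touches exactly one level below the dentry, and a request with scrub_recursive set keeps descending.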
So when ScrubStack wants to find the next dentry to scrub, it
  • looks at the CDentry on top of the stack
  • if it's a dir:
    • we find the first dirfrag whose last scrub predates the start of this scrub (that is, one we have not yet covered in this pass)
      • we look at that CDir's sets and get the first dentry to scrub (if it's a directory, the ScrubStack starts over from it)
  • if it's a file:
    • well, that was easy
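The directory case could look something like the sketch below. It assumes that a dirfrag still needs work when its last completed scrub is older than the start of the current scrub, uses made-up types (FragState, a numeric frag id, string dentry names), and marks a frag as covered as soon as all of its dentries have been handed out, which is a simplification of waiting for them to complete.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical per-dirfrag state, reduced to what the selection needs.
struct FragState {
  std::uint64_t last_scrub_stamp = 0;
  std::set<std::string> directories_to_scrub;
  std::set<std::string> others_to_scrub;
};

// Dirfrags of one directory inode, ordered by a made-up numeric frag id.
using Frags = std::map<std::uint32_t, FragState>;

// First dirfrag not yet covered by the scrub that began at `scrub_start`,
// or nullptr if every frag is up to date.
FragState* first_pending_frag(Frags& frags, std::uint64_t scrub_start) {
  for (auto& entry : frags) {
    if (entry.second.last_scrub_stamp < scrub_start) return &entry.second;
  }
  return nullptr;
}

// Pick the next dentry to scrub, directories first. Returns false once
// all frags are covered. `is_dir` tells the caller whether to push the
// result onto the stack and start over from it.
bool next_dentry(Frags& frags, std::uint64_t scrub_start,
                 std::string* name, bool* is_dir) {
  assert(name != nullptr && is_dir != nullptr);
  while (FragState* frag = first_pending_frag(frags, scrub_start)) {
    std::set<std::string>* source = nullptr;
    if (!frag->directories_to_scrub.empty()) {
      source = &frag->directories_to_scrub;
      *is_dir = true;
    } else if (!frag->others_to_scrub.empty()) {
      source = &frag->others_to_scrub;
      *is_dir = false;
    }
    if (source != nullptr) {
      *name = *source->begin();
      source->erase(source->begin());
      return true;
    }
    // Nothing left in this frag: treat it as covered by this pass.
    frag->last_scrub_stamp = scrub_start;
  }
  return false;
}
```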
Because the MDCache::scrub_dentry() function uses our generic MDRequest infrastructure, we get a lot of the locking handled for us. Just implementing the described logic will get us through most of the tickets. What remains is:
  1. Appropriately handling non-auth data
    1. we need to write an internal op wrapper that we can use to forward them
    2. and detect that they're non-auth and set up appropriate callbacks in the ScrubStack?
  2. This should cope well with items getting evicted from cache, but we need to handle migration of hierarchies that are being scrubbed. Right now they're auth-pinned, so they can't be migrated, but since scrub is meant to run continuously in the background we don't really want to block migration that way.
  3. (Just thought of this) Prevent scrubbing from moving dentries up the LRU
  4. Surfacing scrub errors to administrators in a useful way.

Work items

Coding tasks

#4138: add functionality to verify disk data is consistent [with CInode]
#4139: add scrub_stamp infrastructure and a function to scrub a single folder
#4140: add infrastructure to perform a blocking scrub of all authoritative data [within a single MDS]
#4141: Implement non-blocking scrub
#4142: Implement cross-MDS scrubbing [ie, initiate remote scrubs when required for a local scrub]
#4143: do not abort a scrub if part of its subtree gets migrated
#4144: do not abort a scrub if its hierarchy gets migrated
