MDS: Implement a forward-scrubbing mechanism.
Design and implement a system that checks the filesystem for consistency by starting at the root node.
So far we've had a number of conversations and written a few email descriptions; the latest is included below (though it could change). This is an umbrella task.
We maintain a stack of inodes to scrub. When a new scrub is requested, the inode in question goes into this stack at a position depending on how it's inserted.
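A rough sketch of what such a scrub stack could look like. The description above doesn't say which positions are allowed, so the two insertion points here (top and bottom) and the names `ScrubStack`, `push_top` and `push_bottom` are assumptions for illustration, not the MDS interface.

```python
from collections import deque


class ScrubStack:
    """Hypothetical scrub stack; the rightmost element is the top."""

    def __init__(self):
        self._items = deque()

    def push_top(self, inode):
        # Assumed use: scrubbed next, e.g. a child found during a running scrub.
        self._items.append(inode)

    def push_bottom(self, inode):
        # Assumed use: scrubbed only after everything already on the stack.
        self._items.appendleft(inode)

    def top(self):
        return self._items[-1] if self._items else None

    def pop(self):
        return self._items.pop()


stack = ScrubStack()
stack.push_top("/a")
stack.push_bottom("/b")
stack.push_top("/c")

order = []
while stack.top() is not None:
    order.append(stack.pop())
```

Here `/c` is scrubbed first, then `/a`, and `/b` last, since it was inserted at the bottom.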
We have a separate scrubbing thread in every MDS. This thread begins in the scrub_node(inode) function, passing in the inode on the top of the scrub stack.
scrub_node() starts by setting a new scrub_start_stamp and scrub_start_version on the inode (where the scrub_start_version is the version of the inode's parent).
If the node is a file:
the thread optionally spins off an async check of the backtrace (and in the future, optionally checks other metadata we might be able to add or pick up), then sleeps until finish_scrub(inode) is called. If it doesn't do the backtrace check, it calls finish_scrub() directly.
If the node is a dirfrag:
the thread puts the dirfrag's first child on the top of the stack and calls scrub_node(child). Note that this might involve reading the dirfrag off disk, etc.
finish_scrub(inode) is pretty simple. If the inode is a dirfrag, it first verifies that the parent's data matches the aggregate data of the children. Then, for dirfrags and files alike, it:
1) Sets last_scrubbed_stamp to scrub_start_stamp, and last_scrubbed_version to scrub_start_version.
2) Pops the inode off of the scrub stack, and checks whether the next thing up is the inode's parent.
3a) If so, and dentries remain in the parent dirfrag, calls scrub_node() on the dentry following this one.
3b) If so, and there are no remaining nodes in the parent dirfrag, checks that all the children were scrubbed (or modified) after the parent's scrub_start_version (we don't want to scrub hierarchies that were renamed into the tree following a scrub start), then calls finish_scrub() on the dirfrag.
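The scrub_node()/finish_scrub() walk described above can be sketched as follows. This is a single-threaded toy under several assumptions: the `Node` type and its `size` field (standing in for the aggregate data) are hypothetical, stamps are omitted in favor of versions only, and the backtrace check, the "children scrubbed since the parent's start version" check, and all locking are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Node:
    name: str
    version: int = 0
    size: int = 0  # hypothetical stand-in for the aggregate data
    children: Optional[List["Node"]] = None  # None => file, list => dirfrag
    parent: Optional["Node"] = None
    scrub_start_version: int = -1
    last_scrubbed_version: int = -1
    consistent: bool = True


def scrub_node(node, stack, log):
    # scrub_start_version is the version of the node's parent.
    node.scrub_start_version = node.parent.version if node.parent else node.version
    if node.children:
        first = node.children[0]
        stack.append(first)
        scrub_node(first, stack, log)
    else:
        # File (no backtrace check in this sketch) or empty dirfrag.
        finish_scrub(node, stack, log)


def finish_scrub(node, stack, log):
    if node.children is not None:
        # Dirfrag: the parent's data must match the aggregate of the children.
        node.consistent = node.size == sum(c.size for c in node.children)
    node.last_scrubbed_version = node.scrub_start_version
    log.append(node.name)
    stack.pop()
    parent = node.parent
    if parent is None or not stack or stack[-1] is not parent:
        return  # the next thing up is not our parent; nothing to continue here
    idx = parent.children.index(node)
    if idx + 1 < len(parent.children):
        nxt = parent.children[idx + 1]
        stack.append(nxt)
        scrub_node(nxt, stack, log)
    else:
        finish_scrub(parent, stack, log)


def mkdir(name, size, children):
    d = Node(name, size=size, children=children)
    for c in children:
        c.parent = d
    return d


root = mkdir("/", 3, [Node("a", size=1), mkdir("d", 2, [Node("b", size=2)])])
stack, log = [root], []
scrub_node(root, stack, log)
```

In this toy tree the nodes finish in depth-first order (`a`, `b`, `d`, then `/`), and the stack is empty once the root's finish_scrub() returns.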
If at any point the scrub thread finishes scrubbing a node which does not start up another one immediately (implying that another scrub got injected into the middle of one that was already running), it looks at the node in question. If it's a file, it calls scrub_node() on it. If it's a dirfrag, it finds the first dentry in the dirfrag with a last_scrubbed_version less than the dirfrag's last_scrubbed_version, puts that dentry on the scrub_stack, and calls scrub_node() on that dentry.
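The dirfrag resume rule above boils down to a linear search over the dentries. A minimal sketch, assuming dentries can be represented as hypothetical (name, last_scrubbed_version) pairs in dirfrag order:

```python
from typing import List, Optional, Tuple


def first_stale_dentry(dirfrag_last_scrubbed_version: int,
                       dentries: List[Tuple[str, int]]) -> Optional[str]:
    """Return the first dentry whose last_scrubbed_version is less than
    the dirfrag's, or None if every dentry is up to date."""
    for name, last_scrubbed_version in dentries:
        if last_scrubbed_version < dirfrag_last_scrubbed_version:
            return name
    return None


dentries = [("a", 7), ("b", 3), ("c", 2)]
resume_at = first_stale_dentry(7, dentries)
```

With these values the scrub would resume at `b`; `a` is already at the dirfrag's version and is skipped.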
This is simple enough in concept (although it will need to be broken up quite a bit more in order to do all the locking in a reasonably efficient fashion). To expand this to a multi-MDS system, modify it slightly according to the following rules:
1) Only the authoritative MDS for an inode can scrub that inode.
2) If you are scrubbing a tree and reach an inode for which you are not authoritative, you hand that scrub off to the authoritative MDS and wait for its result; in the meantime, you place the next inode in the tree on the top of the stack and start scrubbing it.
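As an illustration of the two rules, here is a sketch that splits a list of inodes into those scrubbed locally and those handed off, keyed by the authoritative rank. The `plan_scrub` helper and the `auth_rank` lookup are hypothetical; the real handoff would be a message exchange between MDSes, which is not modeled here.

```python
from typing import Callable, Dict, List, Tuple


def plan_scrub(inodes: List[str], my_rank: int,
               auth_rank: Callable[[str], int]) -> Tuple[List[str], Dict[int, List[str]]]:
    """Split inodes into local work and per-rank handoffs."""
    local: List[str] = []
    handed_off: Dict[int, List[str]] = {}
    for ino in inodes:
        rank = auth_rank(ino)
        if rank == my_rank:
            local.append(ino)  # rule 1: only the authoritative MDS scrubs it
        else:
            handed_off.setdefault(rank, []).append(ino)  # rule 2: pass it off
    return local, handed_off


# Hypothetical authority map: rank 0 owns /a and /c, rank 1 owns /b and /d.
auth = {"/a": 0, "/b": 1, "/c": 0, "/d": 1}
local, handed_off = plan_scrub(["/a", "/b", "/c", "/d"], 0, auth.get)
```

Rank 0 keeps scrubbing `/a` and `/c` itself while `/b` and `/d` are outstanding on rank 1.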
But of course you'll note this doesn't include what to do if the scrubbing turns up an issue. In the initial forward scrub implementation, this is lame: add the bad object to a designated key-value object in the RADOS metadata pool, and set an "inconsistent" flag on it that is propagated up through its ancestors.
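A sketch of that bookkeeping, with a plain dict standing in for the key-value object in the metadata pool and a set standing in for per-inode flags. Both containers and the `mark_inconsistent` name are assumptions; no RADOS calls are made.

```python
from pathlib import PurePosixPath
from typing import Dict, Set


def mark_inconsistent(path: str, damage_table: Dict[str, str],
                      flags: Set[str]) -> None:
    """Record a bad object and propagate an 'inconsistent' flag upward."""
    damage_table[path] = "inconsistent"  # stand-in for the key-value object
    p = PurePosixPath(path)
    flags.add(str(p))
    for ancestor in p.parents:  # /a/b, /a, / for the path /a/b/c
        flags.add(str(ancestor))


damage_table: Dict[str, str] = {}
flags: Set[str] = set()
mark_inconsistent("/a/b/c", damage_table, flags)
```

After the call, the flag is set on `/a/b/c` and on every ancestor up to `/`, while the table holds only the bad object itself.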
#4138 MDS: forward scrub: add functionality to verify disk data is consistent
#4139 MDS: forward scrub: add scrub_stamp infrastructure and a function to scrub a single folder
#4140 MDS: forward scrub: add infrastructure to perform a blocking scrub of all authoritative data
#4141 MDS: forward scrub: Implement non-blocking scrub
#4142 MDS: forward scrub: Implement cross-MDS scrubbing
#4143 MDS: forward scrub: do not abort a scrub if part of its subtree gets migrated
#4144 MDS: forward scrub: do not abort a scrub if its hierarchy gets migrated
#2 Updated by Greg Farnum over 3 years ago
I realized today that we probably want to optionally scrub directories that were renamed into place following a scrub start. Otherwise directories that get renamed a lot might never actually have their contents scrubbed.
Probably what we'd do if the option is enabled is look at the scrub stamps on the subdir, and scrub its contents if the stamp is older than the start stamp on the current scrub we're doing. This could apply all the way down the hierarchy, of course. (Of course we'd rather use monotonic versions than wall-clock stamps, but that gets harder to track...we might be able to work it out with old_inodes or something, though?)
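A sketch of the stamp comparison, assuming a hypothetical `Dir` record carrying a last_scrubbed_stamp and an `enabled` switch for the option. It uses wall-clock style stamps as described, and it descends into every subdirectory regardless of the parent's stamp, which is one possible reading of "all the way down the hierarchy".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dir:
    name: str
    last_scrubbed_stamp: float
    subdirs: List["Dir"] = field(default_factory=list)


def stale_dirs(d: Dir, scrub_start_stamp: float, enabled: bool = True) -> List[str]:
    """List directories whose last scrub predates the current scrub's start."""
    if not enabled:
        return []
    out: List[str] = []
    if d.last_scrubbed_stamp < scrub_start_stamp:
        out.append(d.name)
    for sub in d.subdirs:
        out.extend(stale_dirs(sub, scrub_start_stamp, enabled))
    return out


moved = Dir("moved", 5.0, [Dir("moved/old", 1.0), Dir("moved/new", 20.0)])
to_scrub = stale_dirs(moved, scrub_start_stamp=10.0)
```

Here `moved` and `moved/old` would be scrubbed, while `moved/new` is skipped because its stamp is newer than the scrub start.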