CephFS - Forward Scrub » History » Version 2

Jessica Mack, 07/01/2015 05:11 AM

h1. CephFS - Forward Scrub

h3. Summary

Last year, we spent a while planning and discussing how we wanted to implement fsck in CephFS. That consisted of two parts:
# "Forward scrub", in which we start from the root inode and look at everything we can touch in the hierarchy to make sure it is consistent.
# "Backward scan", in which we look at every RADOS object in the filesystem pools and try to place it into the hierarchy (and do any necessary repairs).
 
Forward scrub is now in progress; the design session will cover its current state and any outstanding issues that have arisen during implementation. Depending on progress and time constraints, we will also discuss how to start developing the backward scan.

h3. Owners

* Greg Farnum (Inktank/Red Hat)
* Sage Weil (Inktank/Red Hat)
* Name (Affiliation)

h3. Interested Parties

* Name (Affiliation)
* Name (Affiliation)

h3. Current Status

We have designed and created tracker tickets of reasonable granularity for this task, and work has started (although it does not exactly conform to the tickets). Some of it can be viewed on the wip-forward-scrub branch.
 
The wip-inode-scrub branch has been submitted for review: https://github.com/ceph/ceph/pull/2814
It contains functionality enabling the scrub of a single on-disk inode.
 
There is some ongoing work in wip-forward-scrub to implement the ScrubStack and the CDentry/CDir/CInode scrub state (described below), but it's currently a bit messy.

h3. Detailed Description

See #4137 and https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg12568.html for the full written description of the algorithm we're shooting for.
 
Translating that into real code, I am working on the "ScrubStack" and its implementation. The ScrubStack is going to hold a stack of (pinned) CDentries. When the ScrubStack is ready to start scrubbing a new inode, it will:
* Ask the top CDentry for the next item to scrub.
** The CDentry will resolve itself down to either an inode or a directory.
** If an inode, return its dentry and pop it off the ScrubStack.
** If a directory, return the next dentry in it which needs to be scrubbed.
* If the dentry needing scrub is a directory, push it on top of the ScrubStack and query it for the next dentry to scrub (as above).
* Invoke MDCache::scrub_dentry() on the dentry.
* When scrub_dentry() hits our callback, check that it succeeded and then start from the top!
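
The loop above can be sketched as a small, self-contained C++ model. This is only an illustration under stated assumptions: Dentry, ScrubStack::next() and run_scrub() are hypothetical stand-ins rather than the real MDS classes, the asynchronous MDCache::scrub_dentry() callback is modelled as a synchronous function that returns success or failure, and the sketch assumes a directory's own dentry is scrubbed once its children are exhausted, which the description above leaves open.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a pinned CDentry; not the real MDS class.
struct Dentry {
  std::string name;
  bool is_dir = false;
  std::vector<Dentry*> children;  // only meaningful for directories
  std::size_t next_child = 0;     // index of the next child still to scrub
};

class ScrubStack {
  std::vector<Dentry*> stack_;  // back() is the top of the stack
public:
  void push(Dentry* d) { stack_.push_back(d); }
  bool empty() const { return stack_.empty(); }

  // Return the next dentry to scrub, or nullptr once the stack is drained.
  Dentry* next() {
    while (!stack_.empty()) {
      Dentry* top = stack_.back();
      if (!top->is_dir) {  // plain inode: hand it out and pop it
        stack_.pop_back();
        return top;
      }
      if (top->next_child < top->children.size()) {
        Dentry* child = top->children[top->next_child++];
        if (child->is_dir) {  // directory: push it and query it in turn
          stack_.push_back(child);
          continue;
        }
        return child;  // regular file inside the directory
      }
      // Assumption: an exhausted directory is itself scrubbed as it is popped.
      stack_.pop_back();
      return top;
    }
    return nullptr;
  }
};

// Drive the stack until it is empty or a scrub fails. `scrub_dentry` stands in
// for MDCache::scrub_dentry() plus its completion callback.
inline std::vector<std::string> run_scrub(
    Dentry* root, const std::function<bool(Dentry*)>& scrub_dentry) {
  std::vector<std::string> order;
  ScrubStack stack;
  stack.push(root);
  while (Dentry* d = stack.next()) {
    if (!scrub_dentry(d)) break;  // check that the scrub succeeded...
    order.push_back(d->name);     // ...and then start from the top
  }
  return order;
}
```

In this model a subdirectory is pushed and drained before the walk returns to its parent, so the stack depth tracks the depth of the hierarchy being scrubbed.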

The CDentry, CDir, and CInode each gain a scrub_info_t struct member (which is different for each of them!) holding that object's scrub state.
CInode has:
* last_scrub_stamp, last_scrub_version: the latest completed scrub on this inode (versions are relative to the parent directory version), which it flushes out to the inode_t whenever it's projected (or, eventually, on demand during the scrub)
* scrub_start_stamp, scrub_start_version: the time and version at which an in-progress scrub began
* all of the above for each dirfrag it contains
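
As a rough illustration of the CInode fields listed above, the following sketch keeps the "last completed" and "in progress" pairs together and repeats them per dirfrag. The names scrub_stamps_t, inode_scrub_info_t, start() and finish(), the in_progress flag, and the use of plain integers for stamps, versions and dirfrag ids are all assumptions made for the example. It also assumes that a completed scrub is recorded with the stamp and version at which it started.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using stamp_t = std::uint64_t;    // stand-in for a timestamp type
using vers_t = std::uint64_t;     // stand-in for a version type
using frag_id_t = std::uint32_t;  // stand-in for a dirfrag id

struct scrub_stamps_t {
  stamp_t last_scrub_stamp = 0;     // latest completed scrub
  vers_t last_scrub_version = 0;
  stamp_t scrub_start_stamp = 0;    // in-progress scrub, if any
  vers_t scrub_start_version = 0;
  bool in_progress = false;         // hypothetical helper flag

  void start(stamp_t now, vers_t version) {
    scrub_start_stamp = now;
    scrub_start_version = version;
    in_progress = true;
  }

  // Assumption: completion promotes the start values to "last scrub" values.
  void finish() {
    assert(in_progress);
    last_scrub_stamp = scrub_start_stamp;
    last_scrub_version = scrub_start_version;
    in_progress = false;
  }
};

struct inode_scrub_info_t {
  scrub_stamps_t inode;                            // the inode itself
  std::map<frag_id_t, scrub_stamps_t> dirfrags;    // same fields per dirfrag
};
```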

CDir has:
* scrub_start_version, scrub_start_time: the version and time at which this CDir started scrubbing its contents
* set<dentry_key_t> directories_[to_scrub|scrubbing|scrubbed]: child directories that it needs to scrub, is currently scrubbing, or has scrubbed
* set<dentry_key_t> others_[to_scrub|scrubbing|scrubbed]: as above, but for non-directory children

(We maintain separate sets for directories and non-directories because recursive scrubbing dirties each inode's scrub stamps, so we scrub subdirectories before regular files.)
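
One way to picture the CDir bookkeeping described above is the sketch below, which moves keys between the three sets and always hands out child directories before other children. Here dentry_key_t is replaced by std::string for brevity, and dir_scrub_info_t, next_to_scrub(), finished() and done() are hypothetical names rather than the real interface.

```cpp
#include <cassert>
#include <set>
#include <string>

using dentry_key = std::string;  // stand-in for dentry_key_t

struct dir_scrub_info_t {
  std::set<dentry_key> directories_to_scrub, directories_scrubbing, directories_scrubbed;
  std::set<dentry_key> others_to_scrub, others_scrubbing, others_scrubbed;

  // Pick the next child to start scrubbing, preferring directories.
  // Returns false when nothing is left to start.
  bool next_to_scrub(dentry_key* out) {
    return take(directories_to_scrub, directories_scrubbing, out) ||
           take(others_to_scrub, others_scrubbing, out);
  }

  // Record completion of a child previously returned by next_to_scrub().
  void finished(const dentry_key& k) {
    if (directories_scrubbing.erase(k)) {
      directories_scrubbed.insert(k);
    } else if (others_scrubbing.erase(k)) {
      others_scrubbed.insert(k);
    }
  }

  // True once nothing is waiting and nothing is in flight.
  bool done() const {
    return directories_to_scrub.empty() && directories_scrubbing.empty() &&
           others_to_scrub.empty() && others_scrubbing.empty();
  }

private:
  // Move the first key of `from` into `to` and report it through `out`.
  static bool take(std::set<dentry_key>& from, std::set<dentry_key>& to,
                   dentry_key* out) {
    if (from.empty()) return false;
    *out = *from.begin();
    from.erase(from.begin());
    to.insert(*out);
    return true;
  }
};
```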

CDentry is less well-defined but currently has:
* CDir *scrub_parent: the parent CDir whose scrub we're a member of
* bool scrub_recursive: we want to scrub all recursive descendants of this dentry
* bool scrub_children: we want to scrub all direct children of this dentry, regardless of scrub_recursive
* Context *on_finish: the callback to activate when scrubbing of this dentry finishes
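
Since the CDentry state is still in flux, the following is only a guess at how these four fields might interact. Dir stands in for CDir, a std::function stands in for the Context callback, and wants_children(), child_info() and finish() are hypothetical helpers. In particular, the rule that scrub_recursive propagates to children while scrub_children does not is an assumption, not something the design above specifies.

```cpp
#include <cassert>
#include <functional>
#include <utility>

struct Dir {};  // stand-in for CDir

struct dentry_scrub_info_t {
  Dir* scrub_parent = nullptr;         // parent CDir we're a scrub member of
  bool scrub_recursive = false;        // scrub all recursive descendants
  bool scrub_children = false;         // scrub direct children regardless
  std::function<void(int)> on_finish;  // stand-in for Context*; gets a result code
};

// Should the direct children of this dentry be scrubbed at all?
inline bool wants_children(const dentry_scrub_info_t& s) {
  return s.scrub_recursive || s.scrub_children;
}

// Assumed inheritance rule: recursion propagates down, scrub_children does not.
inline dentry_scrub_info_t child_info(const dentry_scrub_info_t& parent,
                                      Dir* parent_dir) {
  dentry_scrub_info_t child;
  child.scrub_parent = parent_dir;
  child.scrub_recursive = parent.scrub_recursive;
  child.scrub_children = false;
  return child;
}

// Fire the completion callback at most once.
inline void finish(dentry_scrub_info_t& s, int result) {
  if (s.on_finish) {
    auto cb = std::move(s.on_finish);
    s.on_finish = nullptr;
    cb(result);
  }
}
```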

So when the ScrubStack wants to find the next dentry to scrub, it:
* looks at the CDentry on top of the stack
* if it's a directory:
** finds the first dirfrag that has not been scrubbed since this scrub started
*** looks at that CDir's sets and gets the first dentry to scrub (if it's a directory, the ScrubStack starts over from it)
* if it's a file:
** well, that was easy
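
The dirfrag selection step can be illustrated with a short sketch. It assumes that a dirfrag "needs scrubbing" when its last completed scrub is older than the start of the current scrub, and it uses plain integers for dirfrag ids and timestamps. The names frag_scrub_t and first_unscrubbed_frag() are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical per-dirfrag record; only the field needed here is modelled.
struct frag_scrub_t {
  std::uint64_t last_scrub_stamp = 0;
};

// Find the first dirfrag (in id order) whose last scrub predates the start of
// the current scrub. Returns false if every dirfrag is already up to date.
inline bool first_unscrubbed_frag(
    const std::map<std::uint32_t, frag_scrub_t>& frags,
    std::uint64_t scrub_start_stamp, std::uint32_t* out) {
  for (const auto& kv : frags) {
    if (kv.second.last_scrub_stamp < scrub_start_stamp) {
      *out = kv.first;
      return true;
    }
  }
  return false;
}
```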
 
Because the MDCache::scrub_dentry() function uses our generic MDRequest infrastructure, we get a lot of the locking (and general mileage) out of that. Just implementing the described logic will get us through most of the tickets. What remains is:
# Appropriately handling non-auth data
## we need to write internal op wrapping that we can use to forward them
## and detect that they're non-auth and set up appropriate callbacks in the ScrubStack?
# This should deal well with stuff getting evicted from cache, but we need to handle migration of scrubbing hierarchies. Right now they're auth pinned so they can't be migrated, but for a continuously-running background process we don't really want to keep them pinned.
# (Just thought of this) Prevent scrubbing from moving dentries up the LRU
# Surfacing scrub errors to administrators in a useful way

h3. Work items

h4. Coding tasks

#4138: add functionality to verify disk data is consistent [with CInode]
#4139: add scrub_stamp infrastructure and a function to scrub a single folder
#4140: add infrastructure to perform a blocking scrub of all authoritative data [within a single MDS]
#4141: Implement non-blocking scrub
#4142: Implement cross-MDS scrubbing [i.e., initiate remote scrubs when required for a local scrub]
#4143: do not abort a scrub if part of its subtree gets migrated
#4144: do not abort a scrub if its hierarchy gets migrated

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3