Osd - opportunistic whole-object checksums » History » Version 1

Jessica Mack, 07/03/2015 08:39 PM

h1. Osd - opportunistic whole-object checksums

h3. Summary

Add a whole-object checksum (crc32c) to object_info_t.  Update it when we scrub an object.  Invalidate it when a partial write renders it obsolete.  When scrub finds an inconsistency, use it to determine which object(s) are damaged and which are correct.

h3. Owners

* Sage Weil (Red Hat)

h3. Interested Parties

* Name (Affiliation)

h3. Current Status

For various reasons RADOS does not track checksums for object data that is stored at rest:
* checksums need to be fine-grained in order to accommodate small overwrites
* small IOs may not be (checksum) block aligned
* there is presumably some performance penalty associated with updating checksum xattrs
* btrfs does this for you; historically we haven't wanted to duplicate functionality

(Note that for erasure coded objects, we *do* store checksums because we have restricted the set of allowed operations.)
However, periodically we scrub and *do* generate a whole-object checksum.  We send it over the wire to compare with other replicas, but we do not store it on disk.  There is one crc for byte data and one for omap data.

h3. Detailed Description

Add two sets of fields to object_info_t:
# uint32_t data_crc32c; bool data_crc32c_valid;
# uint32_t omap_crc32c; bool omap_crc32c_valid;
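
To make the two field pairs concrete, here is a minimal sketch. The struct name @object_info_sketch_t@ and its helper methods are hypothetical stand-ins, not the real object_info_t (which carries many more fields and its own encoding). The sketch assumes every data write is treated as a partial write, so it always invalidates the stored data crc.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, stripped-down stand-in for object_info_t. Only the four
// proposed fields are shown; the helper methods are illustrative assumptions.
struct object_info_sketch_t {
  uint32_t data_crc32c = 0;
  bool data_crc32c_valid = false;
  uint32_t omap_crc32c = 0;
  bool omap_crc32c_valid = false;

  // Record a freshly computed whole-object data checksum (e.g. from scrub).
  void set_data_crc(uint32_t crc) {
    data_crc32c = crc;
    data_crc32c_valid = true;
  }

  // Record a freshly computed omap checksum.
  void set_omap_crc(uint32_t crc) {
    omap_crc32c = crc;
    omap_crc32c_valid = true;
  }

  // A write to the byte data makes the stored data crc obsolete.
  // Assumption: zero-length writes never reach this point.
  void on_data_write(uint64_t length) {
    assert(length > 0);
    data_crc32c_valid = false;
  }

  // Any omap mutation makes the stored omap crc obsolete.
  void on_omap_update() { omap_crc32c_valid = false; }
};
```

Keeping a separate validity flag per checksum means a data overwrite leaves a still-correct omap crc usable, and vice versa.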

On scrub, compare the newly calculated checksum to the stored one (if present) and complain about any new inconsistency.
If there isn't a stored crc, update the object_info_t to store it.
Note that this will generate write IO during scrub for any recent objects.  Viewed in the aggregate, this is one additional IO per object before it becomes cold.  If the object is warm, the one additional IO isn't as significant.  If it is becoming cold, this is the last one.  If it is already cold, the crc is already stored and there is no additional load.
We can mitigate some of this cost by only storing the crc if the object age (as measured by now - mtime) is greater than some threshold.  This should be a tunable.
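
The scrub-time decision described above can be sketched as a single function. All names here (@scrub_action_t@, @stored_crc_t@, @scrub_check_crc@, @min_age_sec@) are hypothetical; @min_age_sec@ stands in for the proposed tunable, and the in-place update stands in for the repop that a real OSD would issue.

```cpp
#include <cassert>
#include <cstdint>

// Possible outcomes of checking one object's crc during scrub (hypothetical).
enum class scrub_action_t { MATCH, INCONSISTENT, STORE_CRC, SKIP_TOO_NEW };

// One stored checksum plus its validity flag, as proposed for object_info_t.
struct stored_crc_t {
  uint32_t crc = 0;
  bool valid = false;
};

// Compare a freshly computed crc against the stored one. If nothing is
// stored, persist the new crc only when the object is older than the
// threshold (age = now - mtime).
scrub_action_t scrub_check_crc(stored_crc_t& stored, uint32_t computed,
                               uint64_t now_sec, uint64_t mtime_sec,
                               uint64_t min_age_sec) {
  if (stored.valid) {
    return stored.crc == computed ? scrub_action_t::MATCH
                                  : scrub_action_t::INCONSISTENT;
  }
  // Treat an mtime in the future (clock skew) as age zero.
  const uint64_t age = now_sec > mtime_sec ? now_sec - mtime_sec : 0;
  if (age <= min_age_sec) {
    return scrub_action_t::SKIP_TOO_NEW;  // avoid write IO for hot objects
  }
  stored.crc = computed;  // a real OSD would generate a repop here
  stored.valid = true;
  assert(stored.valid && stored.crc == computed);
  return scrub_action_t::STORE_CRC;
}
```

With a threshold of 60 seconds, an object modified 10 seconds ago is skipped, while the same object scrubbed later gets its crc stored and is compared against it on every scrub after that.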

h3. Work items

h4. Coding tasks

# add object_info_t fields
# set object_info_t fields during scrub when they are not present (by generating a new repop)
# limit updates if object is too new (based on mtime)
# update scrub to compare current crc to stored crc and complain accordingly
# update repair logic to prefer replicas that match their stored crc
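
As an illustration of the last task, the sketch below picks an authoritative replica by preferring one whose freshly computed crc matches its own stored crc. The type @replica_scrub_info_t@ and the functions @is_damaged@ and @pick_authoritative@ are hypothetical names. The tie-breaking rule (first match wins) is an assumption, not something this blueprint specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-replica scrub result (hypothetical shape).
struct replica_scrub_info_t {
  int osd;                // id of the OSD holding this replica
  uint32_t computed_crc;  // crc calculated during this scrub
  uint32_t stored_crc;    // crc recorded in the replica's object_info_t
  bool stored_crc_valid;  // false if no crc has been stored yet
};

// A replica is provably damaged if it has a stored crc that no longer
// matches its current contents.
bool is_damaged(const replica_scrub_info_t& r) {
  return r.stored_crc_valid && r.stored_crc != r.computed_crc;
}

// Return the index of the first replica whose contents match its stored crc,
// or -1 if no replica can vouch for itself that way.
int pick_authoritative(const std::vector<replica_scrub_info_t>& replicas) {
  assert(!replicas.empty());  // assumption: scrub always has a replica
  for (std::size_t i = 0; i < replicas.size(); ++i) {
    const replica_scrub_info_t& r = replicas[i];
    if (r.stored_crc_valid && r.stored_crc == r.computed_crc) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

When the function returns -1, the stored checksums give no extra information and repair would have to fall back to whatever policy it uses today.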