OSD - opportunistic whole-object checksums
Add a whole-object checksum (crc32c) to object_info_t. Update it when we scrub an object. Invalidate it when a partial write renders it obsolete. When scrub finds an inconsistency, use it to determine which object(s) are damaged and which are correct.
- Sage Weil (Red Hat)
Current Status
For various reasons, RADOS does not track checksums for object data that is stored at rest:
- checksums need to be fine-grained in order to accommodate small overwrites
- small IOs may not be (checksum) block aligned
- there is presumably some performance penalty associated with updating checksum xattrs
- btrfs does this for you; historically we haven't wanted to duplicate functionality
(Note that for erasure coded objects, we do store checksums because we have restricted the set of allowed operations.)
However, periodically we scrub and do generate a whole-object checksum. We send it over the wire to compare with other replicas, but we do not store it on disk. There is one crc for byte data and one for omap data.
Detailed Description
Add two sets of fields to object_info_t:
- uint32_t data_crc32c; bool data_crc32c_valid;
- uint32_t omap_crc32c; bool omap_crc32c_valid;
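As a rough illustration of the fields above, the following sketch shows how they might sit together with set and invalidate helpers. The struct name `object_info_sketch` and all four helper methods are hypothetical and exist only for this example; only the four field names come from the proposal, and the real object_info_t carries many other members and its own encoding logic.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the relevant part of object_info_t.
// Only the four data members are taken from the proposal.
struct object_info_sketch {
  uint32_t data_crc32c = 0;
  bool data_crc32c_valid = false;
  uint32_t omap_crc32c = 0;
  bool omap_crc32c_valid = false;

  // Record a digest computed during scrub (helper names are assumptions).
  void set_data_crc(uint32_t crc) {
    data_crc32c = crc;
    data_crc32c_valid = true;
  }
  void set_omap_crc(uint32_t crc) {
    omap_crc32c = crc;
    omap_crc32c_valid = true;
  }

  // A partial write makes the stored digest obsolete, so only the
  // valid flag needs to be cleared; the stale value is then ignored.
  void invalidate_data_crc() { data_crc32c_valid = false; }
  void invalidate_omap_crc() { omap_crc32c_valid = false; }
};
```

Keeping a separate valid flag per digest means a byte-data overwrite does not have to discard a still-correct omap digest, and vice versa.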
On scrub, compare the newly calculated checksum to the stored one (if present) and complain about any new inconsistency.
If there isn't a stored crc, update the object_info_t to store it.
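The per-object decision during scrub could be sketched roughly as follows. The enum `ScrubCrcAction` and the function `check_scrub_crc` are hypothetical names chosen for this example and are not part of the existing scrub code; the sketch only captures the three outcomes described above.

```cpp
#include <cstdint>

// Hypothetical outcome of comparing a freshly computed digest with the
// one recorded in object_info_t.
enum class ScrubCrcAction {
  Match,     // stored digest present and equal: nothing to do
  Mismatch,  // stored digest present but different: report inconsistency
  StoreNew   // no valid stored digest: record the computed one
};

ScrubCrcAction check_scrub_crc(bool stored_valid, uint32_t stored_crc,
                               uint32_t computed_crc) {
  if (!stored_valid) {
    // stored_crc is meaningless when the valid flag is clear.
    return ScrubCrcAction::StoreNew;
  }
  return stored_crc == computed_crc ? ScrubCrcAction::Match
                                    : ScrubCrcAction::Mismatch;
}
```

The same check would presumably run twice per object, once for the byte-data digest and once for the omap digest.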
Note that this will generate write IO during scrub for any recent objects. Viewed in the aggregate, this is one additional IO per object before it becomes cold. If the object is warm, the one additional IO isn't as significant. If it is becoming cold, this is the last one. If it is already cold, the crc is already stored and there is no additional load.
We can mitigate some of this cost by only storing the crc if the object age (as measured by now - mtime) is greater than some threshold. This should be a tunable.
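A minimal sketch of the age check, assuming the threshold is expressed in seconds. The function name `should_store_crc` and the parameter `min_age` are hypothetical; the proposal does not name the tunable.

```cpp
#include <chrono>

using Clock = std::chrono::system_clock;

// Returns true when the object is old enough (now - mtime > min_age)
// for scrub to persist its digest. min_age stands in for the proposed
// tunable, whose real name and units are not specified.
bool should_store_crc(Clock::time_point now, Clock::time_point mtime,
                      std::chrono::seconds min_age) {
  // An mtime in the future yields a negative age, which never exceeds
  // a non-negative threshold, so such objects are skipped.
  return (now - mtime) > min_age;
}
```

With a threshold of one hour, for example, an object last modified two hours ago would have its digest stored on the next scrub, while one modified ten seconds ago would not.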
- add object_info_t fields
- set object_info_t fields during scrub when they are not present (by generating a new repop)
- possibly limit updates if the object is too new (based on mtime)
- update scrub to compare current crc to stored crc and complain accordingly
- update repair logic to prefer replicas that match their stored crc
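The last item, preferring replicas that match their stored crc, might look something like the sketch below. The struct `replica_digest` and the function `pick_authoritative` are hypothetical and do not reflect the existing repair code; the sketch only selects a candidate and leaves the fallback, when no replica is self-consistent, to whatever logic repair already uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-replica scrub result.
struct replica_digest {
  uint32_t computed_crc;  // digest calculated from the data on disk
  bool stored_valid;      // whether object_info_t holds a valid digest
  uint32_t stored_crc;    // digest recorded in object_info_t
};

// Returns the index of the first replica whose on-disk data matches its
// own stored digest, or -1 if no replica can vouch for itself.
int pick_authoritative(const std::vector<replica_digest>& replicas) {
  for (std::size_t i = 0; i < replicas.size(); ++i) {
    const replica_digest& r = replicas[i];
    if (r.stored_valid && r.stored_crc == r.computed_crc) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

A replica without a valid stored digest is never chosen by this sketch, since it offers no independent evidence that its data is intact.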