Version 1 - History - Osd - opportunistic whole-object checksums - Ceph - Ceph


Jessica Mack

h1. Osd - opportunistic whole-object checksums


h3. Summary


Add a whole-object checksum (crc32c) to object_info_t. Update it when we scrub an object. Invalidate it when a partial write renders it obsolete. During scrub, when an inconsistency is found, use it to determine which object(s) are damaged and which are correct.


h3. Owners


* Sage Weil (Red Hat)


h3. Interested Parties


* Name (Affiliation)


h3. Current Status


For various reasons RADOS does not track checksums for object data that is stored at rest:

* checksums need to be fine-grained in order to accommodate small overwrites
* small IOs may not be (checksum) block aligned
* there is presumably some performance penalty associated with updating checksum xattrs
* btrfs does this for you; historically we haven't wanted to duplicate functionality


(Note that for erasure coded objects, we *do* store checksums because we have restricted the set of allowed operations.)


However, periodically we scrub and *do* generate a whole-object checksum.  We send it over the wire to compare with other replicas, but we do not store it on disk.  There is one crc for byte data and one for omap data.
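
As an illustration of how a whole-object crc can be accumulated while the object is read in chunks, here is a minimal sketch of the standard CRC-32C (Castagnoli) algorithm. The function names @crc32c_update@ and @crc32c_finish@ are hypothetical and are not Ceph's own crc32c helpers, whose seed and finalization conventions may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Standard CRC-32C (Castagnoli), bitwise form, reflected polynomial 0x82F63B78.
// Hypothetical helper names; not taken from the Ceph source tree.
// Start with state = 0xFFFFFFFF and feed the object's bytes chunk by chunk.
uint32_t crc32c_update(uint32_t state, const unsigned char* data, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    state ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right by one; fold in the polynomial when the low bit was set.
      state = (state & 1u) ? (state >> 1) ^ 0x82F63B78u : (state >> 1);
    }
  }
  return state;
}

// Final XOR that turns the running state into the checksum value.
uint32_t crc32c_finish(uint32_t state) {
  return state ^ 0xFFFFFFFFu;
}
```

Because the state is carried between calls, a scrub can checksum a large object without holding all of it in memory at once.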


h3. Detailed Description


Add two sets of fields to object_info_t:

# uint32_t data_crc32c; bool data_crc32c_valid;
# uint32_t omap_crc32c; bool omap_crc32c_valid;
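
The two field pairs, together with the invalidation rule from the summary, could look roughly like the sketch below. @ObjectInfoSketch@ is a simplified stand-in for object_info_t, and the method names are assumptions made for illustration only.

```cpp
#include <cstdint>

// Simplified stand-in for object_info_t showing only the proposed fields.
// The struct name and the methods are hypothetical.
struct ObjectInfoSketch {
  uint32_t data_crc32c = 0;
  bool data_crc32c_valid = false;
  uint32_t omap_crc32c = 0;
  bool omap_crc32c_valid = false;

  // Record a whole-object data crc, e.g. one computed by scrub.
  void set_data_crc(uint32_t crc) {
    data_crc32c = crc;
    data_crc32c_valid = true;
  }

  // Record an omap crc, e.g. one computed by scrub.
  void set_omap_crc(uint32_t crc) {
    omap_crc32c = crc;
    omap_crc32c_valid = true;
  }

  // A partial data write makes the stored data crc obsolete.
  void on_partial_data_write() { data_crc32c_valid = false; }

  // An omap change makes the stored omap crc obsolete.
  void on_omap_update() { omap_crc32c_valid = false; }
};
```

Keeping a separate valid flag for each crc means a data overwrite does not throw away a still-correct omap crc, and vice versa.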


On scrub, compare the newly calculated checksum to the stored one (if present) and complain about any new inconsistency.


If there isn't a stored crc, update the object_info_t to store it.
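
The compare-or-store decision above can be sketched as follows. @StoredCrc@, @ScrubOutcome@, and @scrub_check@ are hypothetical names; in the real OSD the store step would have to go through a replicated update rather than a direct assignment.

```cpp
#include <cstdint>

// Hypothetical container for one stored crc and its valid flag.
struct StoredCrc {
  uint32_t value = 0;
  bool valid = false;
};

enum class ScrubOutcome { Match, Mismatch, NewlyStored };

// Compare the crc computed by scrub against the stored one, if any.
// With no stored crc, remember the computed value for future scrubs.
ScrubOutcome scrub_check(StoredCrc& stored, uint32_t computed) {
  if (stored.valid) {
    return stored.value == computed ? ScrubOutcome::Match
                                    : ScrubOutcome::Mismatch;
  }
  // Assumption: shown as a plain assignment; a real OSD would persist this
  // by issuing a write to all replicas.
  stored.value = computed;
  stored.valid = true;
  return ScrubOutcome::NewlyStored;
}
```

A @Mismatch@ result is the case where scrub should complain, since the data no longer agrees with what was recorded earlier.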


Note that this will generate write IO during scrub for any recent objects.  Viewed in the aggregate, this is one additional IO per object before it becomes cold.  If the object is warm, the one additional IO isn't as significant.  If it is becoming cold, this is the last one.  If it is already cold, the crc is already stored and there is no additional load.


We can mitigate some of this cost by only storing the crc if the object age (as measured by now - mtime) is greater than some threshold.  This should be a tunable.
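
A minimal sketch of the age check, assuming the threshold is expressed in seconds. The function name @old_enough_to_store_crc@ and the @min_age@ parameter are hypothetical; no existing config option is implied.

```cpp
#include <chrono>

using Clock = std::chrono::system_clock;

// Returns true when the object has gone unmodified for longer than min_age,
// i.e. when (now - mtime) exceeds the threshold.
// min_age stands in for the proposed tunable; its name is an assumption.
bool old_enough_to_store_crc(Clock::time_point mtime,
                             Clock::time_point now,
                             std::chrono::seconds min_age) {
  return (now - mtime) > min_age;
}
```

Scrub would call this before deciding to write a new crc, so that objects still being modified do not pay for an update that the next write would invalidate anyway.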


h3. Work items


h4. Coding tasks


# add object_info_t fields
# set object_info_t fields during scrub when they are not present (by generating a new repop)
# limit updates if object is too new (based on mtime)
# update scrub to compare current crc to stored crc and complain accordingly
# update repair logic to prefer replicas that match their stored crc
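
The last task, preferring replicas that match their stored crc, might be approached along the lines of this sketch. @ReplicaScrubSketch@ and @preferred_repair_sources@ are hypothetical names, and the actual repair logic would need to handle more cases (for example, no replica having a valid stored crc).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-replica scrub result.
struct ReplicaScrubSketch {
  int osd_id;
  uint32_t computed_crc;   // crc of the data as read during scrub
  uint32_t stored_crc;     // crc recorded in the replica's object info
  bool stored_crc_valid;   // false if no crc has been stored
};

// Returns the ids of replicas whose data still matches their own stored crc.
// These are the candidates to prefer as the source for repair.
std::vector<int> preferred_repair_sources(
    const std::vector<ReplicaScrubSketch>& replicas) {
  std::vector<int> preferred;
  for (const auto& r : replicas) {
    if (r.stored_crc_valid && r.computed_crc == r.stored_crc) {
      preferred.push_back(r.osd_id);
    }
  }
  return preferred;
}
```

A replica without a valid stored crc is neither preferred nor condemned by this check; it simply provides no evidence either way.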
