Librados - expose checksums
Allow librados write operations to pass checksum metadata along with data buffers so that this information can follow the data all the way down the stack for verification at various layers. This will close one of several gaps in providing true end-to-end data integrity verification.
- Sage Weil (Red Hat)
We checksum data as it passes over the wire, and verify it on the other end (to detect network bit flips that TCP's checksumming misses).
We periodically read data off disk, calculate a new checksum, and compare it to replicas (to detect bit rot).
We use crc32c throughout.
Detailed Description
We should define a generic, extensible way to describe the checksum of a buffer or data extent. We use crc32c today, but we should avoid locking ourselves into a single scheme for all time.
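One way to picture such a descriptor is a small tagged structure: an algorithm identifier, the extent it covers, and a value field wide enough for future schemes. The sketch below is illustrative only. `example_csum_t`, `EXAMPLE_CSUM_CRC32C`, and the helper functions are hypothetical names, not part of librados. The bitwise CRC-32C uses the standard convention (initial value and final XOR of 0xFFFFFFFF), which may differ from the seed convention Ceph uses internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical algorithm tags; a new scheme would get a new value. */
typedef enum {
    EXAMPLE_CSUM_NONE   = 0,
    EXAMPLE_CSUM_CRC32C = 1
} example_csum_type_t;

/* Hypothetical descriptor for the checksum of one data extent. */
typedef struct {
    uint32_t type;       /* one of example_csum_type_t */
    uint64_t offset;     /* start of the covered extent within the buffer */
    uint64_t length;     /* number of bytes covered */
    uint8_t  value_len;  /* bytes of value[] in use */
    uint8_t  value[32];  /* room for wider digests later */
} example_csum_t;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t example_crc32c(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;

    assert(buf != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Fill a descriptor with the CRC-32C of buf[0..len). */
void example_csum_init_crc32c(example_csum_t *c, const void *buf, size_t len)
{
    uint32_t crc = example_crc32c(buf, len);

    memset(c, 0, sizeof(*c));
    c->type = EXAMPLE_CSUM_CRC32C;
    c->offset = 0;
    c->length = len;
    c->value_len = (uint8_t)sizeof(crc);
    /* Store the value little-endian so the bytes do not depend on the host. */
    for (int i = 0; i < 4; i++)
        c->value[i] = (uint8_t)(crc >> (8 * i));
}
```

Keeping the algorithm tag and value length separate from the value itself is what leaves room for a different scheme later: a consumer that does not recognize the tag can still skip the descriptor.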
We should make new write call variants (e.g., rados_write2, rados_write_full2, etc.) that optionally accept checksum metadata for the data buffer being passed in. The same goes for the read operations: they should pass out the checksum metadata along with the data.
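To make the shape of such variants concrete, here is a minimal sketch against an in-memory stand-in for an object. Everything in it is an assumption for illustration: `example_write_full2`, `example_read2`, and `example_object_t` are hypothetical, the checksum is a bare crc32c value rather than a full metadata structure, and no real librados call is involved. A NULL checksum pointer means "no checksum supplied", which keeps the argument optional.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_OBJ_MAX 4096

/* Stand-in for an object; a real call would talk to the cluster. */
typedef struct {
    uint8_t data[EXAMPLE_OBJ_MAX];
    size_t  len;
} example_object_t;

/* Bitwise CRC-32C (Castagnoli), standard init/final XOR of 0xFFFFFFFF. */
uint32_t example_crc32c(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Hypothetical write variant. crc may be NULL (no checksum supplied).
 * When supplied, it is checked before the object is modified.
 */
int example_write_full2(example_object_t *obj, const void *buf, size_t len,
                        const uint32_t *crc)
{
    assert(obj != NULL && buf != NULL);
    if (len > EXAMPLE_OBJ_MAX)
        return -EFBIG;
    if (crc != NULL && example_crc32c(buf, len) != *crc)
        return -EBADMSG;  /* data does not match the caller's checksum */
    memcpy(obj->data, buf, len);
    obj->len = len;
    return 0;
}

/*
 * Hypothetical read variant. Returns the number of bytes read and,
 * if crc_out is non-NULL, passes out the checksum of those bytes.
 */
int example_read2(const example_object_t *obj, void *buf, size_t buflen,
                  uint32_t *crc_out)
{
    size_t n;

    assert(obj != NULL && buf != NULL);
    n = obj->len < buflen ? obj->len : buflen;
    memcpy(buf, obj->data, n);
    if (crc_out != NULL)
        *crc_out = example_crc32c(buf, n);
    return (int)n;
}
```

In this sketch a mismatching checksum rejects the write before any data is stored, and a caller that passes NULL gets the existing behavior unchanged.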
Initially, these calls will not do much with the metadata.
Eventually, we can extend the internal protocol (the messenger on-wire protocol and/or the rados MOSDOp[Reply] encodings) to pass this data over the wire to the OSD and back.
In some cases, we can use it to populate the checksum fields in object_info_t (see other blueprint).
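As a rough illustration of carrying checksum metadata alongside a payload in an extensible way, the sketch below uses a simple type/length/value layout. This is an assumption made for the example only: it is not Ceph's message encoding, and `example_csum_encode` and `example_csum_decode` are hypothetical helpers. The point it shows is that a length field lets a receiver skip a checksum type it does not understand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical wire layout: [type: 1 byte][value_len: 1 byte][value bytes].
 * Returns the number of bytes written, or -1 if out is too small.
 */
int example_csum_encode(uint8_t type, const uint8_t *value, uint8_t value_len,
                        uint8_t *out, size_t out_len)
{
    assert(out != NULL);
    assert(value != NULL || value_len == 0);
    if (out_len < (size_t)value_len + 2)
        return -1;
    out[0] = type;
    out[1] = value_len;
    if (value_len > 0)
        memcpy(out + 2, value, value_len);
    return (int)value_len + 2;
}

/*
 * Decode one record. *value points into the input buffer (no copy).
 * Returns the number of bytes consumed, or -1 if the input is truncated.
 */
int example_csum_decode(const uint8_t *in, size_t in_len, uint8_t *type,
                        const uint8_t **value, uint8_t *value_len)
{
    assert(in != NULL && type != NULL && value != NULL && value_len != NULL);
    if (in_len < 2)
        return -1;              /* header missing */
    if (in_len - 2 < in[1])
        return -1;              /* value cut short */
    *type = in[0];
    *value_len = in[1];
    *value = in + 2;
    return (int)in[1] + 2;
}
```

Because the decoder returns the number of bytes consumed regardless of the type, a receiver can step over records for checksum schemes it does not implement.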
We can have options to control which layers of the stack recalculate the checksum over the current data buffer for verification against the provided checksum. We probably do not want to do this at every layer (too expensive), but should do it at least once before writing the data to disk, ideally at a point that avoids propagating a bit flip to multiple replicas.
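A per-layer option of this kind could be modeled as a bitmask, with each layer checking whether its bit is set before paying for a recalculation. The layer names and the `example_verify_at` helper below are hypothetical, chosen only to illustrate the idea; the crc32c routine is the standard bitwise form and not necessarily the seed convention Ceph uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layer flags; the names are illustrative only. */
enum {
    EXAMPLE_VERIFY_CLIENT      = 1u << 0, /* in the client library, before sending */
    EXAMPLE_VERIFY_PRIMARY     = 1u << 1, /* on the primary, before replicating */
    EXAMPLE_VERIFY_BEFORE_DISK = 1u << 2  /* just before the data is written to disk */
};

/* Bitwise CRC-32C (Castagnoli), standard init/final XOR of 0xFFFFFFFF. */
uint32_t example_crc32c(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Verify buf against expected only if this layer is enabled.
 * Returns 1 if verified, 0 if skipped (layer disabled),
 * and -EBADMSG if the recalculated checksum does not match.
 */
int example_verify_at(unsigned layer, unsigned enabled, const void *buf,
                      size_t len, uint32_t expected)
{
    assert(buf != NULL || len == 0);
    if ((enabled & layer) == 0)
        return 0;  /* skip the recalculation at this layer */
    return example_crc32c(buf, len) == expected ? 1 : -EBADMSG;
}
```

With only `EXAMPLE_VERIFY_PRIMARY` enabled, a bit flip introduced before the data reaches the primary is caught once, before it could be copied to replicas, and the other layers skip the recalculation.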
Other things we might do with this later:
- pass checksum info through the ObjectStore interface
- add a rados class (or native rados op) to calculate/verify checksum info on the server side ("fixity check")
- define extensible checksum metadata structure
- define new librados read/write calls that include optional checksum metadata (C and C++ interfaces)
- extend MOSDOp and MOSDOpReply to pass checksum metadata for the data payload
- when this metadata is present, allow the messenger to skip the crc it currently computes for the data payload