h1. +*Adding Data-At-Rest compression to EC pools*+ 

*Summary*
The introduction of Erasure Coding (EC) pools to Ceph brings storage space savings at the expense of additional CPU utilization. Compression is quite a similar feature from this point of view, and the two can easily be coupled to gain even more space savings transparently to users. To do that, compression support can be implemented at the EC pool layer.
The compression modules should be pluggable, similar to the Erasure Coding ones. This allows different compression methods to be used and hardware optimizations to be utilized. Ceph administrators should be able to enable/disable compression on a per-pool basis using the regular CLI.
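As an illustration only of what a pluggable compression module could look like - the class and method names below are hypothetical and are not taken from the Ceph code base:
<pre>
// Hypothetical sketch of a pluggable compressor interface (illustrative only).
#include <cstdint>
#include <string>
#include <vector>

class Compressor {
public:
  virtual ~Compressor() = default;
  // Plugin name reported to the administrator, e.g. "zlib" or "snappy".
  virtual std::string name() const = 0;
  // Compress 'in' into 'out'; returns 0 on success, a negative error code otherwise.
  virtual int compress(const std::vector<uint8_t>& in, std::vector<uint8_t>& out) = 0;
  // Decompress 'in' into 'out'; returns 0 on success, a negative error code otherwise.
  virtual int decompress(const std::vector<uint8_t>& in, std::vector<uint8_t>& out) = 0;
};
</pre>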

*Owners*
Igor Fedotov (Mirantis)
Alyona Kiseleva (Mirantis)
Name

*Interested Parties*
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

*Current Status*
There is no Data-At-Rest compression support in Ceph at the moment.

*Detailed Description*
Currently EC pools handle data append and random read requests only. Such requests can be issued either by the Ceph cache tier or by Ceph users directly. At the same time the EC pool implementation prohibits random write requests, which is a plus from the compression support perspective, since such requests would bring a great deal of complexity to compression introduction.
Thus the suggested approach is to compress appended data blocks at the EC pool level before applying erasure coding. Correspondingly, decompression takes place after erasure decoding.
Schematically this looks like:
<pre>
<data block to append> -> [Compressor] -> <Compressed data block> -> [Erasure Coder] -> <Erasure Coded shards> -> [File Stores]
<read data block> <- [Decompressor] <- <Compressed data block> <- [Erasure Decoder] <- <Erasure Coded shards> <- [File Stores]
</pre>
Please note that the "data block to append" offset and size tend to be aligned with the stripe width (4K by default for an EC pool). The exception is the last block, which can have arbitrary size; it is padded by the EC pool though.
At the same time the "read data block" offset and size can be completely arbitrary.
A property of any compression algorithm is that it operates on data blocks rather than specific bytes, i.e. one needs to decompress a block completely to access specific bytes stored within it.
Evidently this causes some overhead when the data to read isn't aligned with the originally compressed blocks. There is probably no way to eliminate this issue completely - the only complete solution would be to prohibit clients from reading data in an unaligned manner, which is probably unacceptable. We can, however, try to reduce such overhead by limiting the size of the data blocks processed by the compressor, i.e. by splitting incoming data into blocks of limited size. The most straightforward solution is to use the stripe width for that, but this may degrade the compression ratio in some cases. At the same time it's inappropriate to keep unprocessed data between two append requests in order to assemble a larger block, as that considerably complicates the handling and brings consistency/reliability issues. Thus the suggestion is to take the data block provided by the user (it is stripe-aligned) and split it into smaller chunks whose size is a multiple of the stripe width. Original chunk offsets and sizes thereby remain stripe-aligned, but chunk sizes aren't fixed. Each chunk is then compressed and erasure coded independently.

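As a minimal sketch of that chunking step, assuming an illustrative stripe width of 4K and a chunk size of 256K (the helper name and constants are hypothetical, not EC pool code):
<pre>
// Sketch: split a stripe-aligned appended block into compression chunks.
// The chunk size is a multiple of the stripe width; the last chunk may be shorter.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

constexpr uint64_t STRIPE_WIDTH = 4096;               // illustrative value
constexpr uint64_t CHUNK_SIZE   = 64 * STRIPE_WIDTH;  // 256K, as in the sample below

// Returns <original offset, length> pairs describing the chunks of one append.
std::vector<std::pair<uint64_t, uint64_t>>
split_into_chunks(uint64_t append_offset, uint64_t append_length) {
  std::vector<std::pair<uint64_t, uint64_t>> chunks;
  for (uint64_t pos = 0; pos < append_length; pos += CHUNK_SIZE) {
    chunks.emplace_back(append_offset + pos,
                        std::min(CHUNK_SIZE, append_length - pos));
  }
  return chunks;
}

int main() {
  // A 1MB append starting at offset 0 yields four 256K chunks.
  for (auto [off, len] : split_into_chunks(0, 1048576))
    std::cout << "chunk @ " << off << ", length " << len << "\n";
}
</pre>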
Another thing to mention is the need to map the <offset, length> pair of a read request to the corresponding chunk offsets <start_chunk_original_offset, start_chunk_compressed_offset, ending_chunk_compressed_offset> in the compressed data set. The compressed offsets are required to retrieve the data (EC shards) from the File Store entities.
In fact one needs a MapFn(offset, next_chunk_flag) -> <chunk_offset, chunk_coffset> function, where 'offset' is an original data offset, 'next_chunk_flag' specifies whether information for the corresponding chunk or for the next one is to be returned, 'chunk_offset' is the original data offset that the corresponding or next chunk contains, and 'chunk_coffset' is that chunk's offset in the compressed data set. The mentioned <offset, length> pair can easily be converted to the <start_chunk_original_offset, start_chunk_compressed_offset, ending_chunk_compressed_offset> tuple using the following calls: start_chunk_original_offset, start_chunk_compressed_offset = MapFn(offset, false); ending_chunk_compressed_offset = MapFn(offset + length, true).chunk_coffset.
Once mapped, one needs to retrieve the chunks in the <start_chunk_compressed_offset, ending_chunk_compressed_offset> range from the File Stores, decompress them and extract the desired data subset, which is <offset - start_chunk_original_offset, length>.
To provide the desired mapping we can use a subset of <key, value> pairs (let's name it Compression Info) from the object xattrs and update them on each append. Read request handling uses this Compression Info to perform the mapping.
Compression Info can have the following structure:
<pre>
Multiple <chunk_key, chunk_value> records, one per chunk:
chunk_key = CompressionInfo prefix + Original Chunk Offset
chunk_value = Compression Method indicator + Compressed Chunk Offset
A single <key, value> record (let's name it the Master Record) to track the total original/compressed data sizes and other compression information if any (e.g. a preferred compression method or a disable-compression indication flag):
key = CompressionInfo prefix
value = OriginalDataSize_current + CompressedDataSize_current + flags
</pre>

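A minimal sketch of how this Compression Info could be represented in memory once decoded from the xattrs; the type and field names are assumptions for illustration, and the on-disk encoding is not specified here:
<pre>
// Sketch of an in-memory view of the Compression Info xattrs (illustrative only).
#include <cstdint>
#include <map>
#include <string>

// One record per chunk, keyed by the chunk's offset in the original data.
struct ChunkRecord {
  std::string method;   // compression method indicator, e.g. "zlib" or "none"
  uint64_t coffset;     // chunk offset in the compressed data set
};

// Master Record: running totals plus optional flags.
struct MasterRecord {
  uint64_t original_size = 0;    // OriginalDataSize_current
  uint64_t compressed_size = 0;  // CompressedDataSize_current
  uint32_t flags = 0;            // e.g. a disable-compression indication
};

struct CompressionInfo {
  MasterRecord master;
  std::map<uint64_t, ChunkRecord> chunks;  // ordered by original chunk offset
};
</pre>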
Here is a CompressionInfo sample: there was a single 1MB append followed by additional 128K and 1K ones, the chunk size is 256K (probably too large for real life), each chunk from the first append operation was compressed to 1K, the others turned out to be incompressible, and the CompressionInfo prefix is "@CI":
<pre>
@CI_0 = zlib / 0                       //original length = 256K
@CI_262144 = zlib / 1024               //original length = 256K
@CI_524288 = zlib / 2048               //original length = 256K
@CI_786432 = zlib / 3072               //original length = 256K
@CI_1048576 = none / 4096              //original length = 128K
@CI_1179648 = none / 135168            //original length = 1K, compressed offset = 4K + 128K
@CI = 1180672 / 136192                 // = 1179648 + 1024 / = 135168 + 1024

MapFn results will be:
MapFn(0, false) = <0, 0>
MapFn(0, true) = <262144, 1024>

MapFn(1, false) = <0, 0>
MapFn(4096, true) = <262144, 1024>

MapFn(262144, false) = <262144, 1024>
MapFn(262144, true) = <524288, 2048>

MapFn(262145, false) = <262144, 1024>
MapFn(262145, true) = <524288, 2048>

MapFn(1179649, false) = <1179648, 135168>
MapFn(1179649, true) = <1180672, 136192>
</pre>
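For illustration, a sketch of MapFn over an ordered map of chunk records, populated with the sample values above; the structure and helper signature are assumptions, not the actual implementation:
<pre>
// Sketch of MapFn over an ordered map of chunk records (sample data above).
#include <cassert>
#include <cstdint>
#include <map>

struct Mapping { uint64_t chunk_offset; uint64_t chunk_coffset; };

// key: original chunk offset, value: chunk offset in the compressed data set
using ChunkMap = std::map<uint64_t, uint64_t>;

Mapping MapFn(const ChunkMap& chunks,
              uint64_t original_size, uint64_t compressed_size,  // Master Record totals
              uint64_t offset, bool next_chunk) {
  // First record with key > offset; the record right before it holds 'offset'.
  auto it = chunks.upper_bound(offset);
  if (!next_chunk)
    --it;                                       // the chunk containing 'offset'
  if (it == chunks.end())
    return {original_size, compressed_size};    // past the last chunk
  return {it->first, it->second};
}

int main() {
  ChunkMap chunks = {
    {0, 0}, {262144, 1024}, {524288, 2048}, {786432, 3072},
    {1048576, 4096}, {1179648, 135168},
  };
  const uint64_t osize = 1180672, csize = 136192;  // Master Record

  assert(MapFn(chunks, osize, csize, 0, false).chunk_coffset == 0);
  assert(MapFn(chunks, osize, csize, 0, true).chunk_offset == 262144);
  assert(MapFn(chunks, osize, csize, 262145, false).chunk_coffset == 1024);
  assert(MapFn(chunks, osize, csize, 1179649, true).chunk_coffset == 136192);
  return 0;
}
</pre>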
Additionally we might need to introduce another helper function that uses CompressionInfo to return chunk sizes (both the original and the compressed ones):
<pre>
ChunkSizesFn( offset ) -> <original chunk size, compressed chunk size>
</pre>
where 'offset' specifies the original offset of data that the desired chunk keeps, "original chunk size" is the amount of uncompressed (original) data that the chunk contains, and "compressed chunk size" is the amount of compressed data that the chunk contains.
For the sample above the ChunkSizesFn() results are:
<pre>
ChunkSizesFn(0) = <262144, 1024>
ChunkSizesFn(1) = <262144, 1024>

ChunkSizesFn(262144) = <262144, 1024>
ChunkSizesFn(262145) = <262144, 1024>

ChunkSizesFn(1179647) = <131072, 131072>
ChunkSizesFn(1179649) = <1024, 1024>
</pre>

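Along the same lines, a sketch of ChunkSizesFn: chunk sizes are derived from the distance to the next chunk record, falling back to the Master Record totals for the last chunk (again an illustrative sketch, not actual code):
<pre>
// Sketch of ChunkSizesFn over the same ordered chunk map (sample data above).
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <utility>

using ChunkMap = std::map<uint64_t, uint64_t>;  // original offset -> compressed offset

// Returns <original chunk size, compressed chunk size> for the chunk holding 'offset'.
std::pair<uint64_t, uint64_t>
ChunkSizesFn(const ChunkMap& chunks,
             uint64_t original_size, uint64_t compressed_size,  // Master Record totals
             uint64_t offset) {
  auto next = chunks.upper_bound(offset);  // record after the chunk holding 'offset'
  auto cur = std::prev(next);              // the chunk holding 'offset'
  uint64_t next_off  = (next == chunks.end()) ? original_size   : next->first;
  uint64_t next_coff = (next == chunks.end()) ? compressed_size : next->second;
  return {next_off - cur->first, next_coff - cur->second};
}

int main() {
  ChunkMap chunks = {
    {0, 0}, {262144, 1024}, {524288, 2048}, {786432, 3072},
    {1048576, 4096}, {1179648, 135168},
  };
  assert(ChunkSizesFn(chunks, 1180672, 136192, 1).first == 262144);
  assert(ChunkSizesFn(chunks, 1180672, 136192, 1179647).second == 131072);
  assert(ChunkSizesFn(chunks, 1180672, 136192, 1179649).first == 1024);
  return 0;
}
</pre>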
Schematically, append request handling can be done as follows (everything is performed at the EC pool layer; a sketch of the per-chunk processing follows the list):

1) The append operation handler at the EC pool receives an <object_id, data offset, data length, data octets> tuple that specifies the destination object and the appended data block.
2) The append handler retrieves the object xattrs related to compression info (actually the Master Record is enough). The retrieved xattrs are temporarily preserved to be able to roll back if needed. This is entirely similar to the Hash Info handling performed by the EC pool now.
3) The handler virtually splits the incoming data block into chunks and compresses each chunk independently.
4) Each compressed chunk is padded to the stripe width. If there is no space saving for a specific chunk, it's marked as uncompressed.
5) Each chunk is passed to the Erasure Coding function, which produces several shards.
6) For each chunk:
  a) the handler adds a new CompressionInfo chunk record to the xattrs (to an internal in-memory container, not the actual storage, at this point). The key contains the chunk data offset, the value - the applied compression method (including the uncompressed option) and the offset in the compressed data set (the CompressedDataSize_current value).
  b) the handler updates the <OriginalDataSize_current, CompressedDataSize_current> values of the Master Record: OriginalDataSize_current += chunk data length; CompressedDataSize_current += compressed chunk data size. This takes place in memory at this point - the actual storage update happens along with shard saving.
7) All shards, along with the appended/updated xattrs, are saved to the respective file stores. This is the existing shards/hash info saving machinery; the only thing we need to add is the provision of the additional xattrs.

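A sketch of the per-chunk processing from steps 3-6; the compressor callback and the padding helper are hypothetical placeholders rather than real Ceph APIs, and the Compression Info totals track unpadded compressed sizes, as in the sample above:
<pre>
// Sketch of per-chunk append processing (steps 3-6); illustrative only.
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

constexpr uint64_t STRIPE_WIDTH = 4096;  // illustrative value

struct ChunkRecord { std::string method; uint64_t coffset; };

struct CompressionInfo {
  uint64_t original_size = 0;              // OriginalDataSize_current
  uint64_t compressed_size = 0;            // CompressedDataSize_current
  std::map<uint64_t, ChunkRecord> chunks;  // per-chunk records
};

using Buffer = std::vector<uint8_t>;

// Process one chunk of an append-only write: try to compress it, keep whichever
// form is smaller, update the in-memory Compression Info, and return the bytes
// that go on to the Erasure Coding function.
Buffer process_chunk(CompressionInfo& ci, const Buffer& chunk,
                     const std::function<Buffer(const Buffer&)>& compress_fn) {
  Buffer out = compress_fn(chunk);       // hypothetical compressor call
  std::string method = "zlib";
  if (out.size() >= chunk.size()) {      // no space saving: store the chunk as-is
    out = chunk;
    method = "none";
  }
  uint64_t compressed_len = out.size();  // size before stripe padding

  // Step 6a: a new chunk record keyed by the chunk's original offset (appends are
  // sequential, so that offset equals the current original data size), pointing at
  // the current end of the compressed data set.
  ci.chunks[ci.original_size] = {method, ci.compressed_size};
  // Step 6b: update the Master Record totals.
  ci.original_size   += chunk.size();
  ci.compressed_size += compressed_len;

  // Step 4: pad to the stripe width before handing the chunk to the Erasure Coder.
  out.resize((out.size() + STRIPE_WIDTH - 1) / STRIPE_WIDTH * STRIPE_WIDTH, 0);
  return out;
}
</pre>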
Read request handling can be performed as follows:
1) The read operation handler receives an <object_id, data offset, data length> tuple that specifies the source object and the data block to read.
2) The handler retrieves the object xattrs related to compression info. The Master Record and a subset of chunk records (covering the [data offset, data offset + data length] range) should be preserved for subsequent use.
3) The handler obtains the offsets in the compressed data set for the desired chunks:
   start_chunk_compressed_offset = MapFn(data offset, false).chunk_coffset
   ending_chunk_compressed_offset = MapFn(data offset + data length, true).chunk_coffset
4) The handler retrieves the specific data subset pointed out by <start_chunk_compressed_offset, ending_chunk_compressed_offset> from the file stores and decodes it with the Erasure Decoder. This step is already present in the EC pool; we just provide the calculated data range instead of the user-provided one.
5) The handler iterates over the retrieved chunk data (compressed_data below) and decompresses each chunk independently. Chunk bounds can be calculated using ChunkSizesFn:
<pre>
   uncompressed_data = {}
   decompress_pos = 0
   // original offset where the collected uncompressed data will start
   starting_chunk_original_offset = chunk_offs = MapFn(data offset, false).chunk_offset
   end_offs = MapFn(data offset + data length, true).chunk_offset
   while( chunk_offs < end_offs ) {
     <chunk_size, chunk_csize> = ChunkSizesFn( chunk_offs );
     uncompressed_data += doDecompress( compressed_data, decompress_pos, chunk_csize )
     decompress_pos += chunk_csize
     chunk_offs += chunk_size
   }
</pre>
6) The handler extracts the desired data subset from the collected uncompressed data block:
  data_to_return = uncompressed_data.substr( data offset - starting_chunk_original_offset, data length )
7) The handler returns the data to the client.

Please note that the MapFn & ChunkSizesFn usage above is primarily for descriptive purposes. In fact all the required information can be retrieved by a single consolidated function call that performs a couple of lookups and partially iterates over the map only once, as in the sketch below.

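For instance, such a consolidated helper could look like the following: given the requested range it returns the mapped offsets plus per-chunk sizes in a single partial pass over the map (names and structure are assumptions for illustration):
<pre>
// Sketch of a consolidated lookup: one call returns everything a read needs.
#include <cstdint>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

using ChunkMap = std::map<uint64_t, uint64_t>;  // original offset -> compressed offset

struct ReadMapping {
  uint64_t start_chunk_original_offset;     // where the decompressed data will start
  uint64_t start_chunk_compressed_offset;   // first compressed byte to fetch
  uint64_t ending_chunk_compressed_offset;  // end of the compressed range to fetch
  std::vector<std::pair<uint64_t, uint64_t>> chunk_sizes;  // <original, compressed> per chunk
};

// Assumes a chunk record exists at or before 'offset' (true for any object with data).
ReadMapping map_read_range(const ChunkMap& chunks,
                           uint64_t original_size, uint64_t compressed_size,  // Master Record
                           uint64_t offset, uint64_t length) {
  ReadMapping m{};
  auto it = std::prev(chunks.upper_bound(offset));    // lookup #1: starting chunk
  m.start_chunk_original_offset   = it->first;
  m.start_chunk_compressed_offset = it->second;
  auto end_it = chunks.upper_bound(offset + length);  // lookup #2: first chunk past the range
  m.ending_chunk_compressed_offset =
      (end_it == chunks.end()) ? compressed_size : end_it->second;
  // Single partial iteration: per-chunk sizes for the decompression loop above.
  for (; it != end_it; ++it) {
    auto next = std::next(it);
    uint64_t next_off  = (next == chunks.end()) ? original_size   : next->first;
    uint64_t next_coff = (next == chunks.end()) ? compressed_size : next->second;
    m.chunk_sizes.emplace_back(next_off - it->first, next_coff - it->second);
  }
  return m;
}
</pre>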
Conclusion.
Surely the suggested approach has limited applicability - compression takes place for Erasure Coding pools only, so replicated pool users don't benefit from it. But the suggested approach looks like the only one of reasonable implementation complexity: any other approach would need specific means for handling random write access, which requires much more redesign and development effort. At the same time compression and Erasure Coding seem to merge easily and naturally.

*Work items*
Add compression support to EC pool
Refactor existing message compression infrastructure to support pluggable compression modules.

*Coding tasks*
Task 1
Task 2
Task 3

*Build / release tasks*
Task 1
Task 2
Task 3

*Documentation tasks*
Task 1
Task 2
Task 3

*Deprecation tasks*
Task 1
Task 2
Task 3