Rgw - compound object (phase 1)


This is a proposal to add a "compound object" concept to the RADOS Object Store that can group several small objects into a single RADOS object.


  • Ray Lv (Yahoo)

Interested Parties

  • Danny Al-Gaaf (Deutsche Telekom AG)

Current Status

Detailed Description

RADOSGW currently stores an object as one 512 KB head chunk followed by as many 4 MB tail chunks as needed. When most objects are small (<512 KB), this produces a large number of small files on RADOS. Moreover, in an erasure-coded pool each object is split into several (~10) data and parity chunks, so the OSD hosts end up holding more than 10x as many, even smaller, files. A commodity storage node today typically has tens of TB of disk capacity but only tens of GB of RAM. In the small-file scenario above, each host therefore holds more than 20 million files, which makes the dcache and inode cache inefficient with only tens of GB of RAM. The cache inefficiency raises read/write latency and lengthens the read tail latency for cold data.
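
The file-count estimate above can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a host with 20 TB of completely filled disk, 256 KB objects, and an erasure code with 10 data chunks. These figures and the function name are illustrative assumptions, not measurements from a real cluster, and filesystem overhead is ignored.

```python
import math

def chunk_files_per_host(disk_bytes, object_bytes, data_chunks=1):
    """Estimate how many chunk files fit on one completely filled host.

    data_chunks=1 models a replicated pool (one file per object copy).
    data_chunks=k models an erasure-coded pool in which every stored chunk,
    data or parity, is roughly object_bytes / k in size.
    """
    chunk_bytes = math.ceil(object_bytes / data_chunks)
    return disk_bytes // chunk_bytes

# Assumed figures, chosen only to illustrate the order of magnitude.
DISK_BYTES = 20 * 10**12      # 20 TB of disk on one host
OBJECT_BYTES = 256 * 1024     # 256 KB average object size

replicated = chunk_files_per_host(DISK_BYTES, OBJECT_BYTES)
erasure_coded = chunk_files_per_host(DISK_BYTES, OBJECT_BYTES, data_chunks=10)

print(f"replicated pool:    ~{replicated / 1e6:.0f} million files")
print(f"erasure-coded pool: ~{erasure_coded / 1e6:.0f} million files")
```

Under these assumptions both pool types exceed the 20 million files per host mentioned above, and the erasure-coded pool does so by roughly another order of magnitude.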

Possible Solutions

One approach is to reduce the memory used for filesystem metadata. Here are some possible solutions:
1) Haystack-like RADOS store: Small files are stored in a large file and accessed transparently through the librados API. However, this needs additional directory functionality to manage the mapping between object IDs and large files, and the impact on the OSD FileStore is considerable.
2) Use a K/V store rather than a file store for small files: The idea is to store small files as omap key/value pairs of shared objects or, going further, to store all files in a K/V store. However, the read performance of LevelDB with TBs of data is in question.
3) The application encodes/decodes small files to/from a single object with range metadata.

Solution Description

This proposal reduces the number of small files on RADOS by introducing a "compound object" concept to RADOSGW. For small files that are updated or deleted together, the application can group them into a single compound object and record each part's data range in the object's metadata. To read a particular part, RADOSGW decodes the data range tag, looks up the matching data range in the object's metadata, and retrieves the data in that range. In phase 1, delete and update apply only to the whole compound object, but the interface and data schema will allow future extensions. For a write, the client encodes the data parts into a single object and passes the range metadata. For a read, the client specifies the rangeTag parameter in the URL, and Ceph returns the data of that range.
  • Compatibility with HTTP Range: An HTTP Range header is interpreted relative to the data range selected by rangeTag.
  • Per-range metadata (compound metadata): Besides the byte range, metadata for a specific range can be defined on write. It is prefixed with "x-amz-meta-compound" to distinguish it from plain metadata.
  • Append ranges to an existing compound object: If a PUT operation targets an existing compound object, the data of the new ranges is appended to the compound object.
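
The HTTP Range compatibility rule can be illustrated with a small sketch. It assumes the gateway already knows the inclusive byte range of the tagged part and maps a relative Range request onto absolute offsets in the compound object. The function name, its parameters, and the red_icon offsets used in the example are hypothetical and are not part of the proposal.

```python
def resolve_range(part_start, part_end, first=None, last=None):
    """Map a Range relative to one tagged part onto absolute object offsets.

    part_start/part_end: inclusive byte range of the part in the compound object.
    first/last: values of "bytes=first-last"; first=None models the suffix
    form "bytes=-last" (the final `last` bytes of the part).
    Returns an inclusive (start, end) pair of absolute offsets.
    """
    part_len = part_end - part_start + 1
    if first is None and last is None:
        return part_start, part_end              # no Range header: whole part
    if first is None:
        if last <= 0:
            raise ValueError("range not satisfiable")
        n = min(last, part_len)                  # clamp the suffix to the part
        return part_end - n + 1, part_end
    if first < 0 or first >= part_len:
        raise ValueError("range not satisfiable")
    rel_last = part_len - 1 if last is None else min(last, part_len - 1)
    if rel_last < first:
        raise ValueError("range not satisfiable")
    return part_start + first, part_start + rel_last

# Assumed layout: blue_icon occupies 0-4361 and red_icon occupies 4362-9999.
print(resolve_range(4362, 9999, first=0, last=99))   # first 100 bytes of red_icon
print(resolve_range(4362, 9999, last=100))           # final 100 bytes of red_icon
```

Clamping the requested end to the end of the part keeps a Range request from reading into a neighbouring part.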


The data ranges of an object are stored as user-defined metadata of the object. The metadata includes the rangeTag and byte range of each part, as shown in the following examples.
1) PUT Object
PUT /Bucket001/6619810111 HTTP/1.1
x-amz-meta-compound-ranges: blue_icon="0-4361", red_icon="<start-byte>-<end-byte>", …

The PUT request can also contain per-range metadata in headers:
x-amz-meta-compound-ETag: blue_icon="xyzzy", red_icon="xyxxy", …
x-amz-meta-compound-Content-Type: blue_icon="application/octet-stream", red_icon="image/jpg", …
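
On the write side, the client has to concatenate the parts and derive the range header itself. The following is a minimal sketch of that step, assuming parts are laid out back to back in the order given and that byte ranges are inclusive, as in the blue_icon="0-4361" example. The helper name encode_compound is hypothetical, and the sketch only builds the body and header; it does not send the request.

```python
def encode_compound(parts):
    """Concatenate parts and build the x-amz-meta-compound-ranges header.

    parts: list of (range_tag, data_bytes) tuples, in storage order.
    Returns (body, headers), where headers is a dict to send with the PUT.
    """
    body = bytearray()
    ranges = []
    for tag, data in parts:
        if not data:
            raise ValueError(f"part {tag!r} is empty")
        start = len(body)                 # first byte of this part
        body += data
        end = len(body) - 1               # last byte of this part (inclusive)
        ranges.append(f'{tag}="{start}-{end}"')
    headers = {"x-amz-meta-compound-ranges": ", ".join(ranges)}
    return bytes(body), headers

# Example with made-up payloads: a 4362-byte part followed by a 100-byte part.
body, headers = encode_compound([
    ("blue_icon", b"\x00" * 4362),
    ("red_icon", b"\x01" * 100),
])
print(headers["x-amz-meta-compound-ranges"])
```

Per-range metadata headers such as x-amz-meta-compound-ETag could be assembled the same way, one tag="value" pair per part.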

2) GET Object
GET /Bucket001/6619810111?rangeTag=blue_icon HTTP/1.1
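
The read path can be sketched in the same spirit. The code below assumes the header value is a comma-separated list of tag="start-end" items with inclusive byte offsets, and that tags contain no commas, whitespace, or "=" characters. The function names are hypothetical; this illustrates the lookup and is not RADOSGW code.

```python
import re

# One item of the header value, e.g. blue_icon="0-4361"
_RANGE_ITEM = re.compile(r'\s*([^=,\s]+)="(\d+)-(\d+)"\s*')

def parse_compound_ranges(value):
    """Parse an x-amz-meta-compound-ranges value into {tag: (start, end)}."""
    ranges = {}
    for item in value.split(","):
        match = _RANGE_ITEM.fullmatch(item)
        if match is None:
            raise ValueError(f"malformed range item: {item!r}")
        tag, start, end = match.group(1), int(match.group(2)), int(match.group(3))
        if end < start:
            raise ValueError(f"inverted range for tag {tag!r}")
        ranges[tag] = (start, end)
    return ranges

def read_part(body, ranges_value, range_tag):
    """Return the bytes of the part selected by rangeTag (KeyError if unknown)."""
    start, end = parse_compound_ranges(ranges_value)[range_tag]
    return body[start:end + 1]            # end offset is inclusive

# Example with a made-up compound object holding two parts.
stored = b"\x00" * 4362 + b"\x01" * 100
meta = 'blue_icon="0-4361", red_icon="4362-4461"'
print(len(read_part(stored, meta, "blue_icon")))
```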

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3