Optimize Newstore for massive small objects storage

Summary
More and more companies are adopting Ceph as their storage solution. Ceph does extremely well for RBD and large-object storage, but results from both Intel and other users clearly show that it struggles in the "Lots Of Small Files" (LOSF) case.

In the LOSF case, the average object size is as small as tens to hundreds of KB, which is typically the size of a compressed image or an HTML, text, or PDF file. In the current approach, each object lives on the FS as an individual file, which usually means millions of files in the FS. This overwhelms the FS and introduces large read/write amplification, since every IO needs to go through the whole tree.

Newstore introduced fragment_list, which decouples the logical object from its physical location, and it can use open_by_handle to reduce the cost of tree traversal. The first design allows one object to have multiple fragments. We would now like to extend the object->fragment mapping from 1:N to N:M, that is, to let multiple objects share one fragment.
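The N:M mapping described above can be sketched as follows. This is a minimal illustration only, not newstore code: FragmentIndex, its methods, and the per-file reference count are hypothetical, and fid_t is reduced to a plain integer for brevity. The point it shows is that once several objects reference the same backing file, that file can only be unlinked after the last referencing object is gone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for newstore's fid_t (assumption: a plain integer id).
using fid_t = std::uint64_t;

struct fragment_t {
  std::uint32_t offset;  // offset in file to first byte of this fragment
  std::uint32_t length;  // length of fragment/extent
  fid_t fid;             // file backing this fragment
};

// Hypothetical index: object name -> fragments, plus a reference count per file.
class FragmentIndex {
 public:
  // Record that `object` owns bytes [offset, offset + length) of file `fid`.
  void add_fragment(const std::string& object, fragment_t frag) {
    assert(frag.length > 0);
    objects_[object].push_back(frag);
    ++file_refs_[frag.fid];
  }

  // Drop an object. Returns the files that no longer back any object and
  // could therefore be unlinked.
  std::vector<fid_t> remove_object(const std::string& object) {
    std::vector<fid_t> unreferenced;
    auto it = objects_.find(object);
    if (it == objects_.end()) return unreferenced;
    for (const fragment_t& frag : it->second) {
      auto ref = file_refs_.find(frag.fid);
      if (--ref->second == 0) {
        file_refs_.erase(ref);
        unreferenced.push_back(frag.fid);
      }
    }
    objects_.erase(it);
    return unreferenced;
  }

  std::size_t files_in_use() const { return file_refs_.size(); }

  std::size_t fragments_of(const std::string& object) const {
    auto it = objects_.find(object);
    return it == objects_.end() ? 0 : it->second.size();
  }

 private:
  std::map<std::string, std::vector<fragment_t>> objects_;
  std::map<fid_t, std::uint32_t> file_refs_;
};
```

With this shape, two small objects stored at offsets 0 and 4096 of the same file cost one file on the FS instead of two, at the price of tracking when the shared file becomes unreferenced.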

Owners
Xiaoxi CHEN (Intel)

Interested Parties

Xiaoxi CHEN (Intel)
Jian Zhang (Intel)
Guang Yang (Yahoo!)

Current Status

Newstore already has facilities for this: fragment_t already carries an offset and a length into the backing file.
struct fragment_t {
  uint32_t offset; ///< offset in file to first byte of this fragment
  uint32_t length; ///< length of fragment/extent
  fid_t fid; ///< file backing this fragment
};
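Because fragment_t already names a file plus an offset and a length, packing small objects back to back into one file needs no new on-disk field. A rough sketch of such a packer follows, under stated assumptions: FragmentPacker, allocate, and the fixed per-file size cap are hypothetical names and policy, and fid_t is again simplified to an integer. Note that the 32-bit offset limits the addressable range within a single backing file to 4 GiB.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for newstore's fid_t (assumption: a plain integer id).
using fid_t = std::uint64_t;

struct fragment_t {
  std::uint32_t offset;  // offset in file to first byte of this fragment
  std::uint32_t length;  // length of fragment/extent
  fid_t fid;             // file backing this fragment
};

// Hypothetical append-only packer: places small objects back to back in the
// current file and moves on to a new file once the size cap would be exceeded.
class FragmentPacker {
 public:
  explicit FragmentPacker(std::uint32_t max_file_bytes)
      : max_file_bytes_(max_file_bytes) {
    assert(max_file_bytes > 0);
  }

  // Reserve `length` bytes and describe them in `*out`.
  // Returns false if the request is empty or larger than a whole file.
  bool allocate(std::uint32_t length, fragment_t* out) {
    if (length == 0 || length > max_file_bytes_) return false;
    // next_offset_ never exceeds max_file_bytes_, so this cannot underflow.
    if (length > max_file_bytes_ - next_offset_) {
      ++current_fid_;    // no room left in the current file: start a new one
      next_offset_ = 0;
    }
    out->offset = next_offset_;
    out->length = length;
    out->fid = current_fid_;
    next_offset_ += length;
    return true;
  }

 private:
  std::uint32_t max_file_bytes_;
  std::uint32_t next_offset_ = 0;
  fid_t current_fid_ = 1;
};
```

A real design would also have to decide how space is reclaimed when an object in the middle of a shared file is deleted or overwritten; the sketch deliberately leaves that open.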

Detailed Description
