MDS - Inline data support (Step 2)


We have worked out a preliminary implementation of inline data support and observed a clear speedup for small-file access. To our knowledge, the algorithm has no correctness issues, even under shared-write or concurrent read/write access.
Step 2 will focus on simplifying the design by eliminating the half-inlined file state, and on solving protocol backward compatibility.


  • Li Wang (UbuntuKylin)

Interested Parties

  • Sage Weil (Inktank)

Current Status

A preliminary implementation exists; it needs improvement.

Detailed Description

Our initial implementation is described below. To keep things simple, the idea is to use the MDS as a read cache for inline data, instead of storing inline data in the inode.

The MDS reads the inline data from the OSD while it is fetching the metadata from the OSD.

CDir::_fetched {
  foreach (CInode inode in InodeList) {
    if (inode.size < INLINE_MAX_SIZE)
      // read this inode's inline data from the OSD into the MDS cache
  }
}
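The fetch step above can be sketched in standalone C++ as follows. This is only an illustration under stated assumptions: `Inode`, `OsdReader` and `cache_inline_data` are hypothetical stand-ins rather than Ceph types, and the value of `INLINE_MAX_SIZE` is made up because the text does not give one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical threshold; the text names INLINE_MAX_SIZE but gives no value.
constexpr std::size_t INLINE_MAX_SIZE = 4096;

// Simplified stand-in for CInode (not the real Ceph class).
struct Inode {
  std::uint64_t ino = 0;
  std::size_t size = 0;
  bool has_inline_cache = false;
  std::string inline_data;
};

// Stand-in for an OSD read of the first `len` bytes of file `ino`.
using OsdReader = std::function<std::string(std::uint64_t ino, std::size_t len)>;

// Mirrors the loop in CDir::_fetched: for every inode below the threshold,
// read its data from the OSD and keep it in the MDS-side cache.
// Returns the number of inodes whose cache was populated.
std::size_t cache_inline_data(std::vector<Inode>& inodes, const OsdReader& read_osd) {
  assert(read_osd && "an OSD reader is required");
  std::size_t cached = 0;
  for (Inode& in : inodes) {
    if (in.size < INLINE_MAX_SIZE) {
      in.inline_data = read_osd(in.ino, in.size);
      in.has_inline_cache = true;
      ++cached;
    }
  }
  return cached;
}
```

Passing the OSD read as a callback keeps the sketch testable without a cluster; a real MDS would issue the read asynchronously alongside the metadata fetch.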

The inline data is sent to the client in CAP_OP_GRANT messages when the client makes a getattr() call.

CInode::encode_cap_message {
  if (!have_send) {
    encode(inline data, message);  // attach the inline data only once
    have_send = true;
  }
}

The client read/write routines remain almost unchanged. Writers write inline data to the OSDs as usual; the only difference is that the first writer notifies the MDS to abandon its inline data cache by sending CAP_OP_UPDATE when the written range falls within [0 .. INLINE_MAX_SIZE].

The MDS fetches the inline data again when the last writer exits.
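A minimal sketch of this invalidate-and-refetch behaviour, assuming the MDS tracks a writer count per file. `InlineReadCache` and its method names are hypothetical, the refetch is modelled as a callback, and the threshold value is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical threshold; the text names INLINE_MAX_SIZE but gives no value.
constexpr std::size_t INLINE_MAX_SIZE = 4096;

// MDS-side read cache for one file's inline data (illustrative only).
class InlineReadCache {
 public:
  explicit InlineReadCache(std::function<std::string()> refetch)
      : refetch_(std::move(refetch)), data_(refetch_()), valid_(true) {}

  void writer_opened() { ++writers_; }

  // Models a CAP_OP_UPDATE from a writer: a write that touches the inline
  // range makes the cached copy stale, so the MDS abandons it.
  void write_notified(std::size_t offset, std::size_t len) {
    if (len > 0 && offset < INLINE_MAX_SIZE) {
      valid_ = false;
      data_.clear();
    }
  }

  // When the last writer exits, the MDS fetches the inline data again.
  void writer_closed() {
    assert(writers_ > 0);
    if (--writers_ == 0 && !valid_) {
      data_ = refetch_();
      valid_ = true;
    }
  }

  bool valid() const { return valid_; }
  const std::string& data() const { return data_; }

 private:
  std::function<std::string()> refetch_;  // stand-in for the OSD read
  std::string data_;
  bool valid_;
  std::size_t writers_ = 0;
};
```

In this sketch the cache stays invalid while any writer remains, so readers in that window would have to go to the OSDs; whether the real implementation behaves this way is not stated in the text.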

Our current implementation maintains a state machine with three states:
(1) INLINE: The file is small enough to be logically stored at the MDS;
(2) MIGRATION: The file size is larger than the inline threshold INLINE_SIZE; however, the data within [0, INLINE_SIZE] is still stored at the MDS;
(3) DISABLED: The file is no longer inline, which implies the inline data has been transferred to the OSD.

State transitions are irreversible: once a file has left a state, it never returns to it.
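The three states and the one-way rule can be sketched as below. This assumes that "irreversible" means a file may only stay in place or move forward in the order INLINE, MIGRATION, DISABLED; the function names and the threshold value are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical threshold; the text names INLINE_SIZE but gives no value.
constexpr std::size_t INLINE_SIZE = 4096;

// Declared in forward order so that a legal transition never decreases the value.
enum class InlineState { INLINE = 0, MIGRATION = 1, DISABLED = 2 };

// Assumption: a file may stay where it is or move forward, never back.
bool can_transition(InlineState from, InlineState to) {
  return static_cast<int>(to) >= static_cast<int>(from);
}

// A file that grows past the threshold leaves INLINE; shrinking never restores it.
InlineState on_size_change(InlineState cur, std::size_t new_size) {
  if (cur == InlineState::INLINE && new_size > INLINE_SIZE)
    return InlineState::MIGRATION;
  return cur;
}

// Once the data in [0, INLINE_SIZE] has been transferred to the OSD,
// inlining is switched off for good.
InlineState on_data_moved_to_osd(InlineState cur) {
  assert(can_transition(cur, InlineState::DISABLED));
  return InlineState::DISABLED;
}
```

MIGRATION is the half-inlined state that Step 2 aims to eliminate, which would reduce this to a two-state machine.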
A client confirms that it owns up-to-date and complete inline data by checking whether it holds the Fc cap.
Each time a client modifies inline data, it dirties the Fb cap to notify the MDS.
For each file, a version number ‘inline_version’ is stored in the metadata to indicate the inline state. It is shared between the MDS and clients. In addition, for each client and each file, the MDS maintains a pair of version numbers: ‘client_inline_version’ and ‘server_inline_version’. The former records the latest inline data version sent to the client through CAP_OP_GRANT, CAP_OP_REVOKE, etc. The version number ‘local_version’ maintained by the client should be equal to the corresponding ‘client_inline_version’. ‘server_inline_version’ records the new version after the client pushes new inline data to the MDS with CAP_OP_UPDATE. If the client has never sent CAP_OP_UPDATE, it should be equal to ‘client_inline_version’. The MDS needs two variables because, after receiving updated inline data from a client, it has no opportunity to tell that client the new inline data version.
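The bookkeeping described here might look roughly like the following sketch. The struct and function names are hypothetical, and two details are assumptions that the text does not spell out: that every CAP_OP_UPDATE bumps ‘inline_version’ by one, and that a client counts as current when its ‘server_inline_version’ matches the file's ‘inline_version’.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Per-client, per-file version pair kept by the MDS.
struct ClientVersions {
  std::uint64_t client_inline_version = 0;  // last version sent to the client
  std::uint64_t server_inline_version = 0;  // version after the client's last push
};

struct FileInlineInfo {
  std::uint64_t inline_version = 1;         // stored in the file metadata
  std::map<int, ClientVersions> clients;    // keyed by a hypothetical client id
};

// The MDS sends inline data to a client (e.g. with CAP_OP_GRANT):
// both numbers now match what the client holds as local_version.
void on_inline_data_sent(FileInlineInfo& f, int client) {
  ClientVersions& v = f.clients[client];
  v.client_inline_version = f.inline_version;
  v.server_inline_version = f.inline_version;
}

// A client pushes new inline data with CAP_OP_UPDATE. The MDS cannot report
// the new version back, so only server_inline_version moves forward.
void on_client_update(FileInlineInfo& f, int client) {
  ClientVersions& v = f.clients[client];
  ++f.inline_version;  // assumption: each update bumps the version by one
  v.server_inline_version = f.inline_version;
  assert(v.server_inline_version > v.client_inline_version);
}

// Assumption: the client that produced the newest version holds current data.
bool client_is_current(const FileInlineInfo& f, int client) {
  auto it = f.clients.find(client);
  return it != f.clients.end() &&
         it->second.server_inline_version == f.inline_version;
}
```

After client 1 pushes an update, its ‘client_inline_version’ still reflects what it was last told, while ‘server_inline_version’ lets the MDS recognise that client 1 needs no resend and that other clients do.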


(1) Read operation
if (have cap Fc)
  copy inline data into page cache/user buffer
else
  call getattr() to retrieve inline data from MDS

(2) Write operation
if (have cap Fc)
  merge the written data locally by virtue of page cache
  according to the file state, commit the data to MDS/OSD
else
  submit the written data directly to MDS, the latter will do the merging job
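The two client paths can be sketched as follows, with the page cache and the MDS reduced to in-memory strings. `FakeMds`, `Client` and `merge_write` are hypothetical names for illustration; the later commit of locally merged data to the MDS/OSD is only marked with a dirty flag here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Overlay `data` onto `base` at offset `off`, zero-filling any gap.
std::string merge_write(std::string base, std::size_t off, const std::string& data) {
  if (base.size() < off + data.size()) base.resize(off + data.size(), '\0');
  base.replace(off, data.size(), data);
  assert(base.size() >= off + data.size());
  return base;
}

// Stand-in for the MDS side of the protocol.
struct FakeMds {
  std::string inline_data;
  std::string getattr() const { return inline_data; }
  // Without Fc the client ships the write and the MDS does the merging.
  void submit_write(std::size_t off, const std::string& data) {
    inline_data = merge_write(inline_data, off, data);
  }
};

struct Client {
  bool have_fc = false;   // holding the Fc cap means the local copy is complete
  bool dirty = false;     // local changes not yet committed to MDS/OSD
  std::string cached;     // stand-in for the page cache

  std::string read(const FakeMds& mds) const {
    return have_fc ? cached : mds.getattr();
  }

  void write(FakeMds& mds, std::size_t off, const std::string& data) {
    if (have_fc) {
      cached = merge_write(cached, off, data);  // merge locally
      dirty = true;                             // commit later, per file state
    } else {
      mds.submit_write(off, data);
    }
  }
};
```

For example, a client without Fc that writes " mds" at offset 5 of a file containing "hello" leaves the MDS holding "hello mds", while a client with Fc only changes its own cached copy until it commits.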

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3