Inline data support for Ceph

Summary

Inline data is a proven technique for accelerating small-file access and is present in mainstream local file systems such as ext4 and btrfs. Ceph should benefit from the same optimization, since it saves the client from calculating the object location and communicating with the OSDs. It is expected to speed up IO for small-file traffic.

Owners

  • Li Wang (UbuntuKylin)

Interested Parties

  • Greg Farnum
  • Sage Weil
  • Loic Dachary

Current Status

Under design

Detailed Description

In a typical Ceph file access, the client first asks the MDS for metadata, then communicates with the OSDs for file data.
If a file is very small, its data can be stored together with the metadata, as an extended attribute. When opening a small
file, the client receives the file data along with the metadata from the MDS, which saves both the calculation of the object location and the communication with the OSDs.
INLINEDATA will be a mount option that can be turned on.

Algorithm

The key idea is to maintain a state machine with three states:

INLINED indicates that the first page of a file is stored on the MDS;
NOTINLINING indicates that the file is no longer intended to be inlined, but the first page remains UPTODATE on the MDS;
NOTINLINED indicates that the first page is stored on the OSDs.
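
The three states can be written down as a small C enum. This is only a sketch: the type name and the two helper functions are hypothetical, and the helpers simply restate which states let the MDS serve or accept the first page, as implied by the client-side pseudocode in this blueprint.

```c
#include <stdbool.h>

/* Hypothetical type name; the blueprint only names the three states. */
enum inline_state {
    INLINED,      /* first page is stored on the MDS */
    NOTINLINING,  /* leaving inline mode; MDS copy of the first page is still up to date */
    NOTINLINED    /* first page is stored on the OSDs */
};

/* True when the MDS can still serve reads of page zero. */
static bool mds_can_serve_first_page(enum inline_state s)
{
    return s == INLINED || s == NOTINLINING;
}

/* True when the MDS still accepts writes to page zero. */
static bool mds_accepts_first_page_write(enum inline_state s)
{
    return s == INLINED;
}
```

Under this reading, NOTINLINING is a read-only grace state: the MDS copy stays readable until the first write to page zero lands on the OSDs.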

To keep frequent writes from introducing extra IO overhead on the MDS, the MDS records the write frequency of inlined files;
if it exceeds the threshold, the MDS moves the file to the NOTINLINING state to force the client to write to the OSDs.
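
The blueprint does not say how the write frequency is measured. One possible approach, shown here purely as an assumption, is to count page-zero writes inside a fixed time window. The struct, the function name, and both constants below are hypothetical.

```c
#include <stdint.h>

enum inline_state { INLINED, NOTINLINING, NOTINLINED };

#define WRITE_THRESHOLD 8   /* assumed: page-zero writes tolerated per window */
#define WINDOW_SECONDS  10  /* assumed: window length in seconds */

struct inline_inode {
    enum inline_state status;
    uint64_t window_start;      /* start of the current window, in seconds */
    unsigned writes_in_window;  /* page-zero writes seen in this window */
};

/* Record one page-zero write at time `now`; demote the file if it is too hot. */
static void record_first_page_write(struct inline_inode *in, uint64_t now)
{
    if (now - in->window_start >= WINDOW_SECONDS) {
        /* The previous window has expired: start counting afresh. */
        in->window_start = now;
        in->writes_in_window = 0;
    }
    in->writes_in_window++;
    if (in->status == INLINED && in->writes_in_window > WRITE_THRESHOLD)
        in->status = NOTINLINING;
}
```

A windowed count keeps a file inlined if it is written often over its lifetime but rarely in any short period; a plain lifetime counter would eventually demote every file.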

1 Client side

1.1 write_page()

if (page->index == 0 && inode->status == INLINED) {
    err = write_page_to_mds();
    if (err == ESTATUS) // status has changed to NOTINLINING or NOTINLINED
        write_page_to_osd();
    return;
}
write_page_to_osd();
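
The fallback in write_page() can be exercised in user space with a stubbed MDS. The sketch below is not the kernel client code: the structs, the numeric value of ESTATUS, and the TO_MDS/TO_OSD return values are assumptions made so that the routing decision is observable.

```c
enum inline_state { INLINED, NOTINLINING, NOTINLINED };
enum { OK = 0, ESTATUS = 1 };    /* numeric value of ESTATUS is assumed */
enum target { TO_MDS, TO_OSD };  /* where the page ended up */

struct mds_stub { enum inline_state status; };      /* authoritative state */
struct client_inode { enum inline_state status; };  /* cached, possibly stale */

/* Stub MDS: accepts a page-zero write only while the file is INLINED. */
static int write_page_to_mds(const struct mds_stub *mds)
{
    return mds->status == INLINED ? OK : ESTATUS;
}

/* Route one page write; fall back to the OSDs if the MDS rejects it. */
static enum target write_page(const struct client_inode *ci,
                              const struct mds_stub *mds,
                              unsigned long index)
{
    if (index == 0 && ci->status == INLINED) {
        if (write_page_to_mds(mds) == OK)
            return TO_MDS;
        return TO_OSD;  /* ESTATUS: the client's cached status was stale */
    }
    return TO_OSD;
}
```

The point of the ESTATUS round trip is that the client may act on a stale INLINED status without any extra locking: the MDS rejects the write and the page still reaches the OSDs.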

1.2 ceph_write_end()

if (inode->status == INLINED) {
    if (write_pos > PAGE_SIZE) {
        inode->status = NOTINLINING;
        mark_inode_dirty(); // asynchronously tell the MDS to change the status to NOTINLINING
    }
    if (the interval [write_pos, write_pos + write_len] overlaps the interval [0, PAGE_SIZE]) {
        inode->status = NOTINLINED;
        mark_inode_dirty();
    }
}
if (inode->status == NOTINLINING) {
    if (the interval [write_pos, write_pos + write_len] overlaps the interval [0, PAGE_SIZE]) {
        inode->status = NOTINLINED;
        mark_inode_dirty();
    }
}
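
The interval test used in ceph_write_end() can be made concrete. The sketch below assumes half-open byte ranges, [write_pos, write_pos + write_len) and [0, PAGE_SIZE), rather than the closed intervals written above, so that a write starting exactly at PAGE_SIZE does not count as touching the first page. The function names and the 4096-byte page size are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define INLINE_PAGE_SIZE 4096ULL  /* assumed page size */

/* Do the half-open ranges [a, a + a_len) and [b, b + b_len) share a byte? */
static bool ranges_overlap(unsigned long long a, size_t a_len,
                           unsigned long long b, size_t b_len)
{
    return a_len > 0 && b_len > 0 && a < b + b_len && b < a + a_len;
}

/* Does a write of `len` bytes at `pos` touch the first page? */
static bool touches_first_page(unsigned long long pos, size_t len)
{
    return ranges_overlap(pos, len, 0, (size_t)INLINE_PAGE_SIZE);
}
```

For example, a 10-byte write at offset 4095 touches the first page, while the same write at offset 4096 does not.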

1.3 read_page()

if (page->index == 0 && (inode->status == INLINED || inode->status == NOTINLINING)) {
    err = read_page_from_mds();
    if (err == ESTATUS) // status has changed to NOTINLINED
        read_page_from_osd();
    return;
}
read_page_from_osd();

2 MDS side

if (inode->status == INLINED && write_frequency_of_page_zero > THRESHOLD)
    inode->status = NOTINLINING;
if (received_write_page_zero_request_from_client() && inode->status == NOTINLINING) {
    err = ESTATUS;
    send_response_to_client(err);
    inode->status = NOTINLINED;
}
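
The MDS-side transitions can be sketched as a single request handler in plain C. The struct layout, the THRESHOLD value, and the numeric value of ESTATUS are assumptions. The write frequency is taken as an input field here, since the blueprint does not specify how it is maintained, and returning ESTATUS for an already NOTINLINED file is inferred from the comment in the client-side write_page().

```c
enum inline_state { INLINED, NOTINLINING, NOTINLINED };
enum { OK = 0, ESTATUS = 1 };  /* numeric value of ESTATUS is assumed */

#define THRESHOLD 4            /* assumed write-frequency limit */

struct mds_inode {
    enum inline_state status;
    unsigned write_frequency_of_page_zero;  /* maintained elsewhere */
};

/* Handle one page-zero write request; the return value is sent to the client. */
static int handle_write_page_zero(struct mds_inode *in)
{
    if (in->status == INLINED &&
        in->write_frequency_of_page_zero > THRESHOLD)
        in->status = NOTINLINING;

    if (in->status == NOTINLINING) {
        /* Reject the write; the client will put page zero on the OSDs. */
        in->status = NOTINLINED;
        return ESTATUS;
    }
    if (in->status == NOTINLINED)
        return ESTATUS;  /* page zero already lives on the OSDs */

    return OK;           /* still INLINED: the MDS stores the page */
}
```

In this sketch a file that crosses the threshold passes through NOTINLINING and reaches NOTINLINED within the same request, because the rejected write is exactly the event that moves page zero to the OSDs.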

Work items