Project

General

Profile

Kernel client read ahead optimization

Summary

Currently Ceph relies on the general-purpose page cache read-ahead algorithm of the Linux kernel, which was designed for traditional electromechanical magnetic disks, to take care of pre-fetching. However, the storage backend of Ceph is quite different from that of a traditional local file system, and the default pre-fetching behavior is often too conservative for Ceph. The OSD cluster, composed of up to thousands of nodes, serves data that is striped across the OSD nodes. It can therefore provide much higher IO bandwidth and supports IO operations in a gather/scatter manner. In addition, it is possible to capture more read patterns than just sequential read, such as stride read, to improve pre-fetching performance.

Owners

  • Li Wang (UbuntuKylin)

Interested Parties

  • Yan, Zheng (Intel)
  • Sage Weil (Inktank)

Current Status

Under design.

Detailed Description

According to the experimental results below, for a single thread the sequential read performance increases from 186MB/s to 920MB/s as the pre-fetching window grows from 8MB to 1GB. However, with more concurrency, the optimal pre-fetching window size decreases. This is easy to understand: with a small number of threads the client is starved because too little data is in flight, and the OSDs do not provide enough data to feed the client. Since the latency of a Cephfs read transaction is relatively large, the read performance is poor even with 24 OSDs. As the amount of in-flight data grows to match the client's consumption speed, throughput improves thanks to concurrency.
For sequential read, the general prefetching algorithm of the Linux kernel works well; the task may simply amount to feeding an appropriate prefetching size to Cephfs. But how large should the window be? It should be related to the concurrency, the client memory consumption, and perhaps the number of OSDs. We need a simple yet efficient algorithm to determine the optimal prefetching window size, and to make it self-adaptive.
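As a starting point, one possible heuristic is to keep a fixed client-wide pipeline budget and divide it among the concurrent sequential streams, clamping the per-stream result. This is only an illustrative sketch; the budget and clamp constants below are assumptions, not measured or tuned values:

```python
# Hypothetical heuristic sketch: size the per-stream readahead window by
# dividing a fixed client-wide in-flight budget among concurrent streams.
# All constants here are illustrative assumptions, not tuned values.

MIN_WINDOW = 8 * 2**20        # 8 MB: never prefetch less than this
MAX_WINDOW = 1024 * 2**20     # 1 GB: cap for a single stream
PIPELINE_BUDGET = 2 * 2**30   # assumed total in-flight budget per client

def readahead_window(num_streams: int) -> int:
    """Return a per-stream prefetch window in bytes, clamped to sane bounds."""
    share = PIPELINE_BUDGET // max(num_streams, 1)
    return max(MIN_WINDOW, min(MAX_WINDOW, share))
```

With these assumed constants a single stream gets the full 1GB cap while 32 streams get 64MB each, matching the direction (though not the exact values) of the measurements below; a real implementation would also have to account for client memory pressure and the number of OSDs.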
For stride read, the prefetching algorithm of the Linux kernel does not help. Does Ceph need its own?
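For illustration, stride detection can be sketched as tracking the deltas between successive read offsets and predicting ahead once a constant stride repeats. This is a simplified user-level model of what a kernel- or Cephfs-level detector might do; the class name, history length, and confirmation threshold are assumptions:

```python
# Simplified sketch of stride-read detection: if the last few read offsets
# are separated by a constant non-zero stride, predict the next offsets to
# prefetch. History length and confirmation count are illustrative choices.

from collections import deque

class StrideDetector:
    def __init__(self, confirm: int = 2, depth: int = 4):
        self.offsets = deque(maxlen=confirm + 1)
        self.confirm = confirm   # equal deltas required before trusting the stride
        self.depth = depth       # how many future offsets to predict

    def record(self, offset: int):
        self.offsets.append(offset)

    def predict(self):
        """Return predicted future offsets, or [] if no stable stride is seen."""
        if len(self.offsets) < self.confirm + 1:
            return []
        history = list(self.offsets)
        deltas = [b - a for a, b in zip(history, history[1:])]
        if len(set(deltas)) != 1 or deltas[0] == 0:
            return []
        stride = deltas[0]
        return [history[-1] + stride * i for i in range(1, self.depth + 1)]
```

For example, after recording reads at offsets 0, 1MB, and 2MB, the detector predicts the next four 1MB-strided offsets; a sequential stream is just the special case where the stride equals the read size.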

Test environment:
Memory: 64GB
Network bandwidth: 14Gbps+
Disk: 90MB/s
24 OSDs x 1 Linux kernel client (3.5.0)

Iozone sequential read throughput (KB/s) vs. readahead window size

File size(-s)  Procs(-t)         8M         16M         32M         64M        128M        256M        512M       1024M
32G                    1   186163.09   297324.84   417127.47   555186.94   709306.69   817544.44   878351.69   942494.00
16G                    2   342822.67   486913.06   621150.12   801558.44   887408.12   943116.62   888473.03   920779.59
8G                     4   538862.75   713201.77   871556.72   957093.25   946758.56   872444.28   923143.67   911103.61
4G                     8   825311.97   924942.88   900193.66   905163.27   925176.36   906694.10   861454.98   872700.76
2G                    16   925981.14   880469.41   928986.88   902153.14   809175.16   800078.70   783691.74   751467.59
1G                    32   916628.86   885516.79   879351.34   840179.34   772684.61   779912.65   835743.68   779849.47
2G                    32   918887.91   872601.05   870850.50   871713.59   823688.11   769177.47   815892.45   652991.57
4G                    32   886133.51   885311.87   906219.99   886668.75   863749.37   835691.18   800252.56   561352.22
8G                    32   922218.38   887759.56   908235.33   899676.75   878892.64   700063.10   694623.93   513792.87
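To make the trend explicit, the snippet below takes a few rows from the measurements above and picks the readahead size with the highest throughput for each thread count:

```python
# Select the best-performing readahead window per thread count, using the
# 32G/1, 8G/4, 2G/16 and 1G/32 rows from the iozone results above (KB/s).

WINDOWS_MB = [8, 16, 32, 64, 128, 256, 512, 1024]

RESULTS = {
    1:  [186163.09, 297324.84, 417127.47, 555186.94, 709306.69, 817544.44, 878351.69, 942494.00],
    4:  [538862.75, 713201.77, 871556.72, 957093.25, 946758.56, 872444.28, 923143.67, 911103.61],
    16: [925981.14, 880469.41, 928986.88, 902153.14, 809175.16, 800078.70, 783691.74, 751467.59],
    32: [916628.86, 885516.79, 879351.34, 840179.34, 772684.61, 779912.65, 835743.68, 779849.47],
}

def best_window(threads: int) -> int:
    """Return the readahead size (MB) with the highest measured throughput."""
    row = RESULTS[threads]
    return WINDOWS_MB[row.index(max(row))]

for t in sorted(RESULTS):
    print(t, "threads ->", best_window(t), "MB")
```

On these rows the optimal window shrinks from 1024MB at one thread to 8MB at 32 threads, which is why the tuning must be concurrency-aware rather than a single static value.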

Work items

Coding tasks

  1. More testing to determine the best prefetching window size under different numbers of OSDs
  2. Work out an algorithm to obtain the optimal prefetching window size
  3. Work out a stride read prefetching algorithm; perhaps push it upstream to the Linux kernel mm subsystem?

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3