h1. Kernel client read ahead optimization

h3. Summary

Currently Ceph relies on the kernel's general-purpose page cache readahead algorithm, designed around traditional electromechanical magnetic disks, to take care of prefetching. However, the storage backend of Ceph is quite different from that of a traditional local file system, and the default prefetching behavior is sometimes too conservative for Ceph. The OSD cluster, composed of up to thousands of nodes, is responsible for serving data, and file data is striped across the OSD nodes. It can therefore provide much higher I/O bandwidth and supports I/O operations in a scatter/gather manner. In addition, it is possible to capture more read patterns than just sequential reads, for example strided reads, to improve prefetching performance.
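As a rough illustration of why a disk-oriented default is conservative here (assuming the default 4 MB CephFS object size), the number of OSD reads a readahead window can keep in flight is simply the window divided by the object size:

<pre><code class="c">
/*
 * Back-of-the-envelope sketch, assuming the default 4 MB object size;
 * purely illustrative, not client code.
 */
#define OBJECT_SIZE (4UL << 20)

/* Objects, and hence parallel OSD reads, covered by a readahead window. */
static unsigned int ra_parallelism(unsigned long window)
{
	return window / OBJECT_SIZE;    /* 8 MB -> 2 objects, 1 GB -> 256 */
}
</code></pre>

So an 8 MB window, the smallest size measured below, keeps only a couple of OSD reads in flight at a time, which is why widening it matters so much in the measurements.
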

h3. Owners

* Li Wang (UbuntuKylin)

h3. Interested Parties

* Yan, Zheng (Intel)
* Sage Weil (Inktank)
* Name

h3. Current Status

Under design.

h3. Detailed Description

According to the experimental results below, for a single thread the sequential read performance increases from 186 MB/s to 920 MB/s as the prefetching window is enlarged from 8 MB to 1 GB. However, with more concurrency, the optimal prefetching window size decreases. That is easy to understand: with a small number of threads the client is starved, because the pool of prefetched data is too small and the OSDs do not deliver enough data to feed the client; and since the latency of a CephFS read transaction is relatively large, read performance is poor even with 24 OSDs. As the amount of data in flight grows to match the client's consumption speed, throughput improves thanks to concurrency.

For sequential reads, the general prefetching algorithm of the Linux kernel works well; the task may simply come down to feeding an appropriate prefetching window size to CephFS. But how large should the window be? It should be related to the concurrency, the client's memory consumption, and perhaps the number of OSDs. We need a simple yet efficient algorithm to determine the optimal prefetching window size, and to make it self-adaptive.

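To make this concrete, below is a minimal sketch of one possible self-adaptive scheme, written as plain C. None of it is existing client code; the names (@ra_state@, @ra_cap@, @ra_update@) and the 8 MB / 1 GB bounds are assumptions for illustration. The window ramps up while reads stay sequential and is capped by a per-client memory budget divided among the streams currently being read, i.e. the concurrency, reflecting the factors listed above.

<pre><code class="c">
/*
 * Illustrative sketch only, not existing kernel client code; all names
 * and constants are assumptions.  Idea: ramp the readahead window up
 * while reads stay sequential, and cap it by a per-client memory
 * budget shared among the concurrently read streams.
 */
#define RA_MIN      (8UL << 20)    /* 8 MB, smallest size measured below  */
#define RA_HARD_MAX (1UL << 30)    /* 1 GB, best single-stream size below */

struct ra_state {
	unsigned long window;      /* current readahead window in bytes   */
	unsigned long next_off;    /* expected start of the next read     */
};

static unsigned long ra_cap(unsigned long mem_budget, unsigned int nr_streams)
{
	unsigned long cap = mem_budget / (nr_streams ? nr_streams : 1);

	if (cap > RA_HARD_MAX)
		cap = RA_HARD_MAX;
	if (cap < RA_MIN)
		cap = RA_MIN;
	return cap;
}

/* Called on every read; returns how far ahead of 'off' to prefetch. */
static unsigned long ra_update(struct ra_state *ra, unsigned long off,
			       unsigned long len, unsigned long mem_budget,
			       unsigned int nr_streams)
{
	unsigned long cap = ra_cap(mem_budget, nr_streams);

	if (ra->window < RA_MIN)
		ra->window = RA_MIN;      /* zero-initialized state        */

	if (off == ra->next_off)
		ra->window *= 2;          /* still sequential: ramp up     */
	else
		ra->window = RA_MIN;      /* pattern broken: back off      */

	if (ra->window > cap)
		ra->window = cap;

	ra->next_off = off + len;
	return ra->window;
}
</code></pre>

With a single stream the cap converges toward the 1 GB window that performed best in the single-thread run, while with 16 or 32 streams each stream gets a much smaller window, which is consistent with the trend in the table below.
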
For strided reads, the prefetching algorithm of the Linux kernel won't help. Does Ceph need its own?

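If CephFS does grow its own strided-read support, the detection half could be as small as the sketch below; again this is only an illustration with made-up names (@stride_state@, @stride_update@), not existing code. It remembers the gap between successive read offsets and, once the same gap repeats a few times, predicts where the following reads will land so they can be prefetched from the OSDs in parallel.

<pre><code class="c">
/*
 * Illustrative stride detector, not existing code; names are assumptions.
 * Track the gap between successive read offsets; once the same gap has
 * repeated a few times, the next offsets (off + stride, off + 2*stride,
 * ...) can be prefetched.
 */
struct stride_state {
	unsigned long last_off;   /* start offset of the previous read      */
	unsigned long last_len;   /* length of the previous read            */
	long stride;              /* distance between consecutive starts    */
	unsigned int hits;        /* consecutive reads matching the stride  */
};

#define STRIDE_MIN_HITS 3         /* require a few matches before trusting */

/* Returns the predicted start of the next read, or 0 if no stable pattern. */
static unsigned long stride_update(struct stride_state *s,
				   unsigned long off, unsigned long len)
{
	long gap = (long)(off - s->last_off);

	if (len == s->last_len && gap == s->stride && gap > 0) {
		s->hits++;
	} else {
		s->stride = gap;
		s->hits = 1;
	}

	s->last_off = off;
	s->last_len = len;

	return (s->hits >= STRIDE_MIN_HITS) ? off + (unsigned long)s->stride : 0;
}
</code></pre>

The prediction could feed a CephFS-specific prefetcher or, as the third coding task below suggests, be proposed for the generic Linux mm readahead code.
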
Test environment:

* Memory: 64 GB
* Network bandwidth: 14 Gbps+
* Disk: 90 MB/s
* 24 OSDs, 1 Linux kernel client (kernel 3.5.0)

|\2=. Iozone parameters|\8=. Readahead size (cells show throughput in KB/s)|
|File size (-s)|Processes (-t)|8M|16M|32M|64M|128M|256M|512M|1024M|
|32G|1|186163.09|297324.84|417127.47|555186.94|709306.69|817544.44|878351.69|942494.00|
|16G|2|342822.67|486913.06|621150.12|801558.44|887408.12|943116.62|888473.03|920779.59|
|8G|4|538862.75|713201.77|871556.72|957093.25|946758.56|872444.28|923143.67|911103.61|
|4G|8|825311.97|924942.88|900193.66|905163.27|925176.36|906694.10|861454.98|872700.76|
|2G|16|925981.14|880469.41|928986.88|902153.14|809175.16|800078.70|783691.74|751467.59|
|1G|32|916628.86|885516.79|879351.34|840179.34|772684.61|779912.65|835743.68|779849.47|
|2G|32|918887.91|872601.05|870850.50|871713.59|823688.11|769177.47|815892.45|652991.57|
|4G|32|886133.51|885311.87|906219.99|886668.75|863749.37|835691.18|800252.56|561352.22|
|8G|32|922218.38|887759.56|908235.33|899676.75|878892.64|700063.10|694623.93|513792.87|

h3. Work items

h4. Coding tasks

# More testing to determine the best prefetching window size under different numbers of OSDs
# Work out an algorithm to obtain the optimal prefetching window size
# Work out a strided read prefetching algorithm; maybe push it upstream to the Linux kernel mm?

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3