h1. Kernel client read ahead optimization

h3. Summary

Currently Ceph relies on the kernel's general-purpose page cache readahead algorithm, designed around traditional electromechanical magnetic disks, to take care of prefetching. However, the storage backend of Ceph is quite different from that of a traditional local file system, and the default prefetching behavior is sometimes too conservative for Ceph. The OSD cluster, composed of up to thousands of nodes, is responsible for serving data, and file data is striped across the OSD nodes. It can therefore provide much higher I/O bandwidth and supports I/O operations in a scatter/gather manner. In addition, it is possible to capture more read patterns than just sequential reads, for example strided reads, to improve prefetching performance.
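As a rough illustration of why a disk-oriented default is conservative here (assuming the default 4 MB CephFS object size), the number of OSD reads a readahead window can keep in flight is simply the window divided by the object size:

<pre><code class="c">
/*
 * Back-of-the-envelope sketch, assuming the default 4 MB object size;
 * purely illustrative, not client code.
 */
#define OBJECT_SIZE (4UL << 20)

/* Objects, and hence parallel OSD reads, covered by a readahead window. */
static unsigned int ra_parallelism(unsigned long window)
{
	return window / OBJECT_SIZE;    /* 8 MB -> 2 objects, 1 GB -> 256 */
}
</code></pre>

So an 8 MB window, the smallest size measured below, keeps only a couple of OSD reads in flight at a time, which is why widening it matters so much in the measurements.
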

h3. Owners

* Li Wang (UbuntuKylin)

h3. Interested Parties

* Yan, Zheng (Intel)
* Sage Weil (Inktank)
* Name

h3. Current Status

Under design.

h3. Detailed Description

According to the experimental results below, for a single thread the sequential read performance increases from 186 MB/s to 920 MB/s as the prefetching window is enlarged from 8 MB to 1 GB. However, with more concurrency, the optimal prefetching window size decreases. That is easy to understand: with a small number of threads the client is starved, because the pool of prefetched data is too small and the OSDs do not deliver enough data to feed the client; and since the latency of a CephFS read transaction is relatively large, read performance is poor even with 24 OSDs. As the amount of data in flight grows to match the client's consumption speed, throughput improves thanks to concurrency.

For sequential reads, the general prefetching algorithm of the Linux kernel works well; the task may simply come down to feeding an appropriate prefetching window size to CephFS. But how large should the window be? It should be related to the concurrency, the client's memory consumption, and perhaps the number of OSDs. We need a simple yet efficient algorithm to determine the optimal prefetching window size, and to make it self-adaptive.

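To make this concrete, below is a minimal sketch of one possible self-adaptive scheme, written as plain C. None of it is existing client code; the names (@ra_state@, @ra_cap@, @ra_update@) and the 8 MB / 1 GB bounds are assumptions for illustration. The window ramps up while reads stay sequential and is capped by a per-client memory budget divided among the streams currently being read, i.e. the concurrency, reflecting the factors listed above.

<pre><code class="c">
/*
 * Illustrative sketch only, not existing kernel client code; all names
 * and constants are assumptions.  Idea: ramp the readahead window up
 * while reads stay sequential, and cap it by a per-client memory
 * budget shared among the concurrently read streams.
 */
#define RA_MIN      (8UL << 20)    /* 8 MB, smallest size measured below  */
#define RA_HARD_MAX (1UL << 30)    /* 1 GB, best single-stream size below */

struct ra_state {
	unsigned long window;      /* current readahead window in bytes   */
	unsigned long next_off;    /* expected start of the next read     */
};

static unsigned long ra_cap(unsigned long mem_budget, unsigned int nr_streams)
{
	unsigned long cap = mem_budget / (nr_streams ? nr_streams : 1);

	if (cap > RA_HARD_MAX)
		cap = RA_HARD_MAX;
	if (cap < RA_MIN)
		cap = RA_MIN;
	return cap;
}

/* Called on every read; returns how far ahead of 'off' to prefetch. */
static unsigned long ra_update(struct ra_state *ra, unsigned long off,
			       unsigned long len, unsigned long mem_budget,
			       unsigned int nr_streams)
{
	unsigned long cap = ra_cap(mem_budget, nr_streams);

	if (ra->window < RA_MIN)
		ra->window = RA_MIN;      /* zero-initialized state        */

	if (off == ra->next_off)
		ra->window *= 2;          /* still sequential: ramp up     */
	else
		ra->window = RA_MIN;      /* pattern broken: back off      */

	if (ra->window > cap)
		ra->window = cap;

	ra->next_off = off + len;
	return ra->window;
}
</code></pre>

With a single stream the cap converges toward the 1 GB window that performed best in the single-thread run, while with 16 or 32 streams each stream gets a much smaller window, which is consistent with the trend in the table below.
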
For strided reads, the prefetching algorithm of the Linux kernel won't help. Does Ceph need its own?

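If CephFS does grow its own strided-read support, the detection half could be as small as the sketch below; again this is only an illustration with made-up names (@stride_state@, @stride_update@), not existing code. It remembers the gap between successive read offsets and, once the same gap repeats a few times, predicts where the following reads will land so they can be prefetched from the OSDs in parallel.

<pre><code class="c">
/*
 * Illustrative stride detector, not existing code; names are assumptions.
 * Track the gap between successive read offsets; once the same gap has
 * repeated a few times, the next offsets (off + stride, off + 2*stride,
 * ...) can be prefetched.
 */
struct stride_state {
	unsigned long last_off;   /* start offset of the previous read      */
	unsigned long last_len;   /* length of the previous read            */
	long stride;              /* distance between consecutive starts    */
	unsigned int hits;        /* consecutive reads matching the stride  */
};

#define STRIDE_MIN_HITS 3         /* require a few matches before trusting */

/* Returns the predicted start of the next read, or 0 if no stable pattern. */
static unsigned long stride_update(struct stride_state *s,
				   unsigned long off, unsigned long len)
{
	long gap = (long)(off - s->last_off);

	if (len == s->last_len && gap == s->stride && gap > 0) {
		s->hits++;
	} else {
		s->stride = gap;
		s->hits = 1;
	}

	s->last_off = off;
	s->last_len = len;

	return (s->hits >= STRIDE_MIN_HITS) ? off + (unsigned long)s->stride : 0;
}
</code></pre>

The prediction could feed a CephFS-specific prefetcher or, as the third coding task below suggests, be proposed for the generic Linux mm readahead code.
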
Test environment:

* Memory: 64 GB
* Network bandwidth: 14 Gbps+
* Disk: 90 MB/s
* 24 OSDs, 1 Linux kernel client (kernel 3.5.0)

|\2=. Iozone parameters|\8=. Readahead size (cells show throughput in KB/s)|
|File size (-s)|Processes (-t)|8M|16M|32M|64M|128M|256M|512M|1024M|
|32G|1|186163.09|297324.84|417127.47|555186.94|709306.69|817544.44|878351.69|942494.00|
|16G|2|342822.67|486913.06|621150.12|801558.44|887408.12|943116.62|888473.03|920779.59|
|8G|4|538862.75|713201.77|871556.72|957093.25|946758.56|872444.28|923143.67|911103.61|
|4G|8|825311.97|924942.88|900193.66|905163.27|925176.36|906694.10|861454.98|872700.76|
|2G|16|925981.14|880469.41|928986.88|902153.14|809175.16|800078.70|783691.74|751467.59|
|1G|32|916628.86|885516.79|879351.34|840179.34|772684.61|779912.65|835743.68|779849.47|
|2G|32|918887.91|872601.05|870850.50|871713.59|823688.11|769177.47|815892.45|652991.57|
|4G|32|886133.51|885311.87|906219.99|886668.75|863749.37|835691.18|800252.56|561352.22|
|8G|32|922218.38|887759.56|908235.33|899676.75|878892.64|700063.10|694623.93|513792.87|

h3. Work items

h4. Coding tasks

# More testing to determine the best prefetching window size under different numbers of OSDs
# Work out an algorithm to obtain the optimal prefetching window size
# Work out a strided read prefetching algorithm; maybe push it upstream to the Linux kernel mm?

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3