A standard framework for Ceph performance profiling with latency breakdown » History » Version 2

Jessica Mack, 07/06/2015 09:00 PM

h1. A standard framework for Ceph performance profiling with latency breakdown

h3. Summary

While working on Ceph performance evaluation and optimization, we found it difficult to troubleshoot bottlenecks, to identify the best tuning knobs among the many parameters, and to handle unexpected performance regressions between releases. We therefore propose a general, user-friendly framework for profiling and tuning Ceph performance, built on CBT, the existing LTTng support, and the additional components that still need to be added.

h3. Owners

* Chendi Xue (Intel)
* Jian Zhang (Intel)
* Mark Nelson (Red Hat)

h3. Interested Parties

* Andrew Shewmaker (Red Hat, UCSC)
* Danny Al-Gaaf (Deutsche Telekom)
* Name

h3. Current Status

From our point of view, the framework should comprise three parts:

(1) Deployment

* Use CBT to deploy the Ceph cluster.
* Provide a script to deploy LTTng and Zipkin.

(2) Workload generator

* Integrate COSBench into CBT, as a CBT object storage plugin, to serve as the object (RGW) workload generator.
* The currently proposed approach is:
** The end user deploys COSBench.
** A plugin model in CBT extracts the COSBench-related parameters and translates them into COSBench XML files.
** The plugin kicks off COSBench to generate the load and collects the resulting data.

(3) Analyzer

* Add more tracepoints.
* Leverage the blkin (LTTng + Zipkin) patch to do the Ceph latency breakdown.

h3. Detailed Description

(1) Workload generator
COSBench is an open source benchmarking tool developed by Intel to measure the performance of cloud object storage services.
We want to extend the benchmark part of CBT so that it can call COSBench to run the object (RGW) tests.
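As a rough illustration of the parameter-to-XML translation step, the sketch below (in Python, the language CBT is written in) turns a small parameter dictionary into a COSBench workload XML string. The dictionary keys, the function name and the sample values are hypothetical and are not part of CBT. The element and attribute names follow the COSBench workload file layout as we understand it and should be checked against the COSBench user guide.

```python
import xml.etree.ElementTree as ET


def build_cosbench_workload(params):
    """Translate a parameter dict (hypothetical schema) into COSBench workload XML."""
    workload = ET.Element(
        "workload",
        name=params["name"],
        description=params.get("description", ""),
    )
    # Authentication and storage back end used by every stage.
    ET.SubElement(workload, "auth",
                  type=params["auth_type"], config=params["auth_config"])
    ET.SubElement(workload, "storage",
                  type=params["storage_type"],
                  config=params.get("storage_config", ""))

    # A single "main" stage; only this stage is generated in this sketch.
    workflow = ET.SubElement(workload, "workflow")
    stage = ET.SubElement(workflow, "workstage", name="main")
    work = ET.SubElement(
        stage, "work",
        name="main",
        workers=str(params["workers"]),
        runtime=str(params["runtime"]),
    )
    for op in params["operations"]:
        ET.SubElement(work, "operation",
                      type=op["type"], ratio=str(op["ratio"]), config=op["config"])
    return ET.tostring(workload, encoding="unicode")


# Sample values are placeholders for illustration only.
example_params = {
    "name": "rgw-mixed",
    "description": "80/20 read/write against RGW",
    "auth_type": "swauth",
    "auth_config": "username=test:tester;password=testing;"
                   "url=http://rgw.example.com/auth/1.0",
    "storage_type": "swift",
    "workers": 8,
    "runtime": 300,
    "operations": [
        {"type": "read", "ratio": 80,
         "config": "containers=u(1,32);objects=u(1,50)"},
        {"type": "write", "ratio": 20,
         "config": "containers=u(1,32);objects=u(51,100);sizes=c(64)KB"},
    ],
}

if __name__ == "__main__":
    print(build_cosbench_workload(example_params))
```

A real workload would normally also need stages that create and clean up the containers and objects. They are left out here to keep the sketch short.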
 
(2) Analyzer
After investigating all the current latency breakdown methods in Ceph, we think the BLKIN (LTTng + Zipkin) patch is the most promising approach.

The patch was first posted by Marios Kogias in August: http://www.spinics.net/lists/ceph-devel/msg19890.html
It is currently owned by Andrew Shewmaker: https://github.com/agshew/ceph/tree/wip-blkin-v4

What we want to do:

# Help make the BLKIN patch stable, with fewer bugs and low overhead.
# Add support in CBT to run a performance test with BLKIN.
# Add more tracepoints, following the BLKIN method, to do the latency breakdown.

Existing problems:

# The BLKIN patch is not yet merged into Ceph master, so we need to rebase it onto each new release.
# The current BLKIN code covers only MOSDOP, MOSDSubop and MOSDRepop; it may be extended to other messages in the future.
# Zipkin can only show the latency breakdown for one specific request, so the web page lists thousands of individual requests. We need to summarize all requests and add a page that shows the average relative latency and a timeline.
# The client, primary and replica tracepoints cannot be linked or merged in the same view, because the system clocks of the different OSD servers are offset from one another. We have not yet found a good way to measure that time delta to microsecond precision.
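Since Zipkin shows one request at a time, a summary page has to fold many traces into per-stage averages. The following is a minimal sketch of that aggregation, assuming each trace has already been reduced to a mapping from stage name to duration in microseconds. The stage names, the input layout and the function are hypothetical and are not taken from BLKIN or Zipkin.

```python
from collections import defaultdict
from statistics import mean


def summarize_traces(traces):
    """Return mean duration and mean relative share per stage across all traces.

    traces: list of dicts, one per request, mapping stage name -> duration (us).
    """
    per_stage = defaultdict(list)
    for trace in traces:
        total = sum(trace.values())
        for stage, duration in trace.items():
            # Relative latency: this stage's share of the request's total time.
            share = duration / total if total else 0.0
            per_stage[stage].append((duration, share))
    return {
        stage: {
            "mean_us": mean(d for d, _ in samples),
            "mean_share": mean(s for _, s in samples),
            "count": len(samples),
        }
        for stage, samples in per_stage.items()
    }


# Hypothetical stage names and durations, for illustration only.
example_traces = [
    {"messenger_queue": 100, "osd_op_queue": 300, "journal_write": 600},
    {"messenger_queue": 200, "osd_op_queue": 200, "journal_write": 600},
]

if __name__ == "__main__":
    for stage, stats in summarize_traces(example_traces).items():
        print(stage, stats)
```

Averaging the per-request shares, rather than dividing summed durations, keeps a few very slow requests from dominating the relative breakdown. Which of the two is more useful here is an open design choice.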

h3. Work items

h4. Coding tasks

* Task 1: Make CBT support running COSBench tests.
* Task 2: Fix bugs in the BLKIN patch.
* Task 3: Provide a script to deploy LTTng and Zipkin.

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3