A standard framework for Ceph performance profiling with latency breakdown » History » Version 1

Jessica Mack, 07/06/2015 08:59 PM

h1. A standard framework for Ceph performance profiling with latency breakdown

h3. Summary

While working on Ceph performance evaluation and optimization, we found it difficult to troubleshoot bottlenecks, identify the best tuning knobs among many parameters, and handle unexpected performance regressions between releases. We therefore propose a general, user-friendly framework for profiling and tuning Ceph performance, built on CBT, the existing LTTng support, and other components that we need to add.

h3. Owners

* Chendi Xue (Intel)
* Jian Zhang (Intel)
* Mark Nelson (Red Hat)

h3. Interested Parties

* Andrew Shewmaker (Red Hat, UCSC)
* Danny Al-Gaaf (Deutsche Telekom)

h3. Current Status

From our point of view, the framework should comprise three parts of work:
(1) Deployment

* Use CBT to deploy the Ceph cluster.
* Add a script to deploy LTTng and Zipkin.

(2) Workload generator

Integrate COSBench into CBT as an object storage plugin, so that it acts as the object (RGW) workload generator. The currently proposed approach is:

* Let the end user deploy COSBench.
* Add a plugin model in CBT that extracts the COSBench-related parameters and translates them into COSBench XML files.
* Use the plugin to kick off COSBench, generate the load, and collect the data.

(3) Analyzer

* Add more tracepoints.
* Leverage the blkin (LTTng + Zipkin) patch to do the Ceph latency breakdown.

h3. Detailed Description

(1) Workload generator

COSBench is an open source benchmarking tool developed by Intel to measure the performance of cloud object storage services. We want to extend the benchmark part of CBT so that it can call COSBench to run the object (RGW) test.
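As a rough illustration of the parameter-to-XML translation step in the proposed approach, the sketch below turns a flat parameter dictionary into a COSBench-style workload XML string. The function name and every dictionary key are hypothetical, and the element and attribute names are modeled on COSBench's sample workload files, so they should be checked against the COSBench release in use. This is not existing CBT code.

```python
import xml.etree.ElementTree as ET


def build_cosbench_workload(params):
    """Translate a flat parameter dict into a COSBench-style workload XML string.

    All keys in `params` are hypothetical names chosen for this sketch.
    """
    ops = params["operations"]
    # Assumption: the operation ratios inside one work item should sum to 100.
    if sum(op["ratio"] for op in ops) != 100:
        raise ValueError("operation ratios must sum to 100")

    workload = ET.Element("workload", name=params["name"],
                          description=params.get("description", ""))
    ET.SubElement(workload, "auth", type=params["auth_type"],
                  config=params["auth_config"])
    ET.SubElement(workload, "storage", type=params["storage_type"],
                  config=params.get("storage_config", ""))
    workflow = ET.SubElement(workload, "workflow")
    stage = ET.SubElement(workflow, "workstage", name="main")
    work = ET.SubElement(stage, "work", name="main",
                         workers=str(params["workers"]),
                         runtime=str(params["runtime_sec"]))
    for op in ops:
        ET.SubElement(work, "operation", type=op["type"],
                      ratio=str(op["ratio"]), config=op["config"])
    return ET.tostring(workload, encoding="unicode")


# Example parameters; the endpoint and credentials are placeholders.
example = {
    "name": "rgw-mixed",
    "auth_type": "swauth",
    "auth_config": "username=test:tester;password=testing;"
                   "url=http://rgw.example.com/auth/1.0",
    "storage_type": "swift",
    "workers": 8,
    "runtime_sec": 300,
    "operations": [
        {"type": "read", "ratio": 80,
         "config": "containers=u(1,32);objects=u(1,50)"},
        {"type": "write", "ratio": 20,
         "config": "containers=u(1,32);objects=u(51,100);sizes=c(64)KB"},
    ],
}
xml_text = build_cosbench_workload(example)
```

A plugin along these lines could write `xml_text` to a file and hand it to the COSBench instance that the end user has deployed.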
 
(2) Analyzer

After investigating the latency breakdown methods currently available in Ceph, we think the BLKIN (LTTng + Zipkin) patch is the most promising approach.
The patch was first posted by Marios Kogias in August: http://www.spinics.net/lists/ceph-devel/msg19890.html
It is currently owned by Andrew Shewmaker: https://github.com/agshew/ceph/tree/wip-blkin-v4

What we want to do is:

* Help make the BLKIN patch less buggy, more stable, and lower in overhead.
* Add support in CBT for running a performance test with BLKIN.
* Add more tracepoints, following the BLKIN method, to do the "latency breakdown".
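One possible shape for the CBT support is a small wrapper that opens an LTTng session before the benchmark and tears it down afterwards. The sketch below only builds and issues standard @lttng@ command lines. The helper names are hypothetical, and the event names @zipkin:timestamp@ and @zipkin:keyval@ are an assumption about what the BLKIN patch emits, so they should be verified against the patch.

```python
import subprocess
from contextlib import contextmanager

# Assumption: BLKIN emits LTTng userspace events under the "zipkin" provider.
BLKIN_EVENTS = ["zipkin:timestamp", "zipkin:keyval"]


def lttng_start_commands(session, events=BLKIN_EVENTS):
    """Return the lttng commands that create and start a tracing session."""
    cmds = [["lttng", "create", session]]
    cmds += [["lttng", "enable-event", "--userspace", ev] for ev in events]
    cmds.append(["lttng", "start", session])
    return cmds


def lttng_stop_commands(session):
    """Return the lttng commands that stop and tear down the session."""
    return [["lttng", "stop", session], ["lttng", "destroy", session]]


@contextmanager
def traced(session, run=subprocess.check_call):
    """Run the enclosed block with an LTTng session active.

    `run` executes one command (an argv list); it is injectable so the
    sketch can be exercised without lttng installed.
    """
    for cmd in lttng_start_commands(session):
        run(cmd)
    try:
        yield
    finally:
        # Always stop tracing, even if the benchmark raises.
        for cmd in lttng_stop_commands(session):
            run(cmd)


# Dry run: record the commands instead of executing them.
recorded = []
with traced("cbt-blkin", run=recorded.append):
    pass  # the benchmark would run here
```

On a real cluster these commands would have to run on every node that hosts a traced daemon, which this sketch does not attempt.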

Existing problems:

* The BLKIN patch is not merged into Ceph master yet, so we need to rebase it onto each new release.
* The current BLKIN code only covers MOSDOp, MOSDSubOp, and MOSDRepOp; it may be extended to other messages in the future.
* Zipkin can only show the latency breakdown for one specific request at a time, so the web page lists thousands of individual requests. We need to summarize all requests and add a page that shows the average relative latency and a timeline.
* Client, primary, and replica tracepoints cannot be linked or merged in the same view, because the system clocks of the OSD servers are offset from one another and we have not yet found a good way to measure that time delta with microsecond precision.
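The summary page mentioned in the list above could start from an aggregation like the following sketch, which computes the mean duration of each stage and its share of the total latency across many requests. The input layout, the function name, and the stage names are all hypothetical, and the sketch assumes that the stages of one request do not overlap, so that their durations add up to the request latency.

```python
from collections import defaultdict


def summarize_spans(spans):
    """Aggregate per-stage latency over many traced requests.

    `spans` is an iterable of (trace_id, stage_name, duration_us) tuples.
    Returns {stage_name: {"count", "mean_us", "share"}} where "share" is the
    stage's fraction of the latency summed over all requests.
    """
    per_stage = defaultdict(list)
    grand_total = 0.0
    for _trace_id, stage, duration_us in spans:
        per_stage[stage].append(duration_us)
        grand_total += duration_us

    summary = {}
    for stage, durations in per_stage.items():
        total = sum(durations)
        summary[stage] = {
            "count": len(durations),
            "mean_us": total / len(durations),
            "share": total / grand_total if grand_total else 0.0,
        }
    return summary


# Two requests with made-up stage names and durations in microseconds.
spans = [
    ("t1", "journal_write", 400.0),
    ("t1", "replica_wait", 600.0),
    ("t2", "journal_write", 200.0),
    ("t2", "replica_wait", 800.0),
]
summary = summarize_spans(spans)
```

This only uses durations measured on a single node, so it sidesteps the cross-node clock offset problem rather than solving it.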

h3. Work items

h4. Coding tasks

# Task 1: Make CBT support running COSBench tests.
# Task 2: Fix bugs in the BLKIN patch.
# Task 3: Write a script to deploy LTTng and Zipkin.

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3