A standard framework for Ceph performance profiling with latency breakdown » History » Version 3

Jessica Mack, 07/06/2015 09:01 PM

h1. A standard framework for Ceph performance profiling with latency breakdown

h3. Summary

While working on Ceph performance evaluation and optimization, we found three things to be difficult: troubleshooting bottlenecks, identifying the best tuning knobs among many parameters, and handling unexpected performance regressions between releases. We therefore propose a general, user-friendly framework for profiling and tuning Ceph performance, based on CBT, the existing LTTng tracing, and other components that we need to add.

h3. Owners

* Chendi Xue (Intel)
* Jian Zhang (Intel)
* Mark Nelson (Red Hat)

h3. Interested Parties

* Andrew Shewmaker (Red Hat, UCSC)
* Danny Al-Gaaf (Deutsche Telekom)
* Name

h3. Current Status

From our point of view, the framework should comprise three parts of work:
(1) Deployment
# Use CBT to deploy the Ceph cluster.
# Provide a script to deploy LTTng and Zipkin.

(2) Workload generator
# Enable COSBench as a CBT object storage plugin, integrating COSBench into CBT as the object (rgw) workload generator.
# The currently proposed approach is:
## Let the end user deploy COSBench.
## Add a plugin model in CBT to extract the COSBench-related parameters and translate them into COSBench XML files.
## Use the plugin to kick off COSBench to generate the load and collect the data.

(3) Analyzer
# Add more tracepoints.
# Leverage the blkin (LTTng + Zipkin) patch to do the Ceph latency breakdown.

h3. Detailed Description

(1) Workload generator
COSBench is an open source benchmarking tool developed by Intel to measure the performance of cloud object storage services.
What we want to do is extend the benchmark part of CBT so that it can call COSBench to run the object (rgw) test.

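The translation of CBT-side parameters into a COSBench XML file could look roughly like the sketch below. It is only an illustration, not the CBT plugin itself: the function name @build_cosbench_workload@, the keys of the parameter dictionary, and the simplified XML layout (workload, storage, workflow, workstage, work, operation) are assumptions modeled loosely on COSBench workload files and would need to be checked against the COSBench documentation.

```python
import xml.etree.ElementTree as ET


def build_cosbench_workload(params):
    """Translate a flat parameter dict (hypothetical CBT-style settings)
    into a simplified COSBench-like workload XML string."""
    workload = ET.Element("workload", name=params["name"],
                          description=params.get("description", ""))
    # Storage backend, e.g. the S3 endpoint exposed by rgw (assumed keys).
    storage_config = "accesskey={};secretkey={};endpoint={}".format(
        params["access_key"], params["secret_key"], params["endpoint"])
    ET.SubElement(workload, "storage", type="s3", config=storage_config)

    workflow = ET.SubElement(workload, "workflow")
    stage = ET.SubElement(workflow, "workstage", name="main")
    work = ET.SubElement(stage, "work", name="main",
                         workers=str(params["workers"]),
                         runtime=str(params["runtime_sec"]))
    # One <operation> element per entry, e.g. {"type": "read", "ratio": 80}.
    for op in params["operations"]:
        ET.SubElement(work, "operation", type=op["type"],
                      ratio=str(op["ratio"]),
                      config="containers={};objects={}".format(
                          params["containers"], params["objects"]))
    return ET.tostring(workload, encoding="unicode")


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    example = {
        "name": "rgw-read-write",
        "access_key": "KEY",
        "secret_key": "SECRET",
        "endpoint": "http://rgw.example.com",
        "workers": 8,
        "runtime_sec": 300,
        "containers": "u(1,32)",
        "objects": "u(1,100)",
        "operations": [{"type": "read", "ratio": 80},
                       {"type": "write", "ratio": 20}],
    }
    print(build_cosbench_workload(example))
```

Keeping the translation in one small function would let the plugin write the result to a file and hand it to COSBench without the rest of CBT needing to know the XML format.
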
(2) Analyzer
After investigating all the current latency breakdown methods in Ceph, we think the BLKIN (LTTng + Zipkin) patch is the most promising approach.

The patch was first posted by Marios Kogias in August: http://www.spinics.net/lists/ceph-devel/msg19890.html
It is currently owned by Andrew Shewmaker: https://github.com/agshew/ceph/tree/wip-blkin-v4

What we want to do is:
1. Help make the BLKIN patch less buggy, more stable, and lower in overhead.
2. Add support in CBT to run a performance test with BLKIN.
3. Add more tracepoints, following the BLKIN method, to do the "latency breakdown".
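
To make the idea of a latency breakdown concrete, the sketch below turns the timestamped tracepoint events of a single request into per-stage latencies. It is a simplified illustration and does not use the BLKIN, LTTng, or Zipkin APIs; the function name @latency_breakdown@, the event tuples, and the tracepoint names are made up for the example.

```python
def latency_breakdown(events):
    """Given (tracepoint_name, timestamp_us) pairs for a single request,
    return a list of (from_point, to_point, elapsed_us), one per stage."""
    # Order events by timestamp so each stage is the gap between neighbours.
    ordered = sorted(events, key=lambda e: e[1])
    return [(a[0], b[0], b[1] - a[1])
            for a, b in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    # Hypothetical tracepoints for one write request (microseconds).
    trace = [
        ("client_send", 0),
        ("osd_msg_received", 180),
        ("osd_op_queued", 230),
        ("journal_written", 1430),
        ("reply_sent", 1500),
    ]
    for start, end, elapsed in latency_breakdown(trace):
        print(f"{start} -> {end}: {elapsed} us")
```

The more tracepoints a request passes through, the finer the stages become, which is why adding tracepoints directly improves the breakdown.
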
Existing problems:
1. The BLKIN patch is not merged into Ceph master yet, so we need to rebase it onto each new release.
2. The current BLKIN code only covers MOSDOp, MOSDSubOp, and MOSDRepOp; it may be extended to other messages in the future.
3. Zipkin can only show the latency breakdown for one specific request at a time, so the web page lists thousands of individual requests. We need to summarize all requests and add a page that shows the average relative latency and a timeline.
4. We can't link or merge the client, primary, and replica tracepoints in the same view, because the system clocks of the OSD servers differ and we have not yet found a good way to measure that time delta with microsecond accuracy.

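A summary over all traced requests, as opposed to Zipkin's one-request view, could be computed along the following lines. This is a minimal sketch, assuming each request has already been reduced to a mapping from stage name to latency; it does not query Zipkin, and the function name @average_relative_latency@ and the stage names are hypothetical.

```python
from collections import defaultdict


def average_relative_latency(requests):
    """requests: list of dicts mapping stage name -> latency in microseconds.
    Returns {stage: (mean_latency_us, mean_share_of_request_total)}.
    A stage missing from a request counts as zero for that request."""
    latency_sum = defaultdict(float)
    share_sum = defaultdict(float)
    for stages in requests:
        total = sum(stages.values())
        for name, latency in stages.items():
            latency_sum[name] += latency
            # Relative latency: fraction of this request spent in the stage.
            share_sum[name] += latency / total if total else 0.0
    n = len(requests)
    return {name: (latency_sum[name] / n, share_sum[name] / n)
            for name in latency_sum}


if __name__ == "__main__":
    # Hypothetical per-request stage latencies (microseconds).
    sample = [
        {"queue": 100, "journal": 300},
        {"queue": 300, "journal": 100},
    ]
    for stage, (mean_us, share) in average_relative_latency(sample).items():
        print(f"{stage}: {mean_us:.1f} us on average, {share:.0%} of request time")
```

Averaging the share per request, and not only the absolute latency, keeps a few very slow requests from dominating the relative picture.
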
h3. Work items

h4. Coding tasks

Task 1: Make CBT support running COSBench tests.
Task 2: Fix bugs in the BLKIN patch.
Task 3: Write a script to deploy LTTng and Zipkin.

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3