Live Performance Probes » History » Version 2

John Spray, 06/17/2015 06:56 PM

h1. Live Performance Probes (Blueprint)

Related ML post
Related branch:

h2. Summary

Sometimes, we would like to gather performance data for logical entities such as an RBD image, a CephFS file, or a client ID.  We might also like to see that data broken down by operation type or size, within the scope of operations on such a logical entity.

It is not practical to continuously collect enough data to support applying arbitrary queries retrospectively, because the number of records scales as the product of the cardinalities of the axes by which we might want to filter the data.  For example, if we maintained per-client per-rbd-image statistics, we would have to maintain n_clients*n_images records: too many to handle, given that performance data needs updating very frequently to reflect each incoming operation (or a substantial subset of them).

John Spray (Red Hat)

*Interested Parties*
Name (Affiliation)

*Current Status*

h2. Detail

It is proposed to create a mechanism to load performance probes into the Ceph servers (especially the OSD but also the MDS) that would allow administrators to gather live statistics according to the particular breakdown that they are interested in at a particular time.  This would allow the administrator to "switch on" a per-rbd-image (GROUP BY rbd_image) bandwidth statistic (SELECT bandwidth FROM) in order to generate an "rbd top" view of busy images.  Having identified a busy image, the administrator might enable a different probe that collects per-client bandwidth statistics for only that busy image (WHERE rbd_image=foo).

The resulting set of statistics may still be rather large when there are many clients or many images.  To handle that case, we add an option to collect only a limited-size set of results, where the records with the highest values are retained and the others are discarded.  For example, we might keep a moving average of the rate of operations over the last second, and cull statistics records that are not in the top 100 (SORT BY movingAverage(ops) LIMIT 100).  The actual number of results would have to fluctuate somewhat in order to give low-ranked records a chance to accumulate sufficient values to make it into the result set.
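The culling behaviour described above can be sketched in Python. This is only an illustration under stated assumptions: the class name @TopNTable@, the @slack@ parameter (the amount by which the table may exceed the limit before culling) and the use of an exponentially decayed counter as a stand-in for the moving average are all hypothetical and not part of the proposal.

```python
import math


class TopNTable:
    """Bounded table that keeps roughly the busiest `limit` keys (hypothetical sketch)."""

    def __init__(self, limit=100, slack=20, half_life=1.0):
        self.limit = limit          # target number of records after a cull
        self.slack = slack          # extra records tolerated before culling
        self.half_life = half_life  # seconds for a rate to decay by half
        self.rates = {}             # key -> (decayed op count, last update time)

    def _decay(self, seconds):
        return math.exp(-math.log(2) * seconds / self.half_life)

    def _current(self, key, now):
        rate, last = self.rates[key]
        return rate * self._decay(now - last)

    def record(self, key, now, ops=1):
        """Account for `ops` operations on `key` at time `now`."""
        rate, last = self.rates.get(key, (0.0, now))
        self.rates[key] = (rate * self._decay(now - last) + ops, now)
        # Only cull once the table has grown past limit + slack, so that new
        # or low-ranked keys get a chance to accumulate a meaningful rate.
        if len(self.rates) > self.limit + self.slack:
            for victim in self.top(now, n=len(self.rates))[self.limit:]:
                del self.rates[victim]

    def top(self, now, n=None):
        """Return keys ordered by decayed rate, busiest first."""
        ranked = sorted(self.rates, key=lambda k: self._current(k, now), reverse=True)
        return ranked[: n or self.limit]


table = TopNTable(limit=2, slack=1)
for t in range(10):
    table.record("busy", now=t * 0.1)
for name in ("a", "b", "c"):
    table.record(name, now=1.0)
print(table.top(now=1.0))
```

The @slack@ margin is one possible way to let the result count "fluctuate somewhat": records are culled back to the limit only after the table has overshot it.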

Gathering the conditions from the preceding paragraphs into a pseudocode probe, we might implement rbd top by loading a probe like this to the OSDs:

SELECT read_bytes, write_bytes, ops WHERE pool=rbd GROUP BY rbd_image SORT BY movingAverage(ops) LIMIT 100
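To make the semantics of the SELECT, WHERE and GROUP BY clauses concrete, here is a minimal Python sketch that applies such a probe to a stream of operations. The function @run_probe@ and the dictionary layout of an operation are assumptions made for illustration; the SORT BY and LIMIT clauses are deliberately left out.

```python
from collections import defaultdict


def run_probe(ops, select, where, group_by):
    """Accumulate the `select` counters per `group_by` value for matching ops."""
    table = defaultdict(lambda: dict.fromkeys(select, 0))
    for op in ops:
        # WHERE: skip operations that do not match every equality condition.
        if any(op.get(attr) != value for attr, value in where.items()):
            continue
        # GROUP BY: one row of counters per distinct attribute value.
        row = table[op[group_by]]
        # SELECT: add this operation's contribution to each counter.
        for field in select:
            row[field] += op.get(field, 0)
    return dict(table)


# Hypothetical operations, as the OSD might see them after attribute lookup.
ops = [
    {"pool": "rbd", "rbd_image": "vm1", "read_bytes": 4096, "write_bytes": 0, "ops": 1},
    {"pool": "rbd", "rbd_image": "vm1", "read_bytes": 0, "write_bytes": 8192, "ops": 1},
    {"pool": "rbd", "rbd_image": "vm2", "read_bytes": 512, "write_bytes": 0, "ops": 1},
    {"pool": "cephfs_data", "rbd_image": None, "read_bytes": 100, "write_bytes": 0, "ops": 1},
]

result = run_probe(
    ops,
    select=["read_bytes", "write_bytes", "ops"],
    where={"pool": "rbd"},
    group_by="rbd_image",
)
print(result)
```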

Periodic sampling: Applying these probes to every operation may be excessive when only coarse-grained information, such as an RBD image ranking, is needed from a high-throughput system.  In these cases we might add a sampling argument to the query, for example to apply the probe to only every 100th operation.
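A minimal sketch of periodic sampling, assuming a hypothetical @SampledProbe@ class with a @sample_every@ argument. Scaling the sampled totals back up by the sampling interval is one simple way to estimate the true totals; it is an assumption of this sketch rather than something the proposal specifies.

```python
class SampledProbe:
    """Apply the probe to only every `sample_every`-th operation (hypothetical)."""

    def __init__(self, sample_every=100):
        self.sample_every = sample_every
        self.seen = 0           # all operations offered to the probe
        self.sampled_ops = 0    # operations actually accounted for
        self.sampled_bytes = 0

    def observe(self, nbytes):
        self.seen += 1
        if self.seen % self.sample_every != 0:
            return  # cheap path: most operations only bump one counter
        self.sampled_ops += 1
        self.sampled_bytes += nbytes

    def estimated_bytes(self):
        # Scale the sampled total back up to approximate the real total.
        return self.sampled_bytes * self.sample_every


probe = SampledProbe(sample_every=100)
for _ in range(1000):
    probe.observe(4096)
print(probe.sampled_ops, probe.estimated_bytes())
```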

Fixed-count sampling: For *very* busy systems we might seek to limit the potential impact of queries by saying "only collect N samples" or "only run until time=foo" for a query.  That way one could load a query into the cluster and then collect the results at leisure, rather than having to hastily remove the query soon after loading it to avoid overloading the system.  This would be a kind of "snapshot" mode to gather e.g. 1 second or 10000 samples of something very detailed.
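The "snapshot" mode could look roughly like the following Python sketch. The class @BudgetedProbe@ and its @max_samples@ and @deadline@ parameters are hypothetical names; the point is only that the probe deactivates itself once its budget is spent, so nobody has to remove it in a hurry.

```python
import time


class BudgetedProbe:
    """Collect samples until a count or time budget is exhausted (hypothetical)."""

    def __init__(self, max_samples=None, deadline=None, clock=time.monotonic):
        self.max_samples = max_samples  # "only collect N samples"
        self.deadline = deadline        # "only run until time=foo" (clock units)
        self.clock = clock
        self.samples = []

    def active(self):
        if self.max_samples is not None and len(self.samples) >= self.max_samples:
            return False
        if self.deadline is not None and self.clock() >= self.deadline:
            return False
        return True

    def observe(self, op):
        """Record `op` if the budget allows; return whether it was recorded."""
        if not self.active():
            return False
        self.samples.append(op)
        return True


counted = BudgetedProbe(max_samples=3)
accepted = [counted.observe(i) for i in range(5)]

# A fake clock that is already past the deadline, to show the time limit.
expired = BudgetedProbe(deadline=5.0, clock=lambda: 10.0)
late = expired.observe("op")
print(accepted, late)
```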

h2. Details

h3. Distribution of queries to daemons

This might be done by inserting the queries into the OSDMap and exploiting the existing map-distribution mechanisms.  Since OSDMaps are usually updated using deltas, the queries would not generally cause bloat in the network traffic.  Alternatively, OSDs could be informed of queries on boot (the opposite of how OSDs report their metadata on boot), and updated subsequently with a special-purpose message, in order to avoid overloading the (critical) OSDMap with these (advisory) probe queries.

h3. Gathering of results from daemons

Each daemon would maintain a table of results for each query.  Initially, it would be straightforward to expose these results as JSON via an admin socket command, and delegate the task of gathering statistics across the cluster to some external tool.
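As an illustration of what such an external tool might do, the following Python sketch merges per-daemon result tables into cluster-wide totals. The JSON layout (a map from group key to counters) is an assumption, and the documents are inlined as strings rather than fetched from a real admin socket.

```python
import json
from collections import Counter


def merge_results(json_docs):
    """Sum the per-key counters reported by each daemon."""
    totals = {}
    for doc in json_docs:
        for key, stats in json.loads(doc).items():
            # Counter.update adds counts for matching field names.
            totals.setdefault(key, Counter()).update(stats)
    return {key: dict(stats) for key, stats in totals.items()}


# Hypothetical output from two OSDs for the same probe.
osd0 = '{"vm1": {"ops": 10, "read_bytes": 4096}, "vm2": {"ops": 1, "read_bytes": 512}}'
osd1 = '{"vm1": {"ops": 5, "read_bytes": 1024}}'

merged = merge_results([osd0, osd1])
print(merged)
```

Simple summation only works for additive counters such as ops and bytes; anything like a moving average would need a different reduction.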

Results could instead be sent via the mons, but that would be an undesirable additional load on them.  However, it might be interesting in the future to exploit the concept of a "passive mon" to act as a statistics server, and pass the daemons the address of a passive monitor (along with our probe queries) to act as a statistics gateway.

Ultimately, applying a reduction to the stats across the whole cluster might be the job of a specialised stats database like InfluxDB, but for purely live "top" views something much simpler will suffice.

h3. RGW/RBD/CephFS hooks

In the above example, we used a "GROUP BY rbd_image" argument.  Of course, an OSD doesn't know anything about RBD.  In order to apply this kind of high-level argument in a probe query, there needs to be a plugin mechanism, similar to object classes.  Plugins would declare that they could generate certain named attributes from an operation, and then the OSD would call those plugins needed to gather the attributes used in a query.  For example, an rbd plugin would say "I can give you rbd_image", and then the OSD would recognise that "rbd_image" was needed for the query, and for each operation say to the plugin "Here's an operation, tell me the value of rbd_image for this operation".  The plugin would derive that from the object ID, or indicate that the attribute is unavailable (i.e. this isn't an RBD operation).
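The plugin handshake described above might be sketched as follows. Everything here is hypothetical: the class names, the @provides@ declaration, and in particular the rule for deriving an image identifier from the object ID, which assumes object names of the form @rbd_data.<id>.<object number>@ and returns the identifier rather than a human-readable image name.

```python
class RbdPlugin:
    """Declares that it can derive `rbd_image` from an operation (hypothetical)."""

    provides = ("rbd_image",)

    def extract(self, name, op):
        prefix = "rbd_data."
        object_id = op["object_id"]
        if not object_id.startswith(prefix):
            return None  # attribute unavailable: not an RBD data object
        # Assumed layout: rbd_data.<image id>.<object number>
        return object_id[len(prefix):].split(".", 1)[0]


class PluginRegistry:
    """Maps attribute names to the plugin that can provide them."""

    def __init__(self):
        self.by_attr = {}

    def register(self, plugin):
        for name in plugin.provides:
            self.by_attr[name] = plugin

    def resolve(self, needed, op):
        """Ask the relevant plugins for each attribute a query needs."""
        values = {}
        for name in needed:
            plugin = self.by_attr.get(name)
            values[name] = plugin.extract(name, op) if plugin else None
        return values


registry = PluginRegistry()
registry.register(RbdPlugin())

rbd_op = {"object_id": "rbd_data.abc123.0000000000000001"}
other_op = {"object_id": "10000000000.00000000"}
print(registry.resolve(["rbd_image"], rbd_op))
print(registry.resolve(["rbd_image"], other_op))
```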

h3. The query language

It may not be necessary to implement a human-readable SQL-like grammar.  Since these queries would usually be generated by tools (e.g. rbd top, or a GUI), they could have the structure outlined above, but be written as JSON to avoid implementing a parser.
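One possible JSON encoding of the rbd top probe is sketched here in Python. The field names (@select@, @where@, @group_by@, @sort_by@, @limit@) and the small @validate@ helper are assumptions for illustration; no schema has been defined.

```python
import json

# Hypothetical JSON form of:
#   SELECT read_bytes, write_bytes, ops WHERE pool=rbd
#   GROUP BY rbd_image SORT BY movingAverage(ops) LIMIT 100
rbd_top = {
    "select": ["read_bytes", "write_bytes", "ops"],
    "where": [{"attr": "pool", "op": "EQ", "value": "rbd"}],
    "group_by": ["rbd_image"],
    "sort_by": {"function": "movingAverage", "args": ["ops"]},
    "limit": 100,
}


def validate(query):
    """Minimal structural check a daemon might run before accepting a query."""
    missing = {"select", "group_by"} - query.keys()
    if missing:
        raise ValueError("query is missing fields: %s" % sorted(missing))
    return query


# A tool would serialise the query; the daemon would decode and validate it.
encoded = json.dumps(rbd_top, sort_keys=True)
decoded = validate(json.loads(encoded))
print(encoded)
```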

h3. More examples

GROUP BY/WHERE (OSD): object_id EQ, object_id STARTSWITH, cephfs_ino, pool, rbd_image, rgw_object, client_id, rados_op_type
GROUP BY/WHERE (MDS): ino, parent_ino, metadata_op_type, dname

SELECT (OSD): ops, read_bytes, write_bytes
SELECT (MDS): ops, cap_issues

See which folders in the top level of a CephFS filesystem are busy with metadata operations:
SELECT ops WHERE parent_ino=1 GROUP BY dname

Having identified a folder, see what kind of ops are going on in there:
SELECT ops WHERE parent_ino=123 GROUP BY metadata_op_type