Live Performance Probes » History » Version 2
John Spray, 06/17/2015 06:56 PM
1 | 1 | John Spray | h1. Live Performance Probes (Blueprint) |
---|---|---|---|
2 | 1 | John Spray | |
3 | 1 | John Spray | Related ML post https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg24131.html |
4 | 1 | John Spray | Related branch: https://github.com/ceph/ceph/tree/wip-live-query |
5 | 1 | John Spray | |
6 | 2 | John Spray | h2. Summary |
7 | 1 | John Spray | |
8 | 1 | John Spray | Sometimes, we would like to gather performance data for logical entities like an RBD image, CephFS file, or a client ID. We might also like to see data broken down by operation type or size, within the scope of operations on such a logical entity. |
9 | 1 | John Spray | |
10 | 1 | John Spray | It is not practical to continuously collect enough data to support arbitrary retrospective queries, because the number of records scales as the product of the cardinalities of the axes by which we might want to filter the data. For example, if we maintained per-client per-rbd-image statistics, we would have to maintain n_clients*n_images records: too many to handle, given that performance data needs updating very frequently to reflect each incoming operation (or a substantial subset of them). |
11 | 1 | John Spray | |
12 | 2 | John Spray | *Owners* |
13 | 2 | John Spray | John Spray (Red Hat) |
14 | 2 | John Spray | |
15 | 2 | John Spray | *Interested Parties* |
16 | 2 | John Spray | Name (Affiliation) |
17 | 2 | John Spray | |
18 | 2 | John Spray | *Current Status* |
19 | 2 | John Spray | Design |
20 | 2 | John Spray | |
21 | 2 | John Spray | h2. Detail |
22 | 1 | John Spray | |
23 | 1 | John Spray | It is proposed to create a mechanism to load performance probes into the Ceph servers (especially the OSD but also the MDS) that would allow administrators to gather live statistics according to the particular breakdown that they are interested in at a particular time. This would allow the administrator to "switch on" a per-rbd-image (GROUP BY rbd_image) bandwidth statistic (SELECT bandwidth FROM) in order to generate an "rbd top" view of busy images. Having identified a busy image, the administrator might enable a different probe that collects per-client bandwidth statistics for only that busy image (WHERE rbd_image=foo). |
24 | 1 | John Spray | |
25 | 1 | John Spray | The resulting set of statistics may still be rather large when there are many clients or many images. To handle that case, we add an option to collect only a limited-size set of results, where the records with the highest values are retained and the others are discarded. For example, we might keep a moving average of the rate of operations over the last second, and cull statistics records that are not in the top 100 (SORT BY movingAverage(ops) LIMIT 100). The actual number of records held would have to fluctuate somewhat above the limit, in order to give low-ranked records a chance to accumulate sufficient values to make it into the result set. |
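
The culling behaviour described above can be sketched as follows. This is a minimal illustration, not code from the wip-live-query branch: the class @TopNTable@, its @slack@ and @alpha@ parameters, and the use of an exponentially weighted moving average (rather than a strict one-second window) are all assumptions made for the example.

```python
import heapq


class TopNTable:
    """Bounded per-key statistics table (hypothetical, not Ceph code)."""

    def __init__(self, limit, slack=2.0, alpha=0.2):
        self.limit = limit                   # LIMIT from the probe query
        self.max_size = int(limit * slack)   # table may grow this far before culling
        self.alpha = alpha                   # smoothing factor for the moving average
        self.rate = {}                       # key -> smoothed ops per tick
        self.pending = {}                    # key -> ops seen in the current tick

    def record(self, key, ops=1):
        """Called for each operation that matches the probe."""
        self.pending[key] = self.pending.get(key, 0) + ops

    def tick(self):
        """Fold the current interval into the averages, then cull if oversized."""
        for key in set(self.rate) | set(self.pending):
            sample = self.pending.get(key, 0)
            old = self.rate.get(key, 0.0)
            self.rate[key] = old + self.alpha * (sample - old)
        self.pending.clear()
        if len(self.rate) > self.max_size:
            keep = heapq.nlargest(self.limit, self.rate.items(), key=lambda kv: kv[1])
            self.rate = dict(keep)

    def top(self):
        """Return up to `limit` (key, rate) pairs, busiest first."""
        return heapq.nlargest(self.limit, self.rate.items(), key=lambda kv: kv[1])
```

Letting the table grow to @limit * slack@ entries before culling is one way to give newly seen keys a few intervals in which to accumulate a rate before they compete with established ones.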
26 | 1 | John Spray | |
27 | 1 | John Spray | Gathering the conditions from the preceding paragraphs into a pseudocode probe, we might implement rbd top by loading a probe like this onto the OSDs: |
28 | 1 | John Spray | <pre> |
29 | 1 | John Spray | SELECT read_bytes, write_bytes, ops WHERE pool=rbd GROUP BY rbd_image SORT BY movingAverage(ops) LIMIT 100 |
30 | 1 | John Spray | </pre> |
31 | 1 | John Spray | |
32 | 1 | John Spray | Periodic sampling: Applying these probes for every operation may be excessive when coarse-grained information like rbd image ranking is needed from a high-throughput system: in these cases we might apply an additional sampling argument to the query, to e.g. apply the probe to only every 100th operation. |
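
A minimal sketch of periodic sampling, assuming a simple "every Nth operation" counter. The class name @SampledCounter@ and the idea of scaling sampled counts back up by the sampling interval are illustrative assumptions, not part of the proposal.

```python
class SampledCounter:
    """Apply a probe to every Nth operation only (hypothetical sketch)."""

    def __init__(self, every=100):
        if every < 1:
            raise ValueError("sampling interval must be >= 1")
        self.every = every
        self.seen = 0            # all operations observed
        self.sampled_ops = 0     # operations the probe was applied to
        self.sampled_bytes = 0

    def on_op(self, nbytes):
        """Return True if this operation was sampled."""
        self.seen += 1
        if self.seen % self.every != 0:
            return False         # cheap path: one increment and one modulo
        self.sampled_ops += 1
        self.sampled_bytes += nbytes
        return True

    def estimate(self):
        """Scale the sampled counts up to estimate the true totals."""
        return {
            "ops": self.sampled_ops * self.every,
            "bytes": self.sampled_bytes * self.every,
        }
```

The estimate is only as good as the sample: it is adequate for ranking busy images, but a strictly periodic sampler can be biased if the workload itself is periodic.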
33 | 1 | John Spray | |
34 | 1 | John Spray | Fixed-count sampling: For *very* busy systems we might seek to limit the potential impact of queries by saying "only collect N samples" or "only run until time=foo" for a query. That way one could fire off a query by loading it into the cluster, and then collect the results at leisure, rather than having to hastily remove the query soon after loading it to avoid overloading the system. This would be a kind of "snapshot" mode to gather e.g. 1 second or 10000 samples of something very detailed. |
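
The "only collect N samples" and "only run until time=foo" limits could be enforced by a small budget object attached to each loaded query. The sketch below assumes hypothetical names (@ProbeBudget@, @try_consume@) and takes an injectable clock so that it can be exercised without waiting.

```python
import time


class ProbeBudget:
    """Self-limiting budget for a probe query (hypothetical sketch)."""

    def __init__(self, max_samples=None, deadline=None, clock=time.monotonic):
        self.max_samples = max_samples   # "only collect N samples"
        self.deadline = deadline         # "only run until time=foo", on `clock`
        self.clock = clock
        self.samples = 0

    def active(self):
        """True while the probe is still allowed to collect."""
        if self.max_samples is not None and self.samples >= self.max_samples:
            return False
        if self.deadline is not None and self.clock() >= self.deadline:
            return False
        return True

    def try_consume(self):
        """Account for one sample; return False once the budget is spent."""
        if not self.active():
            return False
        self.samples += 1
        return True
```

Once @active()@ returns False the daemon can stop evaluating the probe entirely while keeping its result table available for later collection.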
35 | 1 | John Spray | |
36 | 1 | John Spray | |
37 | 1 | John Spray | h2. Details |
38 | 1 | John Spray | |
39 | 1 | John Spray | h3. Distribution of queries to daemons |
40 | 1 | John Spray | |
41 | 1 | John Spray | This might be done by inserting the queries into the OSDMap and exploiting the existing map-distribution mechanisms. Since OSDMaps are usually updated using deltas, the queries would not generally cause bloat in the network traffic. Alternatively, OSDs could be informed of queries on boot (the opposite of how OSDs report their metadata on boot), and updated subsequently with a special-purpose message, in order to avoid overloading the (critical) OSDMap with these (advisory) probe queries. |
42 | 1 | John Spray | |
43 | 1 | John Spray | h3. Gathering of results from daemons |
44 | 1 | John Spray | |
45 | 1 | John Spray | Each daemon would maintain a table of results for each query. Initially, it would be straightforward to expose these results as JSON via an admin socket command, and delegate the task of gathering statistics across the cluster to some external tool. |
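
As an illustration of what such an external tool might do, the sketch below merges per-daemon result tables into a cluster-wide ranking. The JSON layout (a mapping from group key to a dict of counters) and the function name @merge_results@ are assumptions; how the JSON is fetched from each admin socket is deliberately left out.

```python
import json
from collections import Counter


def merge_results(per_daemon_json, sort_key="ops", limit=10):
    """Sum per-daemon tables and return the busiest `limit` keys.

    per_daemon_json: iterable of JSON strings, one per daemon, each shaped
    like {"<group key>": {"ops": 5, "read_bytes": 100}, ...} (assumed layout).
    """
    totals = {}
    for doc in per_daemon_json:
        for key, stats in json.loads(doc).items():
            # Counter.update adds the values for matching counter names.
            totals.setdefault(key, Counter()).update(stats)
    ranked = sorted(totals.items(), key=lambda kv: kv[1][sort_key], reverse=True)
    return [(key, dict(stats)) for key, stats in ranked[:limit]]
```

If each daemon only reports its own top N, the merged ranking is approximate: a key that is moderately busy on many daemons may have been culled on each of them.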
46 | 1 | John Spray | |
47 | 1 | John Spray | One could send them via the mon, but that would be an undesirable additional load on the mons. However, it might be interesting in the future to exploit the concept of a "passive mon" to act as a statistics server, and pass the daemons the address of a passive monitor (along with our probe queries) to act as a statistics gateway. |
48 | 1 | John Spray | |
49 | 1 | John Spray | Ultimately the job of applying a reduction to the stats across the whole cluster might be the job of a specialised stats database like InfluxDB, but for purely live "top" views something much simpler will suffice. |
50 | 1 | John Spray | |
51 | 1 | John Spray | h3. RGW/RBD/CephFS hooks |
52 | 1 | John Spray | |
53 | 1 | John Spray | In the above example, we used a "GROUP BY rbd_image" argument. Of course, an OSD doesn't know anything about RBD. In order to apply this kind of high-level argument in a probe query, there needs to be a plugin mechanism, similar to object classes. Plugins would declare that they could generate certain named attributes from an operation, and then the OSD would call the plugins needed to gather the attributes used in a query. For example, an rbd plugin would say "I can give you rbd_image", and then the OSD would recognise that "rbd_image" was needed for the query, and for each operation say to the plugin "Here's an operation, tell me the value of rbd_image for this operation". The plugin would derive that from the object ID, or indicate that the attribute is unavailable (i.e. this isn't an RBD operation). |
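
The declare-then-resolve handshake could look roughly like this. It is a sketch in Python for brevity (real OSD plugins would presumably be native code), and every name in it is hypothetical. The object-name pattern is a simplifying assumption about RBD data object naming, and it yields an image identifier rather than a human-readable image name.

```python
import re


class AttributePlugin:
    """Base class: a plugin declares which named attributes it can derive."""
    provides = ()

    def extract(self, attr, op):
        raise NotImplementedError


class RbdPlugin(AttributePlugin):
    provides = ("rbd_image",)
    # ASSUMPTION: data objects are named "rbd_data.<image id>.<object number>".
    _pattern = re.compile(r"^rbd_data\.([0-9a-f]+)\.[0-9a-f]+$")

    def extract(self, attr, op):
        match = self._pattern.match(op["object_id"])
        # None means "attribute unavailable", i.e. not an RBD operation.
        return match.group(1) if match else None


class PluginRegistry:
    """Maps attribute names used in a query to the plugin that provides them."""

    def __init__(self, plugins):
        self.by_attr = {attr: p for p in plugins for attr in p.provides}

    def resolve(self, needed_attrs, op):
        """Ask the relevant plugin for each attribute the query needs."""
        values = {}
        for attr in needed_attrs:
            plugin = self.by_attr.get(attr)
            values[attr] = plugin.extract(attr, op) if plugin else None
        return values
```

Only the attributes named in loaded queries are resolved, so operations pay no plugin cost when no probe needs a high-level attribute.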
54 | 1 | John Spray | |
55 | 1 | John Spray | h3. The query language |
56 | 1 | John Spray | |
57 | 1 | John Spray | It may not be necessary to implement a human-readable SQL-like grammar. Since these queries would usually be generated by tools (e.g. rbd top, or a GUI), they could have the structure outlined above, but be written as JSON to avoid implementing a parser. |
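
For illustration, the rbd top probe from the Detail section might be encoded as below. The field names (@select@, @where@, @group_by@, @sort_by@, @limit@) and the helper @matches@ are assumptions for this sketch; no JSON schema has been defined.

```python
import json

# Hypothetical JSON encoding of:
#   SELECT read_bytes, write_bytes, ops WHERE pool=rbd
#   GROUP BY rbd_image SORT BY movingAverage(ops) LIMIT 100
query = {
    "select": ["read_bytes", "write_bytes", "ops"],
    "where": [{"attr": "pool", "op": "EQ", "value": "rbd"}],
    "group_by": ["rbd_image"],
    "sort_by": {"function": "movingAverage", "field": "ops"},
    "limit": 100,
}

# Comparison operators a daemon would need to understand (assumed set).
OPERATORS = {
    "EQ": lambda actual, expected: actual == expected,
    "STARTSWITH": lambda actual, expected: (
        isinstance(actual, str) and actual.startswith(expected)
    ),
}


def matches(query, attrs):
    """True if an operation's attributes satisfy every WHERE condition."""
    return all(
        OPERATORS[cond["op"]](attrs.get(cond["attr"]), cond["value"])
        for cond in query.get("where", [])
    )


wire = json.dumps(query)       # what a tool would send to the cluster
loaded = json.loads(wire)      # what a daemon would evaluate
```

Because the structure is plain data, a daemon needs only a JSON decoder and a table of operators, not a grammar.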
58 | 1 | John Spray | |
59 | 1 | John Spray | h3. More examples |
60 | 1 | John Spray | |
61 | 1 | John Spray | GROUP BY/WHERE (OSD): object_id EQ, object_id STARTSWITH, cephfs_ino, pool, rbd_image, rgw_object, client_id, rados_op_type |
62 | 1 | John Spray | GROUP BY/WHERE (MDS): ino, parent_ino, metadata_op_type, dname |
63 | 1 | John Spray | |
64 | 1 | John Spray | SELECT (OSD): ops, read_bytes, write_bytes |
65 | 1 | John Spray | SELECT (MDS): ops, cap_issues |
66 | 1 | John Spray | |
67 | 1 | John Spray | See which folders in the top level of a CephFS filesystem are busy with metadata operations: |
68 | 1 | John Spray | SELECT ops WHERE parent_ino=1 GROUP BY dname |
69 | 1 | John Spray | |
70 | 1 | John Spray | Having identified a folder, see what kinds of ops are going on in there: |
71 | 1 | John Spray | SELECT ops WHERE parent_ino=123 GROUP BY metadata_op_type |