John Spray, 06/17/2015 06:56 PM

Live Performance Probes (Blueprint)

Related ML post
Related branch:


Sometimes, we would like to gather performance data for logical entities like an RBD image, CephFS file, or a client ID. We might also like to see data broken down by operation type or size, within the scope of operations on such a logical entity.

It is not practical to continuously collect enough data to support arbitrary retrospective queries, because the number of records scales as the product of the cardinalities of the axes by which we might want to filter the data. For example, if we maintained per-client, per-RBD-image statistics, we would have to maintain n_clients*n_images records: too many to handle, given that performance data needs updating very frequently to reflect each incoming operation (or a substantial subset of them).

John Spray (Red Hat)

Interested Parties
Name (Affiliation)

Current Status


It is proposed to create a mechanism to load performance probes into the Ceph servers (especially the OSD, but also the MDS) that would allow administrators to gather live statistics according to the particular breakdown that they are interested in at a particular time. This would allow the administrator to "switch on" a per-RBD-image (GROUP BY rbd_image) bandwidth statistic (SELECT bandwidth FROM) in order to generate an "rbd top" view of busy images. Having identified a busy image, the administrator might enable a different probe that collects per-client bandwidth statistics for only that busy image (WHERE rbd_image=foo).

The resulting set of statistics may still be rather large when there are many clients or many images. To handle that case, we add an option to collect only a limited-size set of results, where the records with the highest values are retained and the others are discarded. For example, we might keep a moving average of the rate of operations over the last second, and cull statistics records that are not in the top 100 (SORT BY movingAverage(ops) LIMIT 100). The actual number of records held would have to fluctuate somewhat above the limit, in order to give low-ranked records a chance to accumulate sufficient values to make it into the result set.
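The bounded result set described above can be sketched as follows. This is a minimal illustration in Python (the daemons themselves are C++), not a proposed implementation: the class name TopNTable, the slack factor, and the use of an exponentially decayed average in place of a true one-second moving average are all assumptions made for the example.

```python
class TopNTable:
    """Keeps roughly the `limit` busiest keys, ranked by a decayed op rate."""

    def __init__(self, limit=100, slack=2.0, alpha=0.5):
        self.limit = limit
        # ASSUMPTION: allow up to limit*slack records between culls, so that
        # new, low-ranked keys have a chance to accumulate a score.
        self.max_records = int(limit * slack)
        self.alpha = alpha      # weight given to the newest interval
        self.current = {}       # key -> ops counted in the open interval
        self.average = {}       # key -> decayed average of ops per interval

    def record(self, key, ops=1):
        """Count ops against a key (e.g. an RBD image) in the open interval."""
        self.current[key] = self.current.get(key, 0) + ops

    def tick(self):
        """Close the interval: update every average, then cull if oversized."""
        for key in set(self.current) | set(self.average):
            old = self.average.get(key, 0.0)
            new = self.current.get(key, 0)
            self.average[key] = (1 - self.alpha) * old + self.alpha * new
        self.current.clear()
        if len(self.average) > self.max_records:
            keep = sorted(self.average, key=self.average.get, reverse=True)
            self.average = {k: self.average[k] for k in keep[:self.limit]}

    def top(self):
        """Return up to `limit` (key, average) pairs, busiest first."""
        ranked = sorted(self.average.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:self.limit]
```

A daemon using this sketch would call record() once per matching operation and tick() once per interval (for example, once a second). Culling only when the table exceeds limit*slack is what lets the record count fluctuate above the limit.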

Gathering the conditions from the preceding paragraphs into a pseudocode probe, we might implement rbd top by loading a probe like this onto the OSDs:

SELECT read_bytes, write_bytes, ops WHERE pool=rbd GROUP BY rbd_image SORT BY movingAverage(ops) LIMIT 100

Periodic sampling: Applying these probes to every operation may be excessive when only coarse-grained information, like RBD image ranking, is needed from a high-throughput system. In these cases we might add a sampling argument to the query, e.g. to apply the probe to only every 100th operation.
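As a rough sketch of periodic sampling, a probe could keep a per-query counter and act only on every Nth operation. The names PeriodicSampler and scaled are hypothetical. Scaling a sampled count back up gives only an estimate of the true total.

```python
class PeriodicSampler:
    """Passes every `period`-th operation to the probe and skips the rest."""

    def __init__(self, period=100):
        if period < 1:
            raise ValueError("period must be >= 1")
        self.period = period
        self.seen = 0

    def should_sample(self):
        """Call once per operation; True on every `period`-th call."""
        self.seen += 1
        return self.seen % self.period == 0


def scaled(count, period):
    """Estimate the unsampled total from a sampled count."""
    return count * period


# Usage: of 1000 operations, only every 100th reaches the probe.
sampler = PeriodicSampler(period=100)
sampled = sum(1 for _ in range(1000) if sampler.should_sample())
estimate = scaled(sampled, sampler.period)
```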

Fixed-count sampling: For very busy systems we might seek to limit the potential impact of queries by saying "only collect N samples" or "only run until time=foo" for a query. That way one could fire off a query by loading it into the cluster, and then collect the results at leisure, rather than having to hastily remove the query soon after loading it to avoid overloading the cluster. This would be a kind of "snapshot" mode to gather e.g. 1 second or 10000 samples of something very detailed.
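The "only collect N samples" and "only run until time=foo" limits could be combined into a single budget object that the probe consults before recording anything. The sketch below assumes a hypothetical ProbeBudget class. The clock is injectable only so that the behaviour can be tested without sleeping.

```python
import time


class ProbeBudget:
    """Stops a probe after max_samples samples or once a deadline passes."""

    def __init__(self, max_samples=None, duration=None, clock=time.monotonic):
        self.max_samples = max_samples
        self.clock = clock
        # A duration of None means the probe has no time limit.
        self.deadline = None if duration is None else clock() + duration
        self.taken = 0

    def expired(self):
        """True once either limit has been reached."""
        if self.max_samples is not None and self.taken >= self.max_samples:
            return True
        return self.deadline is not None and self.clock() >= self.deadline

    def try_take(self):
        """Return True if the caller may record one more sample."""
        if self.expired():
            return False
        self.taken += 1
        return True
```

Once expired() returns True, the daemon can stop evaluating the query altogether while keeping its results available for later collection.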


Distribution of queries to daemons

This might be done by inserting the queries into the OSDMap and exploiting the existing map-distribution mechanisms. Since OSDMaps are usually updated using deltas, the queries would not generally cause bloat in the network traffic. Alternatively, OSDs could be informed of queries on boot (the reverse of how OSDs report their metadata on boot), and updated subsequently with a special-purpose message, in order to avoid overloading the (critical) OSDMap with these (advisory) probe queries.

Gathering of results from daemons

Each daemon would maintain a table of results for each query. Initially, it would be straightforward to expose these results as JSON via an admin socket command, and delegate the task of gathering statistics across the cluster to some external tool.
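A per-query result table with a JSON dump might look like the sketch below. The class name ResultTable and the JSON layout are illustrative assumptions. No existing admin socket command or output format is implied.

```python
import json


class ResultTable:
    """Accumulates the SELECTed statistics for one query, keyed by group."""

    def __init__(self, query_id, select):
        self.query_id = query_id
        self.select = list(select)   # e.g. ["ops", "read_bytes"]
        self.rows = {}               # group key -> {statistic: total}

    def add(self, group_key, **values):
        """Add one operation's values to the row for its group."""
        row = self.rows.setdefault(group_key, dict.fromkeys(self.select, 0))
        for name in self.select:
            row[name] += values.get(name, 0)

    def dump(self):
        """Serialise the table as JSON for an external gathering tool."""
        doc = {"query_id": self.query_id, "results": self.rows}
        return json.dumps(doc, sort_keys=True)
```

An external tool would then fetch this JSON from each daemon and sum the rows that share a group key to produce the cluster-wide view.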

One could send them via the mon, but that would be an undesirable additional load on the mons. However, it might be interesting in the future to exploit the concept of a "passive mon" to act as a statistics server, and pass the daemons the address of a passive monitor (along with our probe queries) to act as a statistics gateway.

Ultimately, applying a reduction to the stats across the whole cluster might be the job of a specialised stats database like InfluxDB, but for purely live "top" views something much simpler will suffice.

RGW/RBD/CephFS hooks

In the above example, we used a "GROUP BY rbd_image" argument. Of course, an OSD doesn't know anything about RBD. In order to apply this kind of high-level argument in a probe query, there needs to be a plugin mechanism, similar to object classes. Plugins would declare that they can generate certain named attributes from an operation, and the OSD would then call whichever plugins are needed to gather the attributes used in a query. For example, an rbd plugin would say "I can give you rbd_image", and then the OSD would recognise that "rbd_image" was needed for the query, and for each operation say to the plugin "Here's an operation, tell me the value of rbd_image for this operation". The plugin would derive that from the object ID, or indicate that the attribute is unavailable (i.e. this isn't an RBD operation).
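The plugin handshake described above could be sketched as follows. Everything here is hypothetical: the AttributePlugin interface, the PluginRegistry, the dict representation of an operation, and in particular the object naming scheme "rbd_data.<image>.<object number>", which is assumed purely so the example has something to parse.

```python
class AttributePlugin:
    """A plugin lists the attributes it can derive from an operation."""
    provides = ()

    def get(self, attr, op):
        """Return the attribute value, or None if it does not apply to op."""
        raise NotImplementedError


class RbdPlugin(AttributePlugin):
    provides = ("rbd_image",)

    def get(self, attr, op):
        # ASSUMPTION: object IDs look like "rbd_data.<image>.<object number>".
        parts = op.get("object_id", "").split(".")
        if len(parts) == 3 and parts[0] == "rbd_data":
            return parts[1]
        return None  # not an RBD operation


class PluginRegistry:
    def __init__(self):
        self.by_attr = {}  # attribute name -> plugin that provides it

    def register(self, plugin):
        for attr in plugin.provides:
            self.by_attr[attr] = plugin

    def resolve(self, attrs, op):
        """Return {attr: value} for op, or None if any attr is unavailable."""
        values = {}
        for attr in attrs:
            if attr in op:  # built-in attribute such as pool
                values[attr] = op[attr]
                continue
            plugin = self.by_attr.get(attr)
            value = plugin.get(attr, op) if plugin else None
            if value is None:
                return None
            values[attr] = value
        return values
```

With this shape, the OSD only invokes a plugin when a loaded query actually names one of its attributes, so operations pay nothing for plugins that no probe is using.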

The query language

It may not be necessary to implement a human-readable SQL-like grammar. Since these queries would usually be generated by tools (e.g. rbd top, or a GUI), they could keep the structure outlined above but be written as JSON, to avoid implementing a parser.
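To illustrate, the rbd top probe might be encoded as the JSON structure below, with a small loader that validates it and a function that evaluates the WHERE clause. The key names (select, where, group_by, sort_by, limit, sample_every) are assumptions made for this sketch. Only the EQ and STARTSWITH operators are taken from the examples on this page.

```python
import json

# The "rbd top" probe expressed as data rather than SQL-like text.
RBD_TOP = {
    "select": ["read_bytes", "write_bytes", "ops"],
    "where": [{"attr": "pool", "op": "EQ", "value": "rbd"}],
    "group_by": ["rbd_image"],
    "sort_by": {"func": "movingAverage", "arg": "ops"},
    "limit": 100,
}

ALLOWED_KEYS = {"select", "where", "group_by", "sort_by", "limit", "sample_every"}
WHERE_OPS = {"EQ", "STARTSWITH"}


def load_query(text):
    """Parse a JSON query and reject anything this sketch does not understand."""
    query = json.loads(text)
    unknown = set(query) - ALLOWED_KEYS
    if unknown:
        raise ValueError("unknown keys: %s" % sorted(unknown))
    if not query.get("select"):
        raise ValueError("query must select at least one statistic")
    for cond in query.get("where", []):
        if cond["op"] not in WHERE_OPS:
            raise ValueError("unsupported operator: %s" % cond["op"])
    return query


def matches(query, attrs):
    """Return True if an operation's attributes satisfy every WHERE condition."""
    for cond in query.get("where", []):
        actual = attrs.get(cond["attr"])
        if actual is None:
            return False
        if cond["op"] == "EQ" and actual != cond["value"]:
            return False
        if cond["op"] == "STARTSWITH" and not str(actual).startswith(cond["value"]):
            return False
    return True
```

Because the query arrives as plain data, the daemon needs only a JSON decoder and a validation pass, and a human-readable front end could still be layered on top in a client tool later.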

More examples

GROUP BY/WHERE (OSD): object_id EQ, object_id STARTSWITH, cephfs_ino, pool, rbd_image, rgw_object, client_id, rados_op_type
GROUP BY/WHERE (MDS): ino, parent_ino, metadata_op_type, dname

SELECT (OSD): ops, read_bytes, write_bytes
SELECT (MDS): ops, cap_issues

See which folders in the top level of a cephfs filesystem are busy with metadata operations:
SELECT ops WHERE parent_ino=1 GROUP BY dname

Having identified a folder, what kind of ops are going on in there:
SELECT ops WHERE parent_ino=123 GROUP BY metadata_op_type