Feature #50980

Updated by Ernesto Puerta 10 months ago

h3. Description

The goal of this feature would be to provide a visual representation of the cluster status covering the following aspects:
* Data Cluster/Data integrity:
** Goal: Highlight damaged or compromised sections of the cluster. It should help show how a failure in one cluster item (PG, OSD, host, etc.) leads to degraded status in other items (e.g., a failure in one OSD causes its PGs to become degraded, and hence the peer PGs in other OSDs are affected).
** What: Suitable entities for this could be hosts, OSDs, or PGs (data placement items).
** How: A representation that could fit this goal would be ring-like (e.g., "Cassandra OpsCenter":https://docs.datastax.com/en/opscenter/6.1/opsc/online_help/opscNodeAdminRing.html) or a "multi-level ring/pie, aka Sunburst chart":https://en.wikipedia.org/wiki/Pie_chart#Ring (inner ring for hosts, middle ring for the OSDs belonging to each host, and outer ring for the PGs belonging to each OSD).
** Sample scenario: a given OSD 123 goes down (due to a disk I/O error). Then the OSD 123 stripe in the OSD ring turns red, its host in the inner ring turns orange (warning), and its PGs in the outer ring turn orange (warning). Additionally, the replicas of those PGs in other OSDs also turn orange, since they are degraded (one replica of each PG is lost).
** The position of adjacent hosts/OSDs/PGs in the ring may follow some topological criteria (even allowing for more inner rings: hosts belonging to the same "rack, chassis, row, ... datacenter, etc.":https://docs.ceph.com/en/latest/rados/operations/crush-map/#types-and-buckets).
* Data movement: Cluster throughput/traffic:
** Goal: Visualize how data moves within the cluster and identify hot-spots and bottlenecks (replica writes, erasure-code chunk reads/writes). Client traffic (inbound/outbound) could optionally be left out, since client traffic will mostly trigger cluster traffic anyway.
** What: It's not realistic to display every byte/request traversing the cluster network, so a threshold should be applied.
** How: A representation suitable for this could be the "Chord diagram":https://observablehq.com/@d3/chord-dependency-diagram, which maps flows among segments/stripes in a ring (including a multi-level one). In Chord diagrams the color and width of the segments can be used to describe
* Data placement:
** Goal: Detect hot-spots and imbalances in data distribution.
** Is this really an issue after PG balancer module?
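
The sample scenario for the Data integrity view can be sketched as a small status-propagation function. This is only an illustration under stated assumptions: the topology dictionaries, the color names and the @ring_colors@ helper are hypothetical and do not correspond to any existing Ceph or dashboard API. It also simplifies by coloring a host orange whenever any of its OSDs is down.

```python
OK, WARN, ERR = "green", "orange", "red"

# Hypothetical toy topology (not read from a real cluster):
# host -> OSD ids, and PG id -> OSDs holding a replica of that PG.
HOSTS = {"host-a": [0, 1], "host-b": [2, 3]}
PG_REPLICAS = {"1.0": [0, 2], "1.1": [1, 3], "1.2": [2, 3]}


def ring_colors(hosts, pg_replicas, down_osds):
    """Return (host_ring, osd_ring, pg_ring) color maps for a sunburst."""
    down = set(down_osds)

    # Middle ring: a down OSD is shown in red.
    osd_ring = {osd: ERR if osd in down else OK
                for osds in hosts.values() for osd in osds}

    # Inner ring: a host with at least one down OSD is shown in orange.
    host_ring = {host: WARN if down & set(osds) else OK
                 for host, osds in hosts.items()}

    # Outer ring: a PG that lost a replica is degraded, so every segment of
    # that PG turns orange, including its replicas on healthy OSDs.
    degraded = {pg for pg, osds in pg_replicas.items() if down & set(osds)}
    pg_ring = {(osd, pg): WARN if pg in degraded else OK
               for pg, osds in pg_replicas.items() for osd in osds}

    return host_ring, osd_ring, pg_ring


host_ring, osd_ring, pg_ring = ring_colors(HOSTS, PG_REPLICAS, down_osds=[0])
```

With OSD 0 down in this toy topology, the segment for PG 1.0 on OSD 2 also turns orange even though OSD 2 is healthy, which is the cross-OSD relationship the view is meant to expose.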


h3. References:

Sage suggested using "advanced visualization tools":https://www.d3-graph-gallery.com/bundle for displaying the cluster-OSD recovery I/O:

* "Hierarchical edge bundling":https://observablehq.com/@d3/hierarchical-edge-bundling (and "Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data paper":https://aviz.fr/wiki/uploads/Teaching2014/bundles_infovis.pdf)
* "chord dependency":https://observablehq.com/@d3/chord-dependency-diagram
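
Chord-style layouts such as the ones referenced above typically take a square flow matrix as input, where entry (i, j) is the traffic from node i to node j. The following is a minimal sketch of how per-OSD flow samples could be aggregated into such a matrix, with the threshold mentioned in the Data movement item applied. The sample tuples, the @min_bytes@ parameter and the @chord_matrix@ helper are assumptions made for illustration; how the flow samples would actually be collected from the cluster is not covered here.

```python
def chord_matrix(flows, nodes, min_bytes):
    """Aggregate (src, dst, nbytes) samples into a square matrix.

    Entry [i][j] is the total bytes sent from nodes[i] to nodes[j].
    Totals below min_bytes are zeroed so that minor flows are not drawn.
    """
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst, nbytes in flows:
        matrix[index[src]][index[dst]] += nbytes
    return [[v if v >= min_bytes else 0 for v in row] for row in matrix]


# Hypothetical recovery-traffic samples between three OSDs.
NODES = ["osd.0", "osd.1", "osd.2"]
FLOWS = [
    ("osd.0", "osd.1", 700),
    ("osd.0", "osd.1", 500),
    ("osd.1", "osd.2", 90),   # below the threshold, dropped from the diagram
    ("osd.2", "osd.0", 300),
]

matrix = chord_matrix(FLOWS, NODES, min_bytes=200)
```

The threshold is applied to the aggregated totals rather than to individual samples, so many small transfers between the same pair of OSDs can still add up to a visible flow.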

More references or examples of cluster visualization:

* "Example of a ring view in Cassandra OpsCenter":https://docs.datastax.com/en/opscenter/6.1/opsc/online_help/opscNodeAdminRing.html.
* "Presentation":https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2018/Projects/Final/wyuan.pptx
* "Netflix's Vizceral":https://netflixtechblog.com/vizceral-open-source-acc0c32113fe
