mgr/dashboard: retrieve "Data Health" info from dashboard backend
We want to add a chart or info panel to the Dashboard Landing Page
that shows "Data Health" based on PG info, as recommended by John Spray:
Ceph already internally has an opinion about the health of PGs
(resulting in the PG_DAMAGED, PG_UNAVAILABLE, PG_DEGRADED,
PG_DEGRADED_FULL health checks). You can see where that's done here:
The key point about this is that it's not the status of a PG that
users are really interested in: it's the health of their data (one
difference is that data can't be "working"). So the classifications
we have for the health are:
- Damaged: some (perhaps unrecoverable) data loss has occurred, Ceph
can't fix itself without help
- Degraded: the data is there and accessible, but not as redundant as
we would like
- Unavailable: we can't get at the data right now, but we believe
it's still stored somewhere.
So: if the goal is a simplified way to show PG health in the UI, my
suggestion is not to call it PG health at all, call it "data health"
and use exactly the same mappings that we use for the existing health
checks.
AFAIK, the statistics needed to show this data correctly are calculated
in the C++ code but are not currently exposed to Python modules.
The goal is to be able to retrieve this data health group info from the
dashboard backend so that it can be shown on the Landing Page.
#5 Updated by Boris Ranto 6 months ago
I have been looking into this and I have a couple of notes:
Internally (in the C++ code), we know all the PGs that are hitting a certain health check state. We are considering the four basic states, here: PG_AVAILABILITY, PG_DEGRADED, PG_DEGRADED_FULL, PG_DAMAGED.
These health checks are exposed to Python with mgr.get('health'). The only way a health check can communicate any data is through a detail message (string). We could add a detail message containing all the PGs, but I am not sure how the other developers would feel about this. Furthermore, the maximum number of detail messages is configurable and, afaik, it can be set to 0, so we still would not get the data.
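To make the detail-message limitation concrete, here is a rough sketch of pulling PG ids out of a health report. It assumes the report is a JSON string with a "checks" map whose entries carry a "detail" list of {"message": ...} items, and that each relevant message starts with "pg <id>". Both the layout and the message format are assumptions made for illustration, not a guaranteed interface, and pgs_by_check is a hypothetical helper name.

```python
import json
import re

# Assumption: detail messages begin with "pg <pool>.<hex seed>".
# This is free-form text, so the pattern may need adjusting.
PG_ID_RE = re.compile(r"^pg (\d+\.[0-9a-f]+)\b")


def pgs_by_check(health_json):
    """Map each health check name to the set of PG ids found in its details."""
    health = json.loads(health_json)
    result = {}
    for name, check in health.get("checks", {}).items():
        pgs = set()
        for item in check.get("detail", []):
            match = PG_ID_RE.match(item.get("message", ""))
            if match:
                pgs.add(match.group(1))
        result[name] = pgs
    return result


# Hand-written sample report (not real cluster output).
sample = json.dumps({
    "status": "HEALTH_ERR",
    "checks": {
        "PG_DEGRADED": {"detail": [
            {"message": "pg 1.0 is active+undersized+degraded"},
            {"message": "pg 1.a is active+degraded+inconsistent"},
        ]},
        "PG_DAMAGED": {"detail": [
            {"message": "pg 1.a is active+degraded+inconsistent"},
        ]},
    },
})

parsed = pgs_by_check(sample)
```

If the detail list is truncated or disabled by configuration, the sets returned here are incomplete or empty, which is exactly the problem described above.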
Also, the health check states mentioned above are not mutually exclusive -- i.e. a single PG can be in more than a single health check state -- at least there is nothing making sure these states are mutually exclusive in the C++ code of the get_health_check function. Technically, they could be mutually exclusive by their nature but I doubt that since a PG can have several PG_STATE_* flags set.
The fact that they are not mutually exclusive makes the python code pretty slow. We need to send the detail message with the list of all failing PGs (this can be pretty big) and we need to make sure we don't count any PG twice in the final chart.
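A minimal sketch of the de-duplication step, assuming we already have a set of PG ids per health check and the total PG count. The precedence order (damaged, then unavailable, then degraded) and the folding of PG_DEGRADED_FULL into "degraded" are assumptions for illustration only; data_health_counts is a hypothetical helper.

```python
from collections import Counter

# Assumed precedence: a PG listed under several checks is counted once,
# under the first category in this list that contains it.
PRIORITY = [
    ("PG_DAMAGED", "damaged"),
    ("PG_AVAILABILITY", "unavailable"),
    ("PG_DEGRADED_FULL", "degraded"),
    ("PG_DEGRADED", "degraded"),
]


def data_health_counts(pgs_by_check, total_pgs):
    """Count PGs per data health category without double counting."""
    seen = set()
    counts = Counter()
    for check, category in PRIORITY:
        fresh = pgs_by_check.get(check, set()) - seen
        counts[category] += len(fresh)
        seen |= fresh
    # Every PG not listed under any check is treated as healthy.
    counts["healthy"] = total_pgs - len(seen)
    return dict(counts)


# PG 1.a appears under two checks but is only counted as damaged.
example = data_health_counts(
    {"PG_DEGRADED": {"1.0", "1.a"}, "PG_DAMAGED": {"1.a"}},
    total_pgs=8,
)
```

The set operations are cheap per PG, but the cost still grows with the number of failing PGs that must be shipped to Python, which is the performance concern raised above.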
To add to that, there are other health checks that we might be interested in like OBJECT_MISPLACED.
That being said, I do have a proof of concept implementation of getting the chart data. If anyone is interested, we need the following two patches: