Feature #27049: mgr/dashboard: retrieve "Data Health" info from dashboard backend - Dashboard - Ceph

Feature #27049

open

mgr/dashboard: retrieve "Data Health" info from dashboard backend

Added by Alfonso Martínez over 5 years ago. Updated about 3 years ago.

Status: New
Priority: Normal
Assignee:
Category: General
Target version: Ceph - v14.0.0
% Done:
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

We want to add a chart/info to the Dashboard Landing Page
that shows "Data Health" based on PG info, as recommended by John Spray:

Ceph already internally has an opinion about the health of PGs
(resulting in the PG_DAMAGED, PG_UNAVAILABLE, PG_DEGRADED,
PG_DEGRADED_FULL health checks). You can see where that's done here:
https://github.com/ceph/ceph/blob/master/src/mon/PGMap.cc#L2179

The key point about this is that it's not the status of a PG that
users are really interested in: it's the health of their data (one
difference is that data can't be "working"). So the classifications
we have for the health are:
- Damaged: some (perhaps unrecoverable) data loss has occurred, Ceph
can't fix itself without help
- Degraded: the data is there and accessible, but not as redundant as
we would like
- Unavailable: we can't get at the data right now, but we believe
it's still stored somewhere.

So: if the goal is a simplified way to show PG health in the UI, my
suggestion is not to call it PG health at all, call it "data health"
and use exactly the same mappings that we use for the existing health
checks.
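A minimal sketch of that mapping, assuming the dashboard only knows which health check codes are currently active. The `DataHealth` enum, the `CHECK_TO_DATA_HEALTH` table and the `data_health` helper are hypothetical names introduced for illustration; both PG_UNAVAILABLE and PG_AVAILABILITY are listed because this thread uses both names for the availability check, and folding PG_DEGRADED_FULL into "degraded" is an assumption.

```python
from enum import Enum


class DataHealth(Enum):
    """Data health categories described above (names are illustrative)."""
    DAMAGED = "damaged"
    UNAVAILABLE = "unavailable"
    DEGRADED = "degraded"
    HEALTHY = "healthy"


# Assumed mapping from health check codes to data health categories.
CHECK_TO_DATA_HEALTH = {
    "PG_DAMAGED": DataHealth.DAMAGED,
    "PG_UNAVAILABLE": DataHealth.UNAVAILABLE,
    "PG_AVAILABILITY": DataHealth.UNAVAILABLE,
    "PG_DEGRADED": DataHealth.DEGRADED,
    "PG_DEGRADED_FULL": DataHealth.DEGRADED,  # assumption: treated as degraded
}


def data_health(active_checks):
    """Return the set of data health categories for the active check codes.

    Codes that are not PG-related are ignored. If no PG check is active,
    the data is reported as healthy.
    """
    found = {
        CHECK_TO_DATA_HEALTH[code]
        for code in active_checks
        if code in CHECK_TO_DATA_HEALTH
    }
    return found or {DataHealth.HEALTHY}
```

This only says which categories are present, not how much data falls into each one; per-category counts would need the statistics discussed below.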

AFAIK, statistics that are calculated in the C++ code are required in order
to show the data correctly, but are not currently exposed to Python modules.

The goal:
To be able to retrieve this data health group info from the dashboard backend
in order to show it on the Landing Page.

Related issues 3 (0 open — 3 closed)

Updated by Alfonso Martínez over 5 years ago

Subject changed from mgr/dashboard: to mgr/dashboard: retrieve "Data Health" info from dashboard backend

Updated by Alfonso Martínez over 5 years ago

Related to Feature #27050: mgr/dashboard: Landing Page Enhancements added

Updated by Alfonso Martínez over 5 years ago

Related to Feature #24573: mgr/dashboard: Provide more "native" dashboard widgets to display live performance data added

Updated by Lenz Grimmer over 5 years ago

Category set to 132
Target version set to v14.0.0

Updated by Boris Ranto over 5 years ago

I have been looking into this and I have a couple of notes:

Internally (in the C++ code), we know all the PGs that are hitting a certain health check state. We are considering four basic states here: PG_AVAILABILITY, PG_DEGRADED, PG_DEGRADED_FULL, PG_DAMAGED.

These health checks are exposed to Python with mgr.get('health'). The only way a health check can communicate any data is through a detail message (string). We could add a detail message containing all the PGs, but I am not sure how the other developers would view this. Furthermore, the maximum number of detail messages is configurable and, afaik, it can be set to 0, so we still would not get the data.
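As a rough sketch of what reading these checks could look like on the Python side: the payload layout below (a dict whose 'json' key holds a JSON string with a "checks" map, each check carrying a "detail" list of {"message": ...} entries) is an assumption about what mgr.get('health') returns, and `pg_check_details` is a hypothetical helper, not part of the mgr API.

```python
import json

# The four PG-related health checks considered in this comment.
PG_CHECKS = ("PG_AVAILABILITY", "PG_DEGRADED", "PG_DEGRADED_FULL", "PG_DAMAGED")


def pg_check_details(health_payload):
    """Collect the detail messages of the active PG health checks.

    health_payload is assumed to have the shape returned by
    mgr.get('health'): {'json': '<JSON-encoded health report>'}.
    Returns {check_code: [detail message, ...]} for active PG checks only.
    """
    report = json.loads(health_payload["json"])
    checks = report.get("checks", {})
    details = {}
    for code in PG_CHECKS:
        check = checks.get(code)
        if check is None:
            continue  # this check is not currently raised
        # The detail list may be empty if detail messages are limited to 0.
        details[code] = [entry["message"] for entry in check.get("detail", [])]
    return details
```

In a mgr module this would be called as `pg_check_details(mgr.get('health'))`. Note that the result only carries free-form strings, which is exactly the limitation described above.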

Also, the health check states mentioned above are not mutually exclusive -- i.e. a single PG can be in more than one health check state -- at least there is nothing making sure these states are mutually exclusive in the C++ code of the get_health_check function. Technically, they could be mutually exclusive by their nature, but I doubt that, since a PG can have several PG_STATE_* flags set.

The fact that they are not mutually exclusive makes the python code pretty slow. We need to send the detail message with the list of all failing PGs (this can be pretty big) and we need to make sure we don't count any PG twice in the final chart.
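One way to avoid double counting, sketched under the assumption that the backend already has the list of PG ids per health check (how those ids are obtained is the open question in this ticket). Each PG is attributed to the most severe check it appears in; the `SEVERITY_ORDER` ranking and the `count_unique_pgs` helper are hypothetical and chosen only for illustration.

```python
# Assumed ranking, most severe first. A PG that appears in several checks
# is counted only under the first check in this list that contains it.
SEVERITY_ORDER = ["PG_DAMAGED", "PG_AVAILABILITY", "PG_DEGRADED_FULL", "PG_DEGRADED"]


def count_unique_pgs(pgs_by_check, total_pgs):
    """Count each PG exactly once across overlapping health checks.

    pgs_by_check maps a health check code to an iterable of PG ids.
    total_pgs is the total number of PGs in the cluster.
    Returns per-check counts plus a "HEALTHY" count for the remainder.
    """
    seen = set()
    counts = {}
    for code in SEVERITY_ORDER:
        # Drop PGs already attributed to a more severe check.
        pgs = set(pgs_by_check.get(code, ())) - seen
        counts[code] = len(pgs)
        seen |= pgs
    counts["HEALTHY"] = total_pgs - len(seen)
    return counts
```

The set operations keep this linear in the number of listed PG ids, but the lists still have to be transferred from the C++ side first, which is the expensive part mentioned above.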

To add to that, there are other health checks that we might be interested in, like OBJECT_MISPLACED.

That being said, I do have a proof of concept implementation of getting the chart data. If anyone is interested, we need the following two patches:

https://pastebin.com/06f8RTz3
https://pastebin.com/zfU3i8sf
