Feature #7192: An easier-to-process health report - Ceph - Ceph

Feature #7192

closed

An easier-to-process health report

Added by John Spray over 10 years ago. Updated over 6 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

Monitor

Target version:

% Done:

Source:

other

Tags:

Backport:

Reviewed:

Affected Versions:

Pull request ID:

Description

Currently, the "ceph health" output is great for human consumption, but a bit awkward to feed into a monitoring app, because:

The notifications about what is wrong are human readable text rather than a fixed ID per type of problem
Sometimes we would like to be selective about what to alert users to, and it's not easy for us to filter health issues by type (without doing some ugly regexing or so).
There isn't one single list of health issues; we have to read out of "summary", "timechecks" and "health", and understand the syntax of each.

It would be useful if each of the possible issues that "ceph health" can report were assigned a unique identifier, so that a third party app could easily filter out specific types of issue. The health command could then output a unified list of the health checks, with the HEALTH_* status for each one: the kind of syntax that makes it trivial to display a "traffic light" view of system health.

My thoughts on implementing this are:

Invent a health_check_t struct which holds a check name like "OSDS_NEAR_FULL" or "MONS_OFFLINE" and a string like the existing summary strings.
In Monitor::get_health, update the existing "list<pair<health_status_t,string> > summary" structure, and change the other get_health functions to populate it with health_check_t instead of just strings.
Make the 'detail' section of the output a dictionary of health check name to detail items, so that we can associate the detail strings with issues.
Feed the timechecks and HealthMonitor output into the same overall list of health checks, so that consumers of the output don't have to traverse these structures separately.
Add a new output mode where instead of only printing the health problems, a full list of all possible health checks and their status (even if HEALTH_OK) is printed.
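The steps above can be modelled roughly as follows. This is only a sketch in Python rather than the monitor's C++, and the names HealthStatus, HealthCheck and build_report are made up for illustration; only the output keys ('checks', 'overall_status', 'detail') come from this proposal.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class HealthStatus(IntEnum):
    # Ordered so that max() picks the worst status.
    HEALTH_OK = 0
    HEALTH_WARN = 1
    HEALTH_ERR = 2


@dataclass
class HealthCheck:
    # Hypothetical Python stand-in for the proposed health_check_t.
    name: str
    severity: HealthStatus = HealthStatus.HEALTH_OK
    message: Optional[str] = None
    detail: List[str] = field(default_factory=list)


def build_report(checks, include_detail=False):
    """Merge checks from every source into one report dictionary."""
    report = {
        # Every check is listed, including the ones that are HEALTH_OK.
        "checks": {
            c.name: {"severity": c.severity.name, "message": c.message}
            for c in checks
        },
        # The overall status is the worst severity of any single check.
        "overall_status": max(
            (c.severity for c in checks), default=HealthStatus.HEALTH_OK
        ).name,
    }
    if include_detail:
        # Detail strings are keyed by check name, so they stay tied to an issue.
        report["detail"] = {c.name: c.detail for c in checks if c.detail}
    return report
```

Keeping the detail strings on the check itself is one way to make the 'detail' section a dictionary keyed by check name; the real structure may well differ.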

So the output would be something like:

'checks': {
 'OSD_NEAR_FULL': {'severity': 'HEALTH_OK', 'message': null},
 'MONS_DOWN': {'severity': 'HEALTH_WARN', 'message': "1/3 mons is down"},
 'CLOCK_SKEW': {'severity': 'HEALTH_WARN', 'message': "Skew detected on foohost"}
 ... and so on for all health checks ...
},
'overall_status': 'HEALTH_WARN',
'detail': {
  'MONS_DOWN': ["mon barhost is down"],
}

It is quite verbose to include all the health checks including the OK ones, but remember that this is intended primarily for machine consumption, and there are only about 20 of these checks in total.
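To illustrate what this buys a monitoring app, here is a minimal consumer sketch. It assumes the output is emitted as strict JSON (double quotes, unlike the shorthand above), and the helper names failing_checks and traffic_light are hypothetical.

```python
import json

# Rank of each status, used to find the worst remaining check.
SEVERITY_ORDER = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}
COLOURS = {"HEALTH_OK": "green", "HEALTH_WARN": "amber", "HEALTH_ERR": "red"}


def failing_checks(report, ignore=()):
    """Return {check_name: severity} for each non-OK check not in `ignore`."""
    return {
        name: check["severity"]
        for name, check in report["checks"].items()
        if check["severity"] != "HEALTH_OK" and name not in ignore
    }


def traffic_light(report, ignore=()):
    """Map the worst non-ignored severity to a colour."""
    worst = max(
        failing_checks(report, ignore).values(),
        key=SEVERITY_ORDER.__getitem__,
        default="HEALTH_OK",
    )
    return COLOURS[worst]


# Sample report in the proposed shape, using the values from this ticket.
sample = json.loads("""
{
  "checks": {
    "OSD_NEAR_FULL": {"severity": "HEALTH_OK", "message": null},
    "MONS_DOWN": {"severity": "HEALTH_WARN", "message": "1/3 mons is down"},
    "CLOCK_SKEW": {"severity": "HEALTH_WARN", "message": "Skew detected on foohost"}
  },
  "overall_status": "HEALTH_WARN",
  "detail": {"MONS_DOWN": ["mon barhost is down"]}
}
""")

print(traffic_light(sample))
print(traffic_light(sample, ignore=("MONS_DOWN", "CLOCK_SKEW")))
```

Filtering is a plain dictionary lookup on the check ID, with no regexing of the message text.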

Updated by Sage Weil about 10 years ago

This sounds great to me. I assume we'd keep the 'detail' section as optional as it can get quite (!) big.

Updated by Sage Weil almost 7 years ago

Two questions:

1. Should the severity be a property of the error code? e.g., we define a table of possible error codes, each with an associated severity. Otherwise, you might see OSD_NEAR_FULL appear with a warning, info, or error severity.

2. Is a list of strings sufficient for the detail? I wonder if we'd want something more structured (like a json dump).

Updated by John Spray almost 7 years ago

1. We should probably internally associate the severity with the type, but in the output we'd still write it separately for each line. Otherwise there would need to be a separate interface for learning the severity of each error type (or the caller would have to invent their own mapping of type to severity).
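A small sketch of that idea: the severity lives in one internal table keyed by check type, and the emitter copies it into every output entry so callers never need the table. The table contents and the name emit_check are assumptions for illustration, not actual Ceph definitions.

```python
# Hypothetical internal table: one fixed severity per check type.
# The severities listed here are illustrative guesses.
CHECK_SEVERITY = {
    "OSD_NEAR_FULL": "HEALTH_WARN",
    "MONS_DOWN": "HEALTH_WARN",
    "CLOCK_SKEW": "HEALTH_WARN",
}


def emit_check(name, message):
    """Build one output entry; the severity is looked up, never passed in."""
    # An unknown check name raises KeyError, so every type must be registered.
    return {name: {"severity": CHECK_SEVERITY[name], "message": message}}


print(emit_check("MONS_DOWN", "1/3 mons is down"))
```

Because the caller cannot supply a severity, a given check type can only ever appear with the one severity registered for it.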

2. Hmm, there's certainly scope for more detail (the strings are just the essential thing so that a dumb consumer can print something)

So when the strings are like ["osd 1 is down", "osd 2 is down"]...
We could do this:
[{"message": "osd 1 is down", "osd_id": 1}...]
or this:
{"messages": ["osd 1 is down", "osd 2 is down"], "osds": [1,2]}
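For what it's worth, the two candidate shapes carry the same information, so a consumer could convert between them. A quick sketch, assuming the second shape is a JSON object and that messages and IDs are listed in the same order; per_item and grouped are made-up helper names.

```python
def per_item(messages, osds):
    """First shape: one object per affected OSD."""
    return [{"message": m, "osd_id": o} for m, o in zip(messages, osds)]


def grouped(items):
    """Second shape: parallel lists of messages and OSD IDs."""
    return {
        "messages": [i["message"] for i in items],
        "osds": [i["osd_id"] for i in items],
    }


items = per_item(["osd 1 is down", "osd 2 is down"], [1, 2])
print(items)
print(grouped(items))
```

The per-item form keeps each message next to its ID, while the grouped form relies on the two lists staying aligned.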

My brain isn't expressing a strong preference at this moment, but I'm feeling a little stupid generally so there might be a good reason for one vs. the other when we get into it more.
