Feature #7192

closed

An easier-to-process health report

Added by John Spray over 10 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Currently, the "ceph health" output is great for human consumption, but a bit awkward to feed into a monitoring app, because:
  • The notifications about what is wrong are human-readable text rather than a fixed ID per type of problem.
  • Sometimes we would like to be selective about what to alert users to, and it's not easy for us to filter health issues by type (without resorting to some ugly regexing).
  • There isn't one single list of health issues; we have to read out of "summary", "timechecks" and "health", and understand the syntax of each.

It would be useful if each of the possible issues that "ceph health" can report were assigned a unique identifier, so that a third-party app could easily filter out specific types of issue. The health command could then output a unified list of the health checks and the HEALTH_* status for each one, the kind of syntax that makes it trivial to display a "traffic light" view of system health.
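To illustrate what a consumer gains from fixed IDs, here is a minimal sketch of the filtering and "traffic light" mapping a third-party app might do. Everything in it is an assumption made for illustration: CheckStatus, CheckMap, checks_to_alert and traffic_light are hypothetical names, the colour mapping is one possible choice, and the check names are the example ones used in this ticket rather than an agreed list.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical consumer-side view of one entry in the unified list.
struct CheckStatus {
  std::string severity;  // "HEALTH_OK", "HEALTH_WARN" or "HEALTH_ERR"
  std::string message;   // empty when there is nothing to report
};

typedef std::map<std::string, CheckStatus> CheckMap;  // check ID -> status

// IDs of checks that are not HEALTH_OK and that the operator has not muted.
std::vector<std::string> checks_to_alert(const CheckMap& checks,
                                         const std::set<std::string>& muted) {
  std::vector<std::string> alerts;
  for (CheckMap::const_iterator it = checks.begin(); it != checks.end(); ++it) {
    if (it->second.severity == "HEALTH_OK") continue;  // nothing wrong
    if (muted.count(it->first)) continue;              // filtered out by ID
    alerts.push_back(it->first);
  }
  return alerts;
}

// Traffic-light colour for a severity; anything unrecognised is treated as red.
std::string traffic_light(const std::string& severity) {
  if (severity == "HEALTH_OK") return "green";
  if (severity == "HEALTH_WARN") return "amber";
  return "red";
}

// Usage with the example check names from this ticket.
void demo_filter() {
  CheckMap checks;
  checks["OSD_NEAR_FULL"].severity = "HEALTH_OK";
  checks["MONS_DOWN"].severity = "HEALTH_WARN";
  checks["MONS_DOWN"].message = "1/3 mons is down";
  checks["CLOCK_SKEW"].severity = "HEALTH_WARN";
  checks["CLOCK_SKEW"].message = "Skew detected on foohost";

  std::set<std::string> muted;
  muted.insert("CLOCK_SKEW");  // this site does not want clock-skew alerts

  std::vector<std::string> alerts = checks_to_alert(checks, muted);
  assert(alerts.size() == 1 && alerts[0] == "MONS_DOWN");
  assert(traffic_light(checks["MONS_DOWN"].severity) == "amber");
}
```

The point of the sketch is that muting a type of issue becomes a set lookup on the ID, with no parsing of the message text.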

My thoughts on implementing this are:

  • Invent a health_check_t struct which holds a check name like "OSDS_NEAR_FULL" or "MONS_OFFLINE" and a string like the existing summary strings.
  • In Monitor::get_health, change the existing "list<pair<health_status_t,string> > summary" structure to hold health_check_t, and change the other get_health functions to populate it with health_check_t instead of just strings.
  • Make the 'detail' section of the output a dictionary of health check name to detail items, so that we can associate the detail strings with issues.
  • Feed the timechecks and HealthMonitor output into the same overall list of health checks, so that consumers of the output don't have to traverse these structures separately.
  • Add a new output mode where instead of only printing the health problems, a full list of all possible health checks and their status (even if HEALTH_OK) is printed.
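The first four points above could be sketched roughly as follows. This is a standalone illustration, not the monitor's actual code: the health_status_t enum is a stand-in declared here so the sketch compiles on its own (its numeric ordering is this sketch's choice), and health_summary_t, health_detail_t, add_check, add_detail and overall_status are hypothetical names.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>

// Stand-in for the monitor's health_status_t so the sketch is self-contained.
// The ordering (larger value = worse) is an assumption of this sketch.
enum health_status_t { HEALTH_OK = 0, HEALTH_WARN = 1, HEALTH_ERR = 2 };

// Proposed record: a fixed check name plus the existing human-readable summary.
struct health_check_t {
  std::string name;          // e.g. "MONS_DOWN"
  health_status_t severity;  // status of this one check
  std::string summary;       // same kind of text as today's summary strings
};

// One list for every source of health information (hypothetical typedefs).
typedef std::list<health_check_t> health_summary_t;
typedef std::map<std::string, std::list<std::string> > health_detail_t;

// What a get_health-style function would call instead of pushing a bare string.
void add_check(health_summary_t& summary, const std::string& name,
               health_status_t severity, const std::string& text) {
  health_check_t c;
  c.name = name;
  c.severity = severity;
  c.summary = text;
  summary.push_back(c);
}

// Detail lines are keyed by check name so they stay associated with their issue.
void add_detail(health_detail_t& detail, const std::string& name,
                const std::string& line) {
  detail[name].push_back(line);
}

// The overall status is the worst severity found in the unified list.
health_status_t overall_status(const health_summary_t& summary) {
  health_status_t worst = HEALTH_OK;
  for (health_summary_t::const_iterator it = summary.begin();
       it != summary.end(); ++it) {
    if (it->severity > worst) worst = it->severity;
  }
  return worst;
}

// Usage: monitor quorum and timecheck results land in the same list.
void demo_unified_list() {
  health_summary_t summary;
  health_detail_t detail;
  add_check(summary, "MONS_DOWN", HEALTH_WARN, "1/3 mons is down");
  add_detail(detail, "MONS_DOWN", "mon barhost is down");
  add_check(summary, "CLOCK_SKEW", HEALTH_WARN, "Skew detected on foohost");
  assert(summary.size() == 2);
  assert(overall_status(summary) == HEALTH_WARN);
  assert(detail["MONS_DOWN"].size() == 1);
}
```

Keeping the detail strings in a map keyed by check name is what lets a consumer look up the details for one issue without scanning unrelated lines.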

So the output would be something like:

'checks': {
 'OSD_NEAR_FULL': {'severity': 'HEALTH_OK', 'message': null},
 'MONS_DOWN': {'severity': 'HEALTH_WARN', 'message': "1/3 mons is down"},
 'CLOCK_SKEW': {'severity': 'HEALTH_WARN', 'message': "Skew detected on foohost"}
 ... and so on for all health checks ...
},
'overall_status': 'HEALTH_WARN',
'detail': {
  'MONS_DOWN': ["mon barhost is down"],
}

It is quite verbose to include all the health checks including the OK ones, but remember that this is intended primarily for machine consumption, and there are only about 20 of these checks in total.
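One way the "list everything" mode could be produced is to start from a table of all known check names at HEALTH_OK and overlay whatever is currently active. The sketch below assumes that approach; CheckStatus, CheckMap and full_report are hypothetical names, an empty string stands in for the null message, and the list of known checks is just the three example names from this ticket.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-check entry in the verbose output.
struct CheckStatus {
  std::string severity;  // "HEALTH_OK", "HEALTH_WARN" or "HEALTH_ERR"
  std::string message;   // empty string stands in for a null message
};

typedef std::map<std::string, CheckStatus> CheckMap;  // check ID -> status

// Build the full list: every known check defaults to HEALTH_OK, then the
// currently active checks overwrite their own entries.
CheckMap full_report(const std::vector<std::string>& known_checks,
                     const CheckMap& active) {
  CheckMap report;
  for (std::vector<std::string>::const_iterator it = known_checks.begin();
       it != known_checks.end(); ++it) {
    report[*it].severity = "HEALTH_OK";
  }
  for (CheckMap::const_iterator it = active.begin(); it != active.end(); ++it) {
    report[it->first] = it->second;
  }
  return report;
}

// Usage with the example check names from this ticket.
void demo_full_report() {
  std::vector<std::string> known;
  known.push_back("OSD_NEAR_FULL");
  known.push_back("MONS_DOWN");
  known.push_back("CLOCK_SKEW");

  CheckMap active;
  active["MONS_DOWN"].severity = "HEALTH_WARN";
  active["MONS_DOWN"].message = "1/3 mons is down";

  CheckMap report = full_report(known, active);
  assert(report.size() == 3);
  assert(report["OSD_NEAR_FULL"].severity == "HEALTH_OK");
  assert(report["OSD_NEAR_FULL"].message.empty());
  assert(report["MONS_DOWN"].severity == "HEALTH_WARN");
}
```

With this shape a consumer always sees the same set of keys, so a check that disappears from the output can be told apart from one that is simply OK.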
