Bug #63292

open

Prometheus exporter bug

Added by Navid Golpa 7 months ago. Updated 6 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading from Quincy to Reef, the prometheus exporter stopped working. The /metrics endpoint now returns 503 with the error message:

Traceback (most recent call last):
  File "/lib/python3/dist-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/lib/python3/dist-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/lib/python3/dist-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1816, in metrics
    return self._metrics(_global_instance)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1827, in _metrics
    raise cherrypy.HTTPError(503, 'No cached data available yet')
cherrypy._cperror.HTTPError: (503, 'No cached data available yet')

In the mgr logs we see:

2023-10-23T11:32:05.351-0400 7fee7e97b700  0 [prometheus ERROR root] failed to collect metrics:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 513, in collect
    data = self.mod.collect()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1637, in collect
    self.get_metadata_and_osd_status()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1121, in get_metadata_and_osd_status
    osd_map = self.get('osd_map')
  File "/usr/share/ceph/mgr/mgr_module.py", line 1366, in get
    obj = json.loads(obj)
  File "/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 4145 (char 4144)

Actions #1

Updated by Navid Golpa 6 months ago

We did some more digging and found that the JSON the mgr is trying to decode is indeed invalid:

{
  ...
  "pools": [
    ...
    {
      "pool": 2,
      "pool_name": ".mgr",
      "read_balance": {
        "score_acting": inf,
        "score_stable": inf,
        "optimal_score": 0,
        ...
      },
      ...
    },
    ...
  ],
  ...
}

inf is indeed not a valid value in JSON, which is why the exception above is happening.
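
This is easy to reproduce in isolation with Python's json module (a minimal sketch; the fragment below is made up to mirror the osdmap output above):

import json

# A bare "inf" token is rejected with the same "Expecting value" error seen
# in the tracebacks above:
snippet = '{"read_balance": {"score_acting": inf, "optimal_score": 0}}'
try:
    json.loads(snippet)
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 35 (char 34)

# The decoder does accept the non-standard "Infinity"/"NaN" tokens, so it is
# specifically the lowercase "inf" in the osdmap dump that breaks parsing:
print(json.loads('{"score_acting": Infinity}'))  # {'score_acting': inf}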

This bug is starting to affect us more severely, as it keeps ceph in an error state and prevents some routine tasks from running.

Actions #2

Updated by Nizamudeen A 6 months ago

Hi Navid,

Starting with Reef, the performance metrics are no longer exported by the prometheus module by default; instead, a new daemon called ceph-exporter, which runs on all hosts, is used to export them. You can read about it here: https://docs.ceph.com/en/reef/monitoring/?highlight=ceph-exporter#ceph-metrics. Have you created the ceph-exporter service?
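
If the cluster is managed by cephadm, the exporter service can typically be created with something along these lines (shown only as an example; please follow the linked docs for the exact steps for your deployment):

$ sudo ceph orch apply ceph-exporter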

Actions #3

Updated by Navid Golpa 6 months ago

Hi,
Thank you for your response. I'm sorry, but I didn't fully explain the problem. We initially discovered it through the prometheus exporter, but we have since realized it is not limited to that module.

The devicehealth mgr module also suffers from the same problem, and that is what is causing the HEALTH_ERR state to appear in ceph:

$ sudo ceph health detail
...
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed: Expecting value: line 1 column 4167 (char 4166)
    Module 'devicehealth' has failed: Expecting value: line 1 column 4167 (char 4166)
...

There may be other modules affected as well, since the Python exception is thrown by this line of code in mgr_module.py.

That function parses JSON using Python's decoder. Unfortunately, the provided JSON string is not considered valid by the Python decoder, which raises an exception and causes the devicehealth module to fail.

The JSON the devicehealth module is trying to decode is the osdmap, so it is arguable that the code generating the osdmap JSON shouldn't be emitting a value of inf. We're not even certain inf is supposed to be a possible value for that section of the osdmap (possibly a divide by 0 bug somewhere?).
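
As a side note on why inf really has no place in a JSON dump: even Python's own encoder only emits it via a non-standard token and refuses outright in strict mode (a small illustrative check, not ceph code):

import json

print(json.dumps(float('inf')))            # 'Infinity' (non-standard JSON extension)
json.dumps(float('inf'), allow_nan=False)  # ValueError: Out of range float values
                                           # are not JSON compliant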

Our understanding of the ceph source code is very limited, so please take our analysis with a grain of salt, but the end result is that a non-standard osdmap JSON causes an unhandled exception in the mgr module, which in turn puts ceph in a HEALTH_ERR state.

Actions #4

Updated by Navid Golpa 6 months ago

Another command that's running into the same issue:

$ sudo ceph fs status
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1783, in _handle_command
    return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/status/module.py", line 185, in handle_fs_status
    osdmap = self.get("osd_map")
  File "/usr/share/ceph/mgr/mgr_module.py", line 1373, in get
    raise e
  File "/usr/share/ceph/mgr/mgr_module.py", line 1367, in get
    obj = json.loads(obj)
  File "/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 4167 (char 4166)

Actions #5

Updated by Navid Golpa 6 months ago

We did some more digging and believe the culprit is in OSDMap.cc/h. The rbi_round() function rounds anything below 0.005 to 0. When it is used in OSDMap::set_rbi() to compute rbi.optimal_score, the result is 0 even though the comment on that line suggests the score should be >= 1. This leads to a divide by 0, which produces the value inf.

The scenario in which this happens is having many OSDs but a pool stored on very few of them, as is the case with the .mgr pool. We have a cluster with 4109 OSDs, and the .mgr pool is by default a replicated pool with size 3 and a pg_num of 1. So that line computes 3/4109 =~ 0.0007, which is rounded to 0.
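
To make the mechanism concrete, here is a minimal Python sketch of what we believe happens (a hypothetical re-implementation for illustration only; the real code is C++ in OSDMap.cc, and we are assuming rbi_round() rounds to two decimal places, which matches the below-0.005 behaviour described above):

import math

def rbi_round(x):
    # assumed behaviour: round to two decimal places, so anything below 0.005 becomes 0
    return math.floor(x * 100 + 0.5) / 100 if x > 0 else math.ceil(x * 100 - 0.5) / 100

# .mgr pool: replicated size 3, pg_num 1, on a cluster with 4109 OSDs
optimal_score = rbi_round(3 / 4109)  # 0.0007... rounds down to 0.0
print(optimal_score)                 # 0.0

# The read-balance scores are then derived by dividing by optimal_score.
# In C++ a floating-point division by 0.0 yields inf (Python would raise
# ZeroDivisionError), and that inf is what ends up in the osdmap JSON dump.
score_acting = 1.0 / optimal_score if optimal_score else math.inf
print(score_acting)                  # inf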

We confirmed we can work around this by increasing the pg_num of the .mgr pool from 1 to 32, which results in 32/4109 =~ 0.0078 and avoids the rounding down to 0. Restarting the mgr after the increase made the error go away, and ceph exited the HEALTH_ERR state.
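
Plugging the new pg_num into the same sketch shows why the workaround avoids the rounding (same assumptions as above):

print(rbi_round(32 / 4109))        # 0.01 -- no longer rounded down to 0
print(1.0 / rbi_round(32 / 4109))  # 100.0 -- finite, so the osdmap JSON stays valid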
