Bug #48815
Closed
prometheus plugin recently fails
Description
We are running 3 Ceph clusters, all on version 14.2.9.
In one of our clusters (the biggest one, with ~400 OSDs and ~1 PB) the Ceph prometheus module is unstable as f...
We are hitting this error:
curl:
HTTP/1.1 500 Internal Server Error
Date: Sat, 09 Jan 2021 10:12:50 GMT
Content-Length: 1753
Content-Type: text/html;charset=utf-8
Server: CherryPy/3.2.2
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<title>500 Internal Server Error</title>
<style type="text/css">
#powered_by {
margin-top: 20px;
border-top: 2px solid black;
font-style: italic;
}
#traceback {
color: red;
}
</style>
</head>
<body>
<h2>500 Internal Server Error</h2>
<p>The server encountered an unexpected condition which prevented it from fulfilling the request.</p>
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 656, in respond
response.body = self.handler()
File "/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 34, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/share/ceph/mgr/prometheus/module.py", line 1060, in metrics
return self._metrics(instance)
File "/usr/share/ceph/mgr/prometheus/module.py", line 1074, in _metrics
instance.collect_cache = instance.collect()
File "/usr/share/ceph/mgr/prometheus/module.py", line 975, in collect
self.get_rbd_stats()
File "/usr/share/ceph/mgr/prometheus/module.py", line 734, in get_rbd_stats
'rbd_stats_pools_refresh_interval', 300)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
<div id="powered_by">
<span>Powered by <a href="http://www.cherrypy.org">CherryPy 3.2.2</a></span>
</div>
</body>
</html>
From the log itself:
2021-01-09 11:22:26.087 7f280f729700 0 mgr[prometheus] [09/Jan/2021:11:22:26] HTTP Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 656, in respond
response.body = self.handler()
File "/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 34, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/share/ceph/mgr/prometheus/module.py", line 1060, in metrics
return self._metrics(instance)
File "/usr/share/ceph/mgr/prometheus/module.py", line 1074, in _metrics
instance.collect_cache = instance.collect()
File "/usr/share/ceph/mgr/prometheus/module.py", line 975, in collect
self.get_rbd_stats()
File "/usr/share/ceph/mgr/prometheus/module.py", line 734, in get_rbd_stats
'rbd_stats_pools_refresh_interval', 300)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Updated by Torsten Ennenbach over 3 years ago
Did a little debugging with my colleague here:
Michael :palme: 31 minutes ago
I just had a look and it really looks like a simple type-cast is missing. But I also would like to check why it returns a string instead of the 300 that was given as the default value.
Maybe the config for "rbd_stats_pools_refresh_interval" is wrong and NOT an integer as the code expects here. Worth a little check wherever those configs are kept @TorstenE
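For illustration, the suspected failure can be reproduced outside of Ceph. This is only a minimal sketch, assuming the module adds the configured interval to an integer and that a value set in the config store comes back as a string. The names `STORED`, `get_option`, `next_refresh_unsafe` and `next_refresh_safe` are hypothetical and are not taken from module.py:

```python
# Hypothetical stand-in for the mgr config store (an assumption, not Ceph code):
# a value that was explicitly set is assumed to come back as a string.
STORED = {"rbd_stats_pools_refresh_interval": "600"}


def get_option(name, default):
    """Return the stored value if present, else the integer default."""
    return STORED.get(name, default)


def next_refresh_unsafe(now):
    # Same shape as the failing expression in the traceback: int + str.
    return now + get_option("rbd_stats_pools_refresh_interval", 300)


def next_refresh_safe(now):
    # An explicit cast works whether the store holds "600" or 600.
    return now + int(get_option("rbd_stats_pools_refresh_interval", 300))


if __name__ == "__main__":
    try:
        next_refresh_unsafe(1000)
    except TypeError as exc:
        print("unsafe:", exc)
    print("safe:", next_refresh_safe(1000))
```

In this sketch the default of 300 is only used when nothing is stored, which would explain why the error appears only once the option has been set explicitly.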
Michael :palme: 27 minutes ago
The log data seems to indicate that the value for rbd_stats_pools_refresh_interval is indeed a string and not an integer in our configuration:
var/log/ceph/ceph-mon.mon4.log:2021-01-09 11:01:31.676 7f8f6ad9f700 0 mon.mon4@0(leader) e11 handle_command mon_command({"prefix": "config set", "who": "mgr", "name": "mgr/prometheus/rbd_stats_pools_refresh_interval", "value": "600"} v 0) v1
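To check the type carried in that log entry, the JSON payload can be parsed directly. A small sketch, assuming the payload is everything between `mon_command(` and the last closing brace; the helper `extract_command` is hypothetical. It only shows how the value appears in the command payload, not how the prometheus module reads it back:

```python
import json

# Log entry copied from the ceph-mon log quoted above (timestamp prefix omitted).
LOG_LINE = (
    'handle_command mon_command({"prefix": "config set", "who": "mgr", '
    '"name": "mgr/prometheus/rbd_stats_pools_refresh_interval", '
    '"value": "600"} v 0) v1'
)


def extract_command(line):
    """Pull the JSON object out of a `mon_command({...} v 0)` log entry."""
    start = line.index("mon_command(") + len("mon_command(")
    end = line.rindex("}") + 1
    return json.loads(line[start:end])


command = extract_command(LOG_LINE)
value = command["value"]

# Report the option name together with the Python type of its value.
print(command["name"], type(value).__name__, repr(value))
```

The value parses as a string, so any consumer that expects an integer would need to convert it, for example with `int(value)`.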
Our fix was to change the configuration from this:
[root@ceph-rbd-mon1 ~]# ceph config dump
WHO  MASK  LEVEL     OPTION                                           VALUE         RO
mon        advanced  mon_sync_max_payload_size                        4096
mgr        advanced  mgr/balancer/active                              true
mgr        advanced  mgr/balancer/mode                                upmap
mgr        advanced  mgr/balancer/upmap_max_deviation                 2
mgr        advanced  mgr/dashboard/ssl                                false         *
mgr        advanced  mgr/prometheus/rbd_stats_pools                   rbd, archive  *
mgr        advanced  mgr/prometheus/rbd_stats_pools_refresh_interval  600           *
mgr        advanced  mgr/prometheus/scrape_interval                   300           *
mgr        basic     mgr_stats_period                                 5
osd        advanced  debug_bluefs                                     0/0
osd        advanced  debug_bluestore                                  0/0
osd        advanced  debug_rocksdb                                    0/0
osd        advanced  osd_mon_heartbeat_stat_stale                     1
osd        advanced  osd_snap_trim_sleep                              1.000000
[root@ceph-rbd-mon1 ~]#
to this:
[root@ceph-rbd-mon1 ~]# ceph config dump
WHO  MASK  LEVEL     OPTION                            VALUE         RO
mon        advanced  mon_sync_max_payload_size         4096
mgr        advanced  mgr/balancer/active               true
mgr        advanced  mgr/balancer/mode                 upmap
mgr        advanced  mgr/balancer/upmap_max_deviation  2
mgr        advanced  mgr/dashboard/ssl                 false         *
mgr        advanced  mgr/prometheus/rbd_stats_pools    rbd, archive  *
mgr        advanced  mgr/prometheus/scrape_interval    300           *
mgr        basic     mgr_stats_period                  5
osd        advanced  debug_bluefs                      0/0
osd        advanced  debug_bluestore                   0/0
osd        advanced  debug_rocksdb                     0/0
osd        advanced  osd_mon_heartbeat_stat_stale      1
osd        advanced  osd_snap_trim_sleep               1.000000
by running: ceph config rm mgr mgr/prometheus/rbd_stats_pools_refresh_interval
and the prometheus module works again.
Updated by Torsten Ennenbach over 3 years ago
Well, only for a couple of minutes :(
Updated by Kefu Chai about 3 years ago
- Status changed from New to Can't reproduce
This issue was fixed in https://github.com/ceph/ceph/pull/35248/commits/3be27e0395cb79136555f792d67ef37c4f4c895a as part of https://github.com/ceph/ceph/pull/35248.
Please install 14.2.10 or later.