Bug #48815

closed

prometheus plugin recently fails

Added by Torsten Ennenbach over 3 years ago. Updated about 3 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
prometheus module
Target version:
-
% Done:

0%

Source:
Tags:
14.2.9 prometheus ceph-mgr
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running 3 Ceph clusters, all on version 14.2.9.

In one of our clusters (the biggest one, with ~400 OSDs and ~1PT) the Ceph prometheus module is unstable as f...
We are hitting this error:

curl:

HTTP/1.1 500 Internal Server Error
Date: Sat, 09 Jan 2021 10:12:50 GMT
Content-Length: 1753
Content-Type: text/html;charset=utf-8
Server: CherryPy/3.2.2

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<title>500 Internal Server Error</title>
<style type="text/css">
#powered_by {
margin-top: 20px;
border-top: 2px solid black;
font-style: italic;
}

#traceback {
color: red;
}
</style>
</head>
<body>
<h2>500 Internal Server Error</h2>
<p>The server encountered an unexpected condition which prevented it from fulfilling the request.</p>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 656, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 188, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 34, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1060, in metrics
    return self._metrics(instance)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1074, in _metrics
    instance.collect_cache = instance.collect()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 975, in collect
    self.get_rbd_stats()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 734, in get_rbd_stats
    'rbd_stats_pools_refresh_interval', 300)
TypeError: unsupported operand type(s) for +: 'int' and 'str'

<div id="powered_by">
<span>Powered by <a href="http://www.cherrypy.org">CherryPy 3.2.2</a></span>
</div>
</body>
</html>

From the log itself:

2021-01-09 11:22:26.087 7f280f729700 0 mgr[prometheus] [09/Jan/2021:11:22:26] HTTP Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 656, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 188, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 34, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1060, in metrics
    return self._metrics(instance)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1074, in _metrics
    instance.collect_cache = instance.collect()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 975, in collect
    self.get_rbd_stats()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 734, in get_rbd_stats
    'rbd_stats_pools_refresh_interval', 300)
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Actions #1

Updated by Torsten Ennenbach over 3 years ago

Did a little debugging with my colleague here:

Michael, 31 minutes ago:
I just had a look and it really looks like a simple type-cast is missing. But I also would like to check why it returns a string instead of the 300 that was given as the default value.
Maybe the config for "rbd_stats_pools_refresh_interval" is wrong and NOT an integer as the code expects here. Worth a little check wherever those configs are kept @TorstenE
Michael, 27 minutes ago:
The log data seems to indicate that the value for rbd_stats_pools_refresh_interval is indeed a string and not an integer in our configuration:
var/log/ceph/ceph-mon.mon4.log:2021-01-09 11:01:31.676 7f8f6ad9f700 0 mon.mon4@0(leader) e11 handle_command mon_command({"prefix": "config set", "who": "mgr", "name": "mgr/prometheus/rbd_stats_pools_refresh_interval", "value": "600"} v 0) v1
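
To illustrate Michael's suspicion, here is a minimal sketch of how a string-valued option could produce the TypeError from the traceback, and how an explicit cast would avoid it. This is not the actual code from module.py: get_module_option, needs_refresh_unsafe and needs_refresh_safe are hypothetical names, and the shape of the failing expression is only inferred from the traceback. It also assumes that a value set through "ceph config set" reaches the module as a string, which this report does not confirm.

```python
def get_module_option(store, name, default):
    """Hypothetical stand-in for the mgr option lookup.

    Returns the stored value unchanged (a str if it was stored as text),
    otherwise the default.
    """
    return store.get(name, default)


def needs_refresh_unsafe(store, last_refresh, now):
    # Assumed shape of the failing expression: int + option value.
    # If the option value is a str, this raises TypeError.
    return last_refresh + get_module_option(
        store, 'rbd_stats_pools_refresh_interval', 300) < now


def needs_refresh_safe(store, last_refresh, now):
    # Cast explicitly so that a string such as "600" is accepted too.
    interval = int(get_module_option(
        store, 'rbd_stats_pools_refresh_interval', 300))
    return last_refresh + interval < now


# Option stored as a string, as the mon log line suggests.
store = {'rbd_stats_pools_refresh_interval': '600'}

try:
    needs_refresh_unsafe(store, 0, 1000)
except TypeError as exc:
    print('unsafe lookup failed:', exc)

print(needs_refresh_safe(store, 0, 1000))  # 0 + 600 < 1000
print(needs_refresh_safe({}, 0, 200))      # default 300: 0 + 300 < 200
```

With the default in place (option unset), both variants behave the same, which would be consistent with the module working again after the option was removed.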

Our fix was to change the config from this:
[root@ceph-rbd-mon1 ~]# ceph config dump
WHO MASK LEVEL    OPTION                                          VALUE        RO
mon      advanced mon_sync_max_payload_size                       4096
mgr      advanced mgr/balancer/active                             true
mgr      advanced mgr/balancer/mode                               upmap
mgr      advanced mgr/balancer/upmap_max_deviation                2
mgr      advanced mgr/dashboard/ssl                               false        *
mgr      advanced mgr/prometheus/rbd_stats_pools                  rbd, archive *
mgr      advanced mgr/prometheus/rbd_stats_pools_refresh_interval 600          *
mgr      advanced mgr/prometheus/scrape_interval                  300          *
mgr      basic    mgr_stats_period                                5
osd      advanced debug_bluefs                                    0/0
osd      advanced debug_bluestore                                 0/0
osd      advanced debug_rocksdb                                   0/0
osd      advanced osd_mon_heartbeat_stat_stale                    1
osd      advanced osd_snap_trim_sleep                             1.000000
[root@ceph-rbd-mon1 ~]#

to this:
[root@ceph-rbd-mon1 ~]# ceph config dump
WHO MASK LEVEL    OPTION                                          VALUE        RO
mon      advanced mon_sync_max_payload_size                       4096
mgr      advanced mgr/balancer/active                             true
mgr      advanced mgr/balancer/mode                               upmap
mgr      advanced mgr/balancer/upmap_max_deviation                2
mgr      advanced mgr/dashboard/ssl                               false        *
mgr      advanced mgr/prometheus/rbd_stats_pools                  rbd, archive *
mgr      advanced mgr/prometheus/scrape_interval                  300          *
mgr      basic    mgr_stats_period                                5
osd      advanced debug_bluefs                                    0/0
osd      advanced debug_bluestore                                 0/0
osd      advanced debug_rocksdb                                   0/0
osd      advanced osd_mon_heartbeat_stat_stale                    1
osd      advanced osd_snap_trim_sleep                             1.000000

by running: ceph config rm mgr mgr/prometheus/rbd_stats_pools_refresh_interval

After that, the prometheus module worked again.

Actions #2

Updated by Torsten Ennenbach over 3 years ago

Well, only for a couple of minutes :(

Actions #3

Updated by Kefu Chai about 3 years ago

  • Status changed from New to Can't reproduce