Bug #23300

closed

ceph-mgr returns internal error

Added by Nico Schottelius about 6 years ago. Updated about 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: prometheus module
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

After some weeks of running a new Ceph cluster, we get the following answer from the mgr:

black3.place6:~# curl http://[2a0a:e5c0:2:1:20d:b9ff:fe48:3bb8]:9283/metrics
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<title>500 Internal Server Error</title>
<style type="text/css">
#powered_by {
margin-top: 20px;
border-top: 2px solid black;
font-style: italic;
}

#traceback {
color: red;
}
</style>
</head>
<body>
<h2>500 Internal Server Error</h2>
<p>The server encountered an unexpected condition which prevented it from fulfilling the request.</p>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 414, in metrics
    metrics = global_instance().collect()
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 351, in collect
    self.get_metadata_and_osd_status()
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 310, in get_metadata_and_osd_status
    dev_class['class'],
KeyError: 'class'

<div id="powered_by">
<span>
Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
</span>
</div>
</body>
</html>

Changing / starting another mgr does not fix this problem.
We are using 12.2.4-1~bpo90+1 on Devuan ASCII.
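
For reference, the KeyError comes from the prometheus module indexing the 'class' key of every device in the CRUSH map; OSDs that have no device class assigned simply lack that key. A minimal sketch of the failure mode (the dictionary shapes are assumptions based on the traceback, not copied from the module code):

# Illustrative sketch only; data shapes are assumed from the traceback above.
# In the CRUSH map, device entries only carry a "class" key when a device
# class has actually been assigned to the OSD.
crush_devices = [
    {"id": 0, "name": "osd.0", "class": "hdd-small"},
    {"id": 12, "name": "osd.12"},  # no device class assigned yet
]

for dev in crush_devices:
    try:
        print(dev["class"])   # raises KeyError: 'class' for osd.12, as in the traceback
    except KeyError:
        print("no device class for", dev["name"])

# A defensive lookup such as dev.get("class", "") would avoid the exception.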


Files

t.png (88.9 KB) Nico Schottelius, 03/11/2018 07:19 PM
#1

Updated by Nico Schottelius about 6 years ago

Fun fact: it used to run fine until we introduced new crush rules and changed the crush rule for a pool:

ceph osd crush rule create-replicated hdd-small default host hdd-small
ceph osd crush rule create-replicated hdd-big default host hdd-big
ceph osd pool set hdd crush_rule hdd-big

#2

Updated by Nico Schottelius about 6 years ago

Found it! We had several OSDs without a device class attached, because we did not want to use them at the moment.
Adding a "fake" class to them fixed the mgr's prometheus interface.


[20:23:10] server1.place6:~# ceph osd crush set-device-class notinuse 12 14 11 13 25 4
set osd(s) 4,11,12,13,14,25 to class 'notinuse'

[20:25:20] server1.place6:~# ceph osd tree
ID CLASS     WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
-1           125.59419 root default                             
-7            46.60368     host server2                         
15   hdd-big   9.09511         osd.15       up  1.00000 1.00000 
20   hdd-big   9.09511         osd.20       up  1.00000 1.00000 
21   hdd-big   9.09511         osd.21       up  1.00000 1.00000 
 7 hdd-small   4.54776         osd.7        up  1.00000 1.00000 
 8 hdd-small   4.54776         osd.8        up  1.00000 1.00000 
10 hdd-small   4.54776         osd.10       up  1.00000 1.00000 
12  notinuse   0.21767         osd.12       up  1.00000 1.00000 
14  notinuse   5.45741         osd.14       up  1.00000 1.00000 
-5            42.50967     host server3                         
 9   hdd-big   9.09511         osd.9        up  1.00000 1.00000 
16   hdd-big   9.09511         osd.16       up  1.00000 1.00000 
19   hdd-big   9.09511         osd.19       up  1.00000 1.00000 
 3 hdd-small   4.54776         osd.3        up  1.00000 1.00000 
 5 hdd-small   4.54776         osd.5        up  1.00000 1.00000 
 6 hdd-small   4.54776         osd.6        up  1.00000 1.00000 
11  notinuse   0.45424         osd.11       up  1.00000 1.00000 
13  notinuse   0.90907         osd.13       up  1.00000 1.00000 
25  notinuse   0.21776         osd.25       up  1.00000 1.00000 
-2            36.48083     host server4                         
 2   hdd-big   9.09511         osd.2        up  1.00000 1.00000 
17   hdd-big   9.09511         osd.17       up  1.00000 1.00000 
18   hdd-big   9.09511         osd.18       up  1.00000 1.00000 
 0 hdd-small   4.54776         osd.0        up  1.00000 1.00000 
 1 hdd-small   4.54776         osd.1        up  1.00000 1.00000 
 4  notinuse   0.09999         osd.4        up  1.00000 1.00000 
[20:26:39] server1.place6:~# 
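
To spot OSDs that still have no device class, something along these lines can be used; it assumes 'ceph osd crush dump' emits JSON with a "devices" array whose entries only contain a "class" key when one is set (a sketch only, adjust as needed):

# Sketch: list OSDs without a device class by parsing the CRUSH map dump.
# Assumes "devices" entries only carry a "class" key when a class is set.
import json
import subprocess

dump = json.loads(subprocess.check_output(["ceph", "osd", "crush", "dump"]))
missing = [dev["name"] for dev in dump.get("devices", []) if "class" not in dev]
print("OSDs without a device class:", ", ".join(missing) or "none")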

#3

Updated by Greg Farnum about 6 years ago

  • Project changed from Ceph to mgr
#4

Updated by John Spray about 6 years ago

  • Category set to prometheus module
  • Status changed from New to Duplicate

This was fixed in master recently and is being backported to luminous here: https://github.com/ceph/ceph/pull/20642
