Bug #47875

mgr/prometheus : ENGINE Error in HTTPServer.tick

Added by Christophe Trussardi over 3 years ago. Updated about 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: prometheus module
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In a large Ceph cluster (more than 1000 OSDs, running the docker image ceph/daemon:v5.0.3-stable-5.0-octopus-centos-8), the Prometheus module in the MGR loops on a Python 3 "cheroot" error:

ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python3.6/dist-packages/cheroot/server.py", line 1770, in serve
    self.tick()
  File "/usr/lib/python3.6/dist-packages/cheroot/server.py", line 1993, in tick
    conn = self.connections.get_conn(self.socket)
  File "/usr/lib/python3.6/dist-packages/cheroot/connections.py", line 142, in get_conn
    rlist, _, _ = select.select(list(socket_dict), [], [], 0.1)
ValueError: filedescriptor out of range in select()

The Dashboard and Prometheus modules in the MGR hang while the above error repeats at a very high rate.
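For context, the ValueError in the traceback comes from a limit of select() itself: it only accepts descriptor numbers below FD_SETSIZE (commonly 1024 on Linux), so a process that holds many open descriptors can end up with sockets numbered too high to pass to it, which is presumably what happens in the MGR here. The sketch below shows the rejection and a selectors-based way of waiting on sockets that does not have this limit on Linux. The helper names select_rejects and readable are made up for illustration, a POSIX platform is assumed, and this is not cheroot's actual code.

```python
import select
import selectors
import socket


def select_rejects(fd: int) -> bool:
    """Return True if select.select() refuses this descriptor number.

    Assumes a POSIX platform, where Python checks the number against
    FD_SETSIZE before calling the underlying select().
    """
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        return True   # number is out of range for select()
    except OSError:
        return False  # number is in range, but no such descriptor is open
    return False


def readable(socks, timeout=0.1):
    """Return the sockets in `socks` that are ready to read.

    selectors.DefaultSelector picks epoll/kqueue/poll where available,
    none of which is bounded by FD_SETSIZE. On platforms that only offer
    select(), it falls back to select() and the limit still applies.
    """
    with selectors.DefaultSelector() as sel:
        for s in socks:
            sel.register(s, selectors.EVENT_READ)
        return [key.fileobj for key, _ in sel.select(timeout)]


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.send(b"x")
    print("select() rejects fd 100000:", select_rejects(100_000))
    print("socket a is readable:", readable([a]) == [a])
    a.close()
    b.close()
```

The descriptor number 100000 is arbitrary; any number at or above the platform's FD_SETSIZE should behave the same way.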

Updating "cheroot" to a more recent version (>= 8.4.0) solves the issue (the dashboard and prometheus modules have to be disabled and re-enabled).
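To see whether a given environment already has a new enough cheroot, something like the following could be run with the same Python interpreter the MGR uses. This is only a sketch: the 8.4.0 threshold is the one quoted in this report, the function names are hypothetical, and importlib.metadata needs Python 3.8 or later (on older interpreters the version string would have to be obtained another way and passed in explicitly).

```python
import re
from importlib import metadata

MIN_FIXED = (8, 4, 0)  # threshold quoted in this report


def parse_version(text):
    """Turn a version string such as '8.5.1' into a tuple like (8, 5, 1).

    Only the leading digits of the first three dot-separated fields are
    used, so suffixes such as 'rc1' are ignored (a simplification).
    """
    nums = []
    for piece in text.split(".")[:3]:
        m = re.match(r"\d+", piece)
        nums.append(int(m.group()) if m else 0)
    while len(nums) < 3:
        nums.append(0)
    return tuple(nums)


def cheroot_is_fixed(version=None):
    """Return True if the given (or installed) cheroot is >= MIN_FIXED."""
    if version is None:
        try:
            version = metadata.version("cheroot")
        except metadata.PackageNotFoundError:
            return False  # cheroot is not installed for this interpreter
    return parse_version(version) >= MIN_FIXED


if __name__ == "__main__":
    print("cheroot new enough:", cheroot_is_fixed())
```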

Related "cheroot" issue on GitHub: https://github.com/cherrypy/cheroot/issues/249

History

#1 Updated by Christophe Trussardi over 3 years ago

According to https://github.com/ceph/ceph-container/issues/1748, this will have to wait for an upgraded Cheroot in RHEL / CentOS environments.

You can close this issue.

#2 Updated by Ken Dreyer over 3 years ago

Thanks for reporting this issue.

I've pushed an update to Fedora at https://src.fedoraproject.org/rpms/python-cheroot/pull-request/3 . If I built this for el8 in a side repo, would you be willing to test that new package out?

#3 Updated by Ken Dreyer over 3 years ago

I've published an el8 RPM at https://fedorapeople.org/~ktdreyer/bz1868629/ for early testing. I can bring up a "hello world" cherrypy app with this. If you have a chance to test, any feedback is helpful.

#4 Updated by David Orman over 3 years ago

Ken Dreyer wrote:

I've published an el8 RPM at https://fedorapeople.org/~ktdreyer/bz1868629/ for early testing. I can bring up a "hello world" cherrypy app with this. If you have a chance to test, any feedback is helpful.

Thank you for the expedient update!

We have rebuilt the container images of 15.2.7 with this RPM applied, and will be deploying it to a larger (504 OSD) cluster to test - this cluster had the issue previously until we disabled polling via Prometheus, which we cannot do on production clusters. We would suggest modifying this bug to high priority, as it prevents production deployments on clusters large enough to trigger this issue. We will update as soon as it's run for a day or two and we've been able to verify the mgr issues we saw no longer occur after extended polling via external and internal prometheus instances.

#5 Updated by Ken Dreyer over 3 years ago

OK, great. The official epel8 branch is going to take a little longer to update. I've pushed https://src.fedoraproject.org/rpms/python-cheroot/pull-request/4 to make it build on Fedora. Next we can merge the few epel8 divergences into the Rawhide branch (master), merge master into epel8, and then build and ship to https://fedoraproject.org/wiki/EPEL/testing

#6 Updated by Ken Dreyer about 3 years ago

The final dist-git change to bring 8.5.1 to epel8 is https://src.fedoraproject.org/rpms/python-cheroot/pull-request/10

#7 Updated by Ken Dreyer about 3 years ago

The epel8 build is headed to epel-testing now at https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-e204b1c101

#8 Updated by A. Saber Shenouda about 3 years ago

When is this expected to reach the Docker container image? We are also affected by this bug.
