mgr/prometheus : ENGINE Error in HTTPServer.tick
In a large Ceph cluster (Docker image ceph/daemon:v5.0.3-stable-5.0-octopus-centos-8) with more than 1000 OSDs, the Prometheus module in the MGR loops on a Python 3 "cheroot" error:
ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python3.6/dist-packages/cheroot/server.py", line 1770, in serve
    self.tick()
  File "/usr/lib/python3.6/dist-packages/cheroot/server.py", line 1993, in tick
    conn = self.connections.get_conn(self.socket)
  File "/usr/lib/python3.6/dist-packages/cheroot/connections.py", line 142, in get_conn
    rlist, _, _ = select.select(list(socket_dict), [], [], 0.1)
ValueError: filedescriptor out of range in select()
The Dashboard and Prometheus modules in the MGR hung while the above error looped at a very high rate.
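For context, the ValueError is raised by Python's select.select(), which rejects any file descriptor numbered at or above the platform's FD_SETSIZE (commonly 1024 on Linux). An MGR serving a cluster of this size presumably holds enough open descriptors for a new socket to land above that limit. The following is a minimal sketch of that behaviour, not cheroot's code: the helper name select_accepts is hypothetical, and the descriptor number 1000000 is an arbitrary value chosen to be well above the limit.

```python
import select
import socket


def select_accepts(fd: int) -> bool:
    """Return False if select.select() rejects fd as out of range.

    Hypothetical helper for illustration only.
    """
    try:
        # Timeout 0 makes this a non-blocking poll of a single descriptor.
        select.select([fd], [], [], 0)
    except ValueError:
        # "filedescriptor out of range in select()": fd >= FD_SETSIZE.
        return False
    except OSError:
        # fd is in range but not open; the range check itself passed.
        return True
    return True


if __name__ == "__main__":
    left, right = socket.socketpair()
    try:
        # A freshly opened socket in a small process has a low fd number.
        print("real socket accepted:", select_accepts(left.fileno()))
        # An fd number far above FD_SETSIZE is refused before any syscall.
        print("fd 1000000 accepted:", select_accepts(1000000))
    finally:
        left.close()
        right.close()
```

Python's standard selectors module picks epoll or poll where available, and those interfaces do not have the FD_SETSIZE restriction, which is why moving off select() avoids this class of error.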
Updating "cheroot" to a more recent version (>= 8.4.0) solves the issue (the dashboard and prometheus modules have to be disabled and re-enabled).
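To check whether a given environment already has a new enough cheroot, something like the sketch below could be used. It is only an illustration under stated assumptions: the function names are hypothetical, the version parsing is deliberately naive (it compares numeric components only and ignores pre-release suffixes), and installed_cheroot_version relies on importlib.metadata, which needs Python 3.8 or later, so it returns None on the Python 3.6 shown in the traceback.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading numeric components of a dotted version string."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first component with no leading digits
        parts.append(int(digits))
    return tuple(parts)


def version_at_least(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_cheroot_version() -> Optional[str]:
    """Return the installed cheroot version, or None if it cannot be found."""
    try:
        from importlib import metadata  # assumption: Python 3.8+
    except ImportError:
        return None
    try:
        return metadata.version("cheroot")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_cheroot_version()
    if found is None:
        print("cheroot version could not be determined")
    else:
        print(found, "ok" if version_at_least(found, "8.4.0") else "too old")
```

The comparison is done on integer tuples rather than strings so that, for example, 10.0.0 sorts above 8.4.0.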
Related "cheroot" issue on GitHub: https://github.com/cherrypy/cheroot/issues/249
#1 Updated by Christophe Trussardi about 3 years ago
According to https://github.com/ceph/ceph-container/issues/1748, this will have to wait for an upgraded Cheroot in RHEL / CentOS environments.
You can close this issue.
#2 Updated by Ken Dreyer almost 3 years ago
Thanks for reporting this issue.
I've pushed an update to Fedora at https://src.fedoraproject.org/rpms/python-cheroot/pull-request/3 . If I built this for el8 in a side repo, would you be willing to test that new package out?
#4 Updated by David Orman almost 3 years ago
Ken Dreyer wrote:
I've published an el8 RPM at https://fedorapeople.org/~ktdreyer/bz1868629/ for early testing. I can bring up a "hello world" cherrypy app with this. If you have a chance to test, any feedback is helpful.
Thank you for the expedient update!
We have rebuilt the 15.2.7 container images with this RPM applied and will deploy them to a larger (504 OSD) cluster for testing. This cluster hit the issue previously until we disabled polling via Prometheus, which we cannot do on production clusters. We would suggest raising this bug to high priority, as it prevents production deployments on clusters large enough to trigger it. We will update once it has run for a day or two and we have verified that the mgr issues we saw no longer occur after extended polling via external and internal Prometheus instances.
#5 Updated by Ken Dreyer almost 3 years ago
Ok great. The official epel8 branch is going to take a little longer to update. I've pushed https://src.fedoraproject.org/rpms/python-cheroot/pull-request/4 to make it build on Fedora, and then we can merge the few epel8 divergences into the Rawhide branch (master), and then we'll merge master into epel8, and then build and ship to https://fedoraproject.org/wiki/EPEL/testing
#6 Updated by Ken Dreyer almost 3 years ago
The final dist-git change to bring 8.5.1 to epel8 is https://src.fedoraproject.org/rpms/python-cheroot/pull-request/10
#7 Updated by Ken Dreyer almost 3 years ago
The epel8 build is headed to epel-testing now at https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-e204b1c101