Project

General

Profile

Bug #55837

mgr/dashboard: After several days of not being used, Dashboard HTTPS website hangs during loading, with no errors

Added by Zach Heise 6 months ago. Updated 3 months ago.

Status:
Need More Info
Priority:
Normal
Category:
General - Back-end
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific quincy
Regression:
No
Severity:
4 - irritation
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description of problem

I have noticed that every 2-3 days, in the morning when I start work, our ceph dashboard page does not respond in the browser. It works fine throughout the day, but it seems like after a certain unknown hours without anyone accessing it. Something must be going wrong with the dashboard module, or mgr daemon, because when I try to load (or refresh when it's already loaded) the ceph dashboard site, the browser just does the “throbber": spinning/loading on the page's favicon – no content on the page ever appears, no errors or anything. None of the buttons on the page load – nor time out and show a 404 – for example, Block\Images or Cluster\Hosts in the left sidebar will load, but show empty. And the throbber never stops.

I can easily fix it with ceph mgr module disable dashboard and then waiting 10 seconds, then ceph mgr module enable dashboard – this makes it start working again, until the next time I go a few days without using the dashboard, at which point I need to do the same process again. So this is an 'irritation' level bug.

Environment

  • ceph version string: 16.2.4 through 16.2.7
  • Platform (OS/distro/release): Linux/Alma/8.5 Kernel is 5.4.176-1.el8.elrepo.x86_64.
  • Cluster details (nodes, monitors, OSDs): 5 nodes, 4 monitors, 33 OSDs
  • Did it happen on a stable environment or after a migration/upgrade?: this has been an issue for the entirety of this cluster's existence
  • Browser used (e.g.: Version 86.0.4240.198 (Official Build) (64-bit)): Browsers tested in: Windows Firefox, Windows MS Edge, and Brave browser. But also, lack of loading can be confirmed with curl from linux:
    At the request of redhat employee Ernesto Puerta, I also ran curl against the HTTPS port both to the actual IP of the mgr daemon, and to its local address of 127.0.0.1:8443. Both curl attempts just result in the CLI cursor blinking forever, with no output. When attempting to run curl against an unused port - 127.0.0.1:8444, "connection refused" immediately occurs (as expected).

How reproducible

Steps:

  1. Activate HTTPS mode for dashboard
  2. Don't have any user activity (HTTP GET) for 48+ hours
  3. Attempt to load the dashboard page

Actual results

When I’m in this hanging state, I check the cephadm logs with cephadm logs --name mgr.ceph01.fblojp -- -f but there’s nothing obvious (to my untrained eyes at least). When the dashboard is functional, I can see my own navigation around the dashboard in the logs so I know that logging is working:

Nov 01 15:46:32 ceph01.domain conmon5814: debug 2021-11-01T20:46:32.601+0000 7f7cbb42e700 0 [dashboard INFO request] [10.130.50.252:52267] [GET] [200] [0.013s] [admin] [1.0K] /api/summary

Expected results

Functioning dashboard

Additional info

I confirmed that after switching my dashboard's mode to HTTP only on 8080 for the past several months, this issue does NOT occur, so it must be something related to HTTPS
all nodes have 2x NICs that are bonded as of now. But this issue with the dashboard started before we performed the bonding.
Final note: as part of my testing, I replaced the built-in self-signed SSL certs, with ones generated from my windows PKI infrastructure. It made no difference whatsoever.

History

#1 Updated by Ernesto Puerta 6 months ago

  • Category set to General - Back-end
  • Status changed from New to Need More Info
  • Assignee set to Ernesto Puerta
  • Target version set to v18.0.0
  • Source set to Community (user)
  • Backport set to pacific quincy
  • Affected Versions v16.2.8, v17.0.0, v18.0.0 added

#2 Updated by Ernesto Puerta 6 months ago

This has been currently reproduced by another Ceph user, who coincidentally is running security scanner tools (e.g.: Qualys, Tanium, ...), so this perfectly matches the description of this cheroot user who reported that their Cheroot webserver stopped responding.

The rationale would be the following:
  • Security scanning tools try to apply known exploits to network services, including malformed SSL handshake messages (known as SSL/TLS DDoS Exhaustion attacks).
  • Cheroot using the builtin ssl module might not properly release the connection under this situation leading the server to run out of idle worker threads.
Workaround and fixes:
  • Applying the patch above mentioned is reported to resolve the issue.
  • Improve error logging (enable cherrypy.log.error_log.propagate = True in module.py)
  • Improve idle thread reporting (cherrypy.server.httpserver.stats['Threads Idle']())
  • Moving from Python builtin module to PyOpenSSL ('server.ssl_module':'pyopenssl')

#3 Updated by Zach Heise 6 months ago

Unfortunately, simply setting cherrypy.log.error_log.propagate = True in module.py file on the active mgr, and applying the code at github (then disabling/enabling the dashboard) did not fix the issue.

Ernesto - I'm intrigued by your last bullet point, moving from python's builtin module to PyOpenSSL - that might also fix the issue as it removes CherryPy from the equation?

The second question is how would I get these changes to apply across new mgr daemons that would be spun-up in the future. Right now I'm manually editing the code on existing mgr daemons which seems quite brittle in a containerized environment.

Thanks!

#4 Updated by Ernesto Puerta 5 months ago

Zach Heise wrote:

Unfortunately, simply setting cherrypy.log.error_log.propagate = True in module.py file on the active mgr, and applying the code at github (then disabling/enabling the dashboard) did not fix the issue.

:(

Ernesto - I'm intrigued by your last bullet point, moving from python's builtin module to PyOpenSSL - that might also fix the issue as it removes CherryPy from the equation?

Hi Zach,

Sorry for the delay. I missed this one.

It looks like that PyOpenSSL was there only for Python 2.6 support, and no longer working.

The second question is how would I get these changes to apply across new mgr daemons that would be spun-up in the future. Right now I'm manually editing the code on existing mgr daemons which seems quite brittle in a containerized environment.

I'd suggest that you reduce the mgr count to 1 (from cephadm cli or Dashboard Services section). That will save you from having to copy the code to multiple locations.

Once we find the issue, if it's on the Ceph side code, we'll try to deliver a fix asap to the active releases. If it's on the Cherrypy/Cheroot side we'll refresh those packages for the community images.

Did you have the chance to test the following patch to list the number of idle threads?:

diff --git a/src/pybind/mgr/dashboard/tools.py b/src/pybind/mgr/dashboard/tools.py
index 4e4837d9323..92bffa32c61 100644
--- a/src/pybind/mgr/dashboard/tools.py
+++ b/src/pybind/mgr/dashboard/tools.py
@@ -113,14 +113,15 @@ class RequestLoggingTool(cherrypy.Tool):
         else:
             length = self._format_bytes(0)
         if user:
-            logger_fn("[%s:%s] [%s] [%s] [%s] [%s] [%s] %s", req.remote.ip,
+            logger_fn("[%s:%s] [%s] [%s] [%s] [%s] [%s] %s Queue=%r Threads_Idle=%r/%r", req.remote.ip,
                       req.remote.port, req.method, status,
-                      "{0:.3f}s".format(lat), user, length, req.path_info)
+                      "{0:.3f}s".format(lat), user, length, req.path_info,
+                      cherrypy.server.httpserver.stats['Queue'](None), cherrypy.server.httpserver.stats['Threads Idle'](None), cherrypy.server.httpserver.stats['Threads'](None))
         else:
-            logger_fn("[%s:%s] [%s] [%s] [%s] [%s] [%s] %s", req.remote.ip,
+            logger_fn("[%s:%s] [%s] [%s] [%s] [%s] [%s] %s Queue=%r Threads_Idle=%r/%r", req.remote.ip,
                       req.remote.port, req.method, status,
-                      "{0:.3f}s".format(lat), length, getattr(req, 'unique_id', '-'), req.path_info)
-
+                      "{0:.3f}s".format(lat), length, getattr(req, 'unique_id', '-'), req.path_info,
+                      cherrypy.server.httpserver.stats['Queue'](None), cherrypy.server.httpserver.stats['Threads Idle'](None), cherrypy.server.httpserver.stats['Threads'](None))

 # pylint: disable=too-many-instance-attributes
 class ViewCache(object):

You should be able to apply that to .../mgr/dashboard/tools.py file:

patch .../dashboard/tools.py log_idle_threads.patch

You'll need to disable-enable the dashboard after that to reload changes.

#5 Updated by Zach Heise 4 months ago

Good afternoon! I've edited the active tools.py file in my active mgr's container with the patch changes. Now to wait a few days for the Qualys scanner to overload the server like it always done, then of course disabled/enabled the dashboard module.

From what little of the code I've just added I can understand, this shouldn't actually FIX anything, right? This is just tracking idle threads in the cherrypy.server.httpserver.stats database(?) so that we can keep track of what caused the server to be overloaded?

Assuming that things hang as usual in 48+ hours from now, what new information can I extract from the mgr daemon at that point to share here, that will likely prove that our campus qualys scanner is the cause, so that a patch can be written?

#6 Updated by Zach Heise 4 months ago

As expected, after this weekend the dashboard is unresponsive. But hopefully now with the edits to tools.py there should be more logging of idle threads - however, I'm not certain how to read the code in tools.py to figure out how to view the new "threads idle" and "queue" entries that, apparently, this new code is adding to cherrypy.server.httpserver.stats - is there some way i can view and export the results of the changes?

#7 Updated by Zach Heise 3 months ago

Okay, so after activating logging to file for my cluster, I have been watching the log files for the ceph-mgr on ceph03. As of last night, grepping for "Idle" in the the log file has one result. The IP address mentioned is my campus's Qualys pen testing server.

ceph03> zcat ceph-mgr.ceph03.futetp.log-20220830.gz | grep Idle
2022-08-29T11:45:51.167+0000 7f6944487700  0 [dashboard INFO request] [::ffff:128.104.53.175:36464] [GET] [200] [0.004s] [1.0K] [d94f2185-5e88-4db9-98af-648f88558cde] / Queue=0 Threads_Idle=9/10
2022-08-29T11:45:51.184+0000 7f6943485700  0 [dashboard INFO request] [::ffff:128.104.53.175:36474] [GET] [200] [0.003s] [1.0K] [9195733d-f43c-411a-8751-c4085484b129] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:51.204+0000 7f6944c88700  0 [dashboard INFO request] [::ffff:128.104.53.175:36478] [GET] [200] [0.004s] [1.0K] [7a9f4d77-9d64-4784-81ef-cd6b37fa4366] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.377+0000 7f6941481700  0 [dashboard INFO request] [::ffff:128.104.53.175:37628] [GET] [200] [0.004s] [576.0B] [6f0dfe36-dc4c-4517-9971-0e4d67493b1b] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.386+0000 7f6943c86700  0 [dashboard INFO request] [::ffff:128.104.53.175:37630] [POST] [415] [0.004s] [176.0B] [12605b4f-9080-4ceb-8b3f-7d390b02dcb3] /sdk/vimService Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.410+0000 7f6941c82700  0 [dashboard INFO request] [::ffff:128.104.53.175:37640] [GET] [200] [0.003s] [1.0K] [d48b5336-1717-4312-8e0e-18cb50e231c7] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.418+0000 7f694047f700  0 [dashboard INFO request] [::ffff:128.104.53.175:37642] [GET] [200] [0.003s] [1.0K] [3b0f9de1-d6c4-4c55-bf3b-5a5747b19025] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.508+0000 7f6944c88700  0 [dashboard INFO request] [::ffff:128.104.53.175:37664] [GET] [200] [0.003s] [1.0K] [c641e9ca-bb9c-4559-97bd-37f520deea6e] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.570+0000 7f6943c86700  0 [dashboard INFO request] [::ffff:128.104.53.175:37678] [GET] [200] [0.003s] [1.0K] [24fcc4b4-da0b-400b-aea7-b2a3660022d3] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.759+0000 7f6941c82700  0 [dashboard INFO request] [::ffff:128.104.53.175:37687] [GET] [404] [0.003s] [513.0B] [660f987e-3eb2-4226-b86e-b91e4c30523f] /bea_wls_internal/classes/weblogic/common/internal/VersionInfoFactory$VERSION_INFO_SINGLETON.class Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.767+0000 7f6940c80700  0 [dashboard WARNING request] [::ffff:128.104.53.175:37690] [GET] [403] [0.004s] [257.0B] [f7008332-73dd-461b-bd1c-e3b2b29e4d88] /../../../../../../../../../../../ Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.776+0000 7f6944487700  0 [dashboard INFO request] [::ffff:128.104.53.175:37704] [GET] [404] [0.006s] [513.0B] [18cc599d-db8c-44c1-a82e-ceb8cd6ed901] /ws Queue=0 Threads_Idle=7/10
2022-08-29T11:45:55.777+0000 7f6942c84700  0 [dashboard INFO request] [::ffff:128.104.53.175:37687] [GET] [404] [0.004s] [513.0B] [bce491b6-7379-497c-8358-71c2a93656a9] /bea_wls_internal/classes/weblogic/common/internal/VersionInfoFactory$VERSION_INFO_SINGLETON.class Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.879+0000 7f6941481700  0 [dashboard INFO request] [::ffff:128.104.53.175:37780] [ABCD] [200] [0.003s] [1.0K] [675ca3f2-2a48-4b82-b85f-fd68b8fb7f92] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:55.901+0000 7f6941c82700  0 [dashboard INFO request] [::ffff:128.104.53.175:37788] [GET] [404] [0.003s] [513.0B] [b11336d5-0043-4cbe-910e-6aae9230ec78] /<script>alert(53416)</script> Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.034+0000 7f6944487700  0 [dashboard INFO request] [::ffff:128.104.53.175:37838] [GET] [200] [0.003s] [1.0K] [0fa96301-c901-4d7b-a97d-641d29d6bd28] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.059+0000 7f6943485700  0 [dashboard INFO request] [::ffff:128.104.53.175:37846] [GET] [404] [0.003s] [513.0B] [53f62225-9e3c-4ef8-9054-f2cbe2869d87] /py3dvmdi4faeoypje7h Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.082+0000 7f6941481700  0 [dashboard INFO request] [::ffff:128.104.53.175:37864] [GET] [200] [0.003s] [1.0K] [9aa29d44-fef0-47f6-9a48-e7f5d24fd032] /index.html Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.098+0000 7f6941c82700  0 [dashboard INFO request] [::ffff:128.104.53.175:37884] [GET] [200] [0.002s] [1.0K] [c8cf2962-60af-4f3e-ba77-ebb1c55c4554] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.114+0000 7f6940c80700  0 [dashboard INFO request] [::ffff:128.104.53.175:37894] [GET] [200] [0.003s] [1.0K] [41e0dad9-b5dd-4d4e-b292-c983f2e2f850] /index.html Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.124+0000 7f6942c84700  0 [dashboard INFO request] [::ffff:128.104.53.175:37906] [QUALYS] [200] [0.003s] [1.0K] [5cf9d3d6-1839-4d96-8d1c-58cffdebe8a8] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.137+0000 7f6944c88700  0 [dashboard INFO request] [::ffff:128.104.53.175:37920] [OPTIONS] [200] [0.005s] [1.0K] [2bd9373d-4244-480d-b970-71d52aa2f6a8] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.149+0000 7f6943c86700  0 [dashboard INFO request] [::ffff:128.104.53.175:37930] [TRACE] [200] [0.003s] [1.0K] [37bbce49-303f-42d9-9c76-c4929fede180] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.161+0000 7f694047f700  0 [dashboard INFO request] [::ffff:128.104.53.175:37942] [PROPFIND] [200] [0.003s] [1.0K] [fb405346-6361-48a5-8785-de284a813513] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.267+0000 7f6943485700  0 [dashboard INFO request] [::ffff:128.104.53.175:38010] [GET] [200] [0.002s] [1.0K] [64011c64-95de-4c6a-9ce6-2319b3f8259e] /index.html Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.410+0000 7f6944c88700  0 [dashboard INFO request] [::ffff:128.104.53.175:38086] [GET] [404] [0.003s] [513.0B] [80e8ac21-49b5-438e-bda5-6283cbe130db] /index.html.............. Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.424+0000 7f6943c86700  0 [dashboard INFO request] [::ffff:128.104.53.175:38096] [GET] [404] [0.003s] [513.0B] [4cf867a6-ea12-42bb-bf26-d6102bb91618] /py3dvmdi4faeoypje7h Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.458+0000 7f694047f700  0 [dashboard INFO request] [::ffff:128.104.53.175:38138] [PROPFIND] [200] [0.003s] [1.0K] [f0100ff2-432d-4e90-82f1-cb47809f8751] /index.html Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.480+0000 7f6944487700  0 [dashboard INFO request] [::ffff:128.104.53.175:38172] [GET] [200] [0.003s] [1.0K] [a1ef73f9-c46f-444b-bf9f-bbb5112fe9df] /index.html Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.506+0000 7f6943485700  0 [dashboard INFO request] [::ffff:128.104.53.175:38198] [BDMT] [200] [0.003s] [1.0K] [524eded9-a7c7-4440-98f3-57a5e54164c3] / Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.536+0000 7f6941481700  0 [dashboard INFO request] [::ffff:128.104.53.175:38212] [GET] [404] [0.003s] [513.0B] [e246f1a4-b85a-4d14-b675-316f8a103c14] /index.html/.. Queue=0 Threads_Idle=8/10
2022-08-29T11:45:56.552+0000 7f6941c82700  0 [dashboard INFO request] [::ffff:128.104.53.175:38228] [HEAD] [200] [0.003s] [1.0K] [f8bc187d-2f46-4979-bf77-28da22bb30a2] /index.html Queue=0 Threads_Idle=8/10
2022-08-29T11:46:21.375+0000 7f6944c88700  0 [dashboard INFO request] [::ffff:128.104.53.175:44158] [GET] [404] [0.004s] [513.0B] [15f599c6-8963-4f56-bbfe-66d43f5dc91d] /\./index.html Queue=0 Threads_Idle=9/10
2022-08-29T11:46:31.391+0000 7f6941481700  0 [dashboard WARNING request] [::ffff:128.104.53.175:46820] [GET] [403] [0.003s] [257.0B] [d67a5e6a-6611-413c-8d50-7adaf5e999c7] /../index.html Queue=0 Threads_Idle=9/10
2022-08-29T11:46:51.453+0000 7f6944487700  0 [dashboard INFO request] [::ffff:128.104.53.175:33682] [QUALYS] [200] [0.003s] [1.0K] [534ad5ca-fc11-42f5-b4ac-c131be63739f] / Queue=0 Threads_Idle=8/10
2022-08-29T11:46:51.488+0000 7f6943485700  0 [dashboard INFO request] [::ffff:128.104.53.175:33726] [PROPFIND] [200] [0.003s] [1.0K] [288863eb-d06a-4bc1-9da8-aafb96525086] / Queue=0 Threads_Idle=9/10
2022-08-29T11:46:51.502+0000 7f6941481700  0 [dashboard INFO request] [::ffff:128.104.53.175:33790] [GET] [404] [0.004s] [513.0B] [0ec0e6e9-5222-455b-b145-d71c9e847469] /py3dvmdi4faeoypje7h/../index.html Queue=0 Threads_Idle=9/10
2022-08-29T11:46:51.566+0000 7f6941c82700  0 [dashboard INFO request] [::ffff:128.104.53.175:33826] [SEARCH] [200] [0.003s] [1.0K] [666d6d1d-e574-4a95-b9a7-aeacaff2cb8a] / Queue=0 Threads_Idle=9/10
2022-08-29T11:46:51.635+0000 7f6944487700  0 [dashboard INFO request] [::ffff:128.104.53.175:33980] [HEAD] [200] [0.003s] [1.0K] [4ace2418-5d16-4ab4-9c7d-03af0872a4ca] /index.html Queue=0 Threads_Idle=9/10
2022-08-29T11:46:51.671+0000 7f6942c84700  0 [dashboard INFO request] [::ffff:128.104.53.175:34064] [OPTIONS] [200] [0.003s] [1.0K] [d835b5b1-8458-4cba-8840-360d88071ad9] /index.html Queue=0 Threads_Idle=9/10
2022-08-29T11:46:51.726+0000 7f6944c88700  0 [dashboard INFO request] [::ffff:128.104.53.175:34136] [GET] [200] [0.002s] [1.0K] [79d506d0-8e19-4068-ba8e-293090519d09] /index.html Queue=0 Threads_Idle=9/10
2022-08-29T11:46:51.787+0000 7f6943c86700  0 [dashboard INFO request] [::ffff:128.104.53.175:34216] [GET] [404] [0.003s] [513.0B] [2c05a44f-f350-4ba4-8ee4-b574ddeb28f9] /\./index.html Queue=0 Threads_Idle=9/10
2022-08-29T11:46:51.866+0000 7f6941c82700  0 [dashboard INFO request] [::ffff:128.104.53.175:34296] [GET] [200] [0.005s] [1.0K] [eccabc2d-abaa-4087-8ca2-2c3a40b32902] /index.html Queue=0 Threads_Idle=9/10

That's the most recent .gz of the log files, 202208430. Looking at the logs for the previous few days since before the weekend (20220829, 20220828, 20220827) with the same grep, comes back with nothing. The next thing in the logs is from Thursday, when I did one last test of the Dashboard to make sure it could still be loaded from my own workstation:

ceph03> zcat ceph-mgr.ceph03.futetp.log-20220826.gz | grep Idle
2022-08-25T17:44:12.230+0000 7f6944487700  0 [dashboard INFO request] [::ffff:10.130.50.217:56256] [GET] [304] [0.004s] [0.0B] [80d4520d-eeb4-4e13-be49-7fb4fdcc70ff] / Queue=0 Threads_Idle=9/10
2022-08-25T17:44:12.291+0000 7f6942483700  0 [dashboard INFO request] [::ffff:10.130.50.217:56256] [GET] [304] [0.003s] [0.0B] [2b0b1a20-6aa5-4874-a370-a8daf43a9807] /styles.0520b6222fe6dab25bab.css Queue=0 Threads_Idle=9/10
2022-08-25T17:44:12.302+0000 7f6943485700  0 [dashboard INFO request] [::ffff:10.130.50.217:56257] [GET] [304] [0.003s] [0.0B] [46f3db3f-86ba-436f-94fd-b075b68c08e4] /polyfills.69188bf73a1e0d939338.js Queue=0 Threads_Idle=9/10
2022-08-25T17:44:12.317+0000 7f6944c88700  0 [dashboard INFO request] [::ffff:10.130.50.217:56256] [GET] [304] [0.004s] [0.0B] [2f4f533b-ad39-430a-af20-1f033108b5e1] /runtime.fc090c2eb8af922526a6.js Queue=0 Threads_Idle=8/10
2022-08-25T17:44:12.317+0000 7f6941481700  0 [dashboard INFO request] [::ffff:10.130.50.217:56258] [GET] [304] [0.007s] [0.0B] [31fe0958-b58b-4488-b99b-0ad5e025625f] /scripts.6bda3fa7e09a87cd4228.js Queue=0 Threads_Idle=9/10
2022-08-25T17:44:12.326+0000 7f6943c86700  0 [dashboard INFO request] [::ffff:10.130.50.217:56257] [GET] [304] [0.003s] [0.0B] [b51f49f7-bc47-4b84-83e6-5801cd489d3e] /main.d269a7c492a93e2ebedb.js Queue=0 Threads_Idle=8/10
2022-08-25T17:44:12.500+0000 7f6941c82700  0 [dashboard INFO request] [::ffff:10.130.50.217:56258] [GET] [304] [0.013s] [0.0B] [20902dca-82b5-478c-8a5a-7edcec4e674b] /5.f6931a7617fe7f8fd244.js Queue=0 Threads_Idle=5/10
2022-08-25T17:44:12.501+0000 7f694047f700  0 [dashboard INFO request] [::ffff:10.130.50.217:56260] [GET] [200] [0.009s] [29.0B] [45185130-4b37-4f3d-a855-916b5728ee37] /ui-api/langs Queue=0 Threads_Idle=5/10
2022-08-25T17:44:12.501+0000 7f6940c80700  0 [dashboard INFO request] [::ffff:10.130.50.217:56256] [GET] [304] [0.012s] [0.0B] [b2d29a83-a372-4675-aa66-5a1ac1ba8cd6] /6.e0fafffe422f8212d682.js Queue=0 Threads_Idle=6/10
2022-08-25T17:44:12.503+0000 7f6944487700  0 [dashboard INFO request] [::ffff:10.130.50.217:56257] [GET] [304] [0.016s] [0.0B] [8aa6385d-19b2-48f6-8efc-5d9b8a1e2e77] /1.e0a29c1d4fcf893cf11d.js Queue=0 Threads_Idle=7/10
2022-08-25T17:44:12.503+0000 7f6942c84700  0 [dashboard INFO request] [::ffff:10.130.50.217:56259] [POST] [200] [0.006s] [74.0B] [8f95fac3-73b9-44c2-8a27-19b5b22729d1] /api/auth/check Queue=0 Threads_Idle=9/10
2022-08-25T17:44:12.516+0000 7f6942483700  0 [dashboard INFO request] [::ffff:10.130.50.217:56261] [GET] [304] [0.006s] [0.0B] [5dd0e60d-9893-4cf1-a1c2-e5d71b965e71] /assets/Ceph_Ceph_Logo_with_text_white.svg Queue=0 Threads_Idle=8/10
2022-08-25T17:44:12.516+0000 7f6943485700  0 [dashboard INFO request] [::ffff:10.130.50.217:56258] [GET] [304] [0.003s] [0.0B] [438ccbeb-cafd-4b4b-8621-c03295b51770] /ceph_background.e82dd79127290ddbe8cb.gif Queue=0 Threads_Idle=8/10
2022-08-25T17:44:12.538+0000 7f6944c88700  0 [dashboard INFO request] [::ffff:10.130.50.217:56258] [GET] [304] [0.003s] [0.0B] [a6ce6f5e-141d-4bf5-b0a4-109a7d9d7e70] /forkawesome-webfont.7c20758e3e7c7dff7c8d.woff2 Queue=0 Threads_Idle=9/10
2022-08-25T17:44:12.561+0000 7f6941481700  0 [dashboard INFO request] [::ffff:10.130.50.217:56258] [GET] [304] [0.003s] [0.0B] [fcfebdfd-a522-4c19-8f29-99cb5aae7988] /favicon.ico Queue=0 Threads_Idle=9/10

Are these of any help? I've certainly heard of HTTP 'GET' and 'POST' commands but these other ones that the Qualys server is using are completely unknown to me. One thing that didn't seem to happen is "Threads_Idle" was never exhausted down to zero idle threads.

At this moment, attempting to load the Dashboard on port 8443 results in the typical post-weekend "spinning forever" result. And when I grep the current log file for "Idle" or "dashboard" nothing comes back at all.

#8 Updated by Ernesto Puerta 3 months ago

Hi Zach,

The fact that prior to the hang you didn't get any traces indicating the exhaustion of idle worker threads (e.g.: 0/10, and a non-zero Request Queue) seems to point to a dead end on that track.

So the next step would be to take a dump of the tracebacks of each Python thread. That way we'd know where exactly things are stuck (unless the block is happening in the non-Python arena, like inside an OpenSSL call).

For that, you should update the ceph-mgr containers with the following patch:

diff --git a/src/pybind/mgr/dashboard/module.py b/src/pybind/mgr/dashboard/module.py
index 946c853f880..79540211983 100644
--- a/src/pybind/mgr/dashboard/module.py
+++ b/src/pybind/mgr/dashboard/module.py
@@ -388,6 +388,14 @@ class Module(MgrModule, CherryPyConfig):
             self.set_store(item_key, inbuf)
         return 0, f'SSL {item_label} updated', ''

+    @CLIReadCommand("dashboard dump threads")
+    def dump_threads(self):
+        import traceback
+        output = [
+            f'{th}:\n{"".join(traceback.format_stack(sys._current_frames()[th.ident]))}'
+            for th in threading.enumerate() if th.ident in sys._current_frames()]
+        return HandleCommandResult(stdout='\n'.join(output))
+
     @CLIWriteCommand("dashboard set-ssl-certificate")
     def set_ssl_certificate(self, mgr_id: Optional[str] = None, inbuf: Optional[str] = None):
         return self._set_ssl_item('certificate', 'crt', mgr_id, inbuf)

After disabling/enabling the Dashboard module, you should be able to run a new CLI command:

# ceph dashboard dump threads
<_DummyThread(Dummy-1, started daemon 140067733673728)>:
  File "/ceph/src/pybind/mgr/mgr_module.py", line 1734, in _handle_command
    return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/ceph/src/pybind/mgr/mgr_module.py", line 462, in call
    return self.func(mgr, **kwargs)
  File "/ceph/src/pybind/mgr/dashboard/module.py", line 425, in dump_threads
    for th in threading.enumerate() if th.ident in sys._current_frames()]
  File "/ceph/src/pybind/mgr/dashboard/module.py", line 425, in <listcomp>
    for th in threading.enumerate() if th.ident in sys._current_frames()]
<_DummyThread(Dummy-2, started daemon 140067110721280)>:
  File "/ceph/src/pybind/mgr/dashboard/module.py", line 368, in serve
    self.shutdown_event.wait()
  File "/usr/lib64/python3.6/threading.py", line 569, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()

I'd suggest that you take a dump when everything is working fine and another after the freeze. Hopefully that will allow us to better understand where the lock is happening.

Thank you!

#9 Updated by Zach Heise 3 months ago

Okay! As requested, here is a new 'dump threads' command from just after a successful start of the dashboard module. The HTTPS server is working fine at this point.

ceph03> ceph dashboard dump threads

<_DummyThread(Dummy-1, started daemon 140329322194688)>:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1386, in _handle_command
    return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 397, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/module.py", line 388, in dump_threads
    for th in threading.enumerate() if th.ident in sys._current_frames()]
  File "/usr/share/ceph/mgr/dashboard/module.py", line 388, in <listcomp>
    for th in threading.enumerate() if th.ident in sys._current_frames()]
<_DummyThread(Dummy-2, started daemon 140328779454208)>:
  File "/usr/share/ceph/mgr/dashboard/module.py", line 360, in serve
    self.shutdown_event.wait()
  File "/lib64/python3.6/threading.py", line 569, in wait
    signaled = self._cond.wait(timeout)
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<Thread(HTTPServer Thread-4, started daemon 140328078325504)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib64/python3.6/threading.py", line 885, in run
    self._target(*self._args, **self._kwargs)
  File "/lib/python3.6/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
    self.httpserver.start()
  File "/lib/python3.6/site-packages/cheroot/server.py", line 1837, in start
    self.serve()
  File "/lib/python3.6/site-packages/cheroot/server.py", line 1810, in serve
    self._connections.run(self.expiration_interval)
  File "/lib/python3.6/site-packages/cheroot/connections.py", line 201, in run
    self._run(expiration_interval)
  File "/lib/python3.6/site-packages/cheroot/connections.py", line 210, in _run
    active_list = self._selector.select(timeout=0.01)
  File "/lib/python3.6/site-packages/cheroot/connections.py", line 106, in select
    for key, _ in self._selector.select(timeout=timeout)
  File "/lib64/python3.6/selectors.py", line 445, in select
    fd_event_list = self._epoll.poll(timeout, max_ev)
<WorkerThread(CP Server Thread-5, started daemon 140328069932800)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-6, started daemon 140328061540096)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-7, started daemon 140328053147392)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-8, started daemon 140328044754688)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-9, started daemon 140328036361984)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-10, started daemon 140328027969280)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-11, started daemon 140328019576576)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-12, started daemon 140328011183872)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-13, started daemon 140328002791168)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<WorkerThread(CP Server Thread-14, started daemon 140327994398464)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/site-packages/cheroot/workers/threadpool.py", line 116, in run
    conn = self.server.requests.get()
  File "/lib64/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<NotificationQueue(Thread-15, started daemon 140327986005760)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 401, in run
    self._cond.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()

That was from last week. As usual, this monday the server is hanging. Here's a new dump threads:

ceph03> ceph dashboard dump threads

<_DummyThread(Dummy-1, started daemon 140329322194688)>:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1386, in _handle_command
    return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 397, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/module.py", line 388, in dump_threads
    for th in threading.enumerate() if th.ident in sys._current_frames()]
  File "/usr/share/ceph/mgr/dashboard/module.py", line 388, in <listcomp>
    for th in threading.enumerate() if th.ident in sys._current_frames()]
<_DummyThread(Dummy-2, started daemon 140328779454208)>:
  File "/usr/share/ceph/mgr/dashboard/module.py", line 360, in serve
    self.shutdown_event.wait()
  File "/lib64/python3.6/threading.py", line 569, in wait
    signaled = self._cond.wait(timeout)
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
<Thread(HTTPServer Thread-4, started daemon 140328078325504)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/lib64/python3.6/threading.py", line 885, in run
    self._target(*self._args, **self._kwargs)
  File "/lib/python3.6/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
    self.httpserver.start()
  File "/lib/python3.6/site-packages/cheroot/server.py", line 1837, in start
    self.serve()
  File "/lib/python3.6/site-packages/cheroot/server.py", line 1810, in serve
    self._connections.run(self.expiration_interval)
  File "/lib/python3.6/site-packages/cheroot/connections.py", line 201, in run
    self._run(expiration_interval)
  File "/lib/python3.6/site-packages/cheroot/connections.py", line 210, in _run
    active_list = self._selector.select(timeout=0.01)
  File "/lib/python3.6/site-packages/cheroot/connections.py", line 106, in select
    for key, _ in self._selector.select(timeout=timeout)
  File "/lib64/python3.6/selectors.py", line 445, in select
    fd_event_list = self._epoll.poll(timeout, max_ev)
<NotificationQueue(Thread-15, started daemon 140327986005760)>:
  File "/lib64/python3.6/threading.py", line 905, in _bootstrap
    self._bootstrap_inner()
  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
    self.run()
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 401, in run
    self._cond.wait()
  File "/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()

#10 Updated by Ernesto Puerta 3 months ago

Thanks a lot, Zach! This was really useful. It is the clearest confirmation possible that the issue lies within the Cheroot scope: all the Cheroot server threads are gone.

Based on the most commented still open Cheroot issues, it seems that some exceptions/errors are not properly handled, and hence they lead to pool exhaustion.

Let's try and catch whichever exception comes from the inner loop in the thread pool:

diff --git a/lib/python3.6/site-packages/cheroot/workers/threadpool.py b/lib/python3.6/site-packages/cheroot/workers/threadpool.py
index 915934cc..16d92ce8 100644
--- a/cheroot/workers/threadpool.py
+++ b/cheroot/workers/threadpool.py
@@ -135,7 +135,7 @@ class WorkerThread(threading.Thread):
                         self.work_time += time.time() - self.start_time
                         self.start_time = None
                     self.conn = None
-        except (KeyboardInterrupt, SystemExit) as ex:
+        except Exception as ex:
             self.server.interrupt = ex

That should trigger the stop of the server and raise the exception in the caller thread. If the cherrypy.log.error_log.propagate is set to True in the ...dashboard/module.py, then the exception will be displayed in ceph-mgr logs, and hopefully this store comes to an end...

#11 Updated by Zach Heise 3 months ago

Okay, I'll put that into threadpool.py on ceph03 then. When you say "stop of the server" you mean then, that my dashboard, instead of just 'hanging' and spinning the throbber forever, will actually properly be killed and 404 when visited?

Also available in: Atom PDF