mgr/dashboard: dashboard hangs when accessing it
Description of problem
On the first attempt to access the dashboard (typing https://<dashboard_url> and pressing ENTER), the browser (tested in Chrome Version 87.0.4280.88 (Official Build) (64-bit) and Firefox 84.0.1 (64-bit)) doesn't show any page and simply keeps waiting.
- Initially identified in master Jan 24th (97480142a69e7ff5bd2abaceb42cffe4b749d00c)
- Reproduced in Pacific (16.1.0)
- Also reproduced with a build from 1 month earlier, Dec 24th (6793756f45f669240de952edd92946541385d090). This rules out the latest changes to the ceph-mgr C++ code related to GIL/locks.
- And with Dec 16th (63a5cd41c8b4e1ff5ee01854b4aa1425fe2da1bf). This rules out the CVE changes, including JWT and account lock-out.
- Platform (OS/distro/release):
- CentOS 8.3 / Fedora 32
- NOT REPRODUCED in openSUSE Tumbleweed (CherryPy 18.6.0-2.1 - Cheroot 8.3.0)
- Cluster details (nodes, monitors, OSDs): minimal vstart cluster with 1 mon + 1 mgr + 3 OSDs. It happens as well in Cephadm deployments.
- Browser used: Chrome Version 87.0.4280.88 (Official Build) (64-bit)
Plain HTTP: initially reported as NOT REPRODUCED (HTTPS is required), but it happens with plain HTTP too. So the likelihood of the issue popping up seems to be related to the time it takes to establish the connection (HTTP < static assets over HTTPS < HTTPS + Auth).
From a freshly launched dashboard (or an immediately restarted mgr), wait until the initialization finishes (`curl -kv https://<dashboard_url>` returns the index.html). Then switch to a browser (either Chrome or Firefox), type the dashboard URL in the navigation bar and press ENTER. That's enough to trigger the issue.
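The "wait until the initialization finishes" step can be scripted. The following is only a minimal sketch, assuming the dashboard answers a plain GET with HTTP 200 once it is ready; `dashboard_responds` and `wait_until_ready` are hypothetical helper names, and certificate verification is disabled to mirror `curl -k`:

```python
import ssl
import time
import urllib.request


def dashboard_responds(url, timeout=5.0):
    """Return True if a GET on `url` yields HTTP 200 within `timeout` seconds."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # like `curl -k`: accept the self-signed cert
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except Exception:                # refused, timed out, TLS failure, HTTP error...
        return False


def wait_until_ready(url, probe=dashboard_responds, deadline=120.0, interval=2.0):
    """Poll `probe(url)` until it succeeds or `deadline` seconds have elapsed."""
    end = time.monotonic() + deadline
    while True:
        if probe(url):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval)


# Example (hypothetical URL): wait_until_ready("https://localhost:11000/")
```

The `probe` argument is injectable so the polling loop can be exercised without a running mgr.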
Sporadic requests via `curl` don't trigger the issue. It happens when multiple requests are issued at the same time. It can be reproduced from the CLI with ApacheBench (`ab`):
    > ab -c20 -n1000 "https://<dashboard_url>/docs"
    Benchmarking <dashboard_url> (be patient)
    Completed 1000 requests
    Completed 2000 requests
    SSL handshake failed (5).
    Completed 3000 requests
    SSL handshake failed (5).
    Completed 4000 requests
    Completed 5000 requests
    Completed 6000 requests
    ...
    Complete requests: 10000
    Failed requests: 2
    (Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
    Total transferred: 13387322 bytes
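Where `ab` is not available, a similar burst of concurrent requests can be approximated in Python. This is a rough sketch rather than an equivalent of `ab`: `fetch_status` and `hammer` are hypothetical names, the defaults only echo the `-c20 -n1000` values above, and a hung server shows up as a timeout exception name in the tally instead of a failed handshake count:

```python
import ssl
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10.0):
    """Fetch `url` once; return the HTTP status, or the exception class name."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # like `curl -k`: accept the self-signed cert
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except Exception as exc:         # timeouts, TLS errors, resets, HTTP errors
        return type(exc).__name__


def hammer(url, total=1000, concurrency=20, fetch=fetch_status):
    """Issue `total` requests, up to `concurrency` at a time, and tally outcomes."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, [url] * total))


# Example (hypothetical URL): print(hammer("https://localhost:11000/docs"))
```

A tally dominated by timeout or TLS exception names, rather than 200, would suggest that the server has stopped answering.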
Actual results: the dashboard login page is not displayed and the browser keeps loading/waiting until manually stopped (minutes). After that, `curl` requests no longer work:
    curl -kv https://localhost:11000
    * Rebuilt URL to: https://localhost:11000/
    * Trying ::1...
    * TCP_NODELAY set
    * Connected to localhost (::1) port 11000 (#0)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    * successfully set certificate verify locations:
    * CAfile: /etc/pki/tls/certs/ca-bundle.crt
      CApath: none
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    * TLSv1.3 (IN), TLS handshake, [no content] (0):
    * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
    * TLSv1.3 (IN), TLS handshake, [no content] (0):
    * TLSv1.3 (IN), TLS handshake, Certificate (11):
    * TLSv1.3 (IN), TLS handshake, [no content] (0):
    * TLSv1.3 (IN), TLS handshake, CERT verify (15):
    * TLSv1.3 (IN), TLS handshake, [no content] (0):
    * TLSv1.3 (IN), TLS handshake, Finished (20):
    * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
    * TLSv1.3 (OUT), TLS handshake, [no content] (0):
    * TLSv1.3 (OUT), TLS handshake, Finished (20):
    * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
    * ALPN, server did not agree to a protocol
    * Server certificate:
    * subject: O=IT; CN=ceph-dashboard
    * start date: Jan 20 16:42:29 2021 GMT
    * expire date: Jan 18 16:42:29 2031 GMT
    * issuer: O=IT; CN=ceph-dashboard
    * SSL certificate verify result: self signed certificate (18), continuing anyway.
    * TLSv1.3 (OUT), TLS app data, [no content] (0):
    > GET / HTTP/1.1
    > Host: localhost:11000
    > User-Agent: curl/7.61.1
    > Accept: */*
    >
    * TLSv1.3 (IN), TLS handshake, [no content] (0):
    * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
    * TLSv1.3 (IN), TLS handshake, [no content] (0):
    * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
Expected results: the dashboard login page loads normally.
Additional info
Current efforts are directed at finding where this issue comes from:
- Dashboard Python code:
- Ceph-mgr Python
- Ceph-mgr C++
- CherryPy: reproduced with both
#7 Updated by Ernesto Puerta about 1 month ago
- Status changed from Need More Info to In Progress
Thanks a lot, Ken!
In fact, it seems that the issue doesn't come from CherryPy but from Cheroot, the CherryPy web server engine. We've seen that EPEL 8 provides Cheroot v8.5.1 (a non-stable version from this Dec 16th, while the last one labeled as stable is v8.4.5 from August).
I manually applied this patch to 8.5.1 and the issue vanished... So, how can we ask the EPEL maintainers to update this package to 8.5.2, or to keep it at the latest stable version (8.4.5, which I already tested and which doesn't exhibit this issue)?
#8 Updated by Ernesto Puerta about 1 month ago
BZ opened against the EPEL project: https://bugzilla.redhat.com/show_bug.cgi?id=1920461
#11 Updated by Ken Dreyer about 1 month ago
Justin pushed v8.5.2 to Bodhi at https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-848a87b9dc, so this will go to the epel-testing Yum repository in the next day or so when the Fedora admins push that to testing.
#12 Updated by Alfonso Martínez 30 days ago
Ken Dreyer wrote:
> Yeah, we brought Cheroot v8.5.1 to EPEL 8 for #47875.
> I built Cheroot v8.5.2 at https://fedorapeople.org/~ktdreyer/bz1920461/ , want to test it? It seems to fix this issue for me.

I tested v8.5.2 (adding that repo to the https://github.com/rhcs-dashboard/ceph-dev/ centos8 container and upgrading the package): it fixes the problem.
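To check which Cheroot version a given Python environment provides, something like the following could be used. It is only a sketch: the version sets merely restate what was reported in this ticket (8.5.1 hangs; 8.3.0, 8.4.5 and 8.5.2 did not exhibit the issue), any other version is reported as unknown, and `classify_cheroot` and `installed_cheroot_status` are hypothetical names:

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as reported in this ticket only; nothing is claimed about others.
REPORTED_AFFECTED = {"8.5.1"}
REPORTED_OK = {"8.3.0", "8.4.5", "8.5.2"}


def classify_cheroot(ver):
    """Map a Cheroot version string to 'affected', 'ok' or 'unknown'."""
    if ver in REPORTED_AFFECTED:
        return "affected"
    if ver in REPORTED_OK:
        return "ok"
    return "unknown"


def installed_cheroot_status():
    """Classify the Cheroot version visible to the current interpreter."""
    try:
        return classify_cheroot(version("cheroot"))
    except PackageNotFoundError:
        return "not installed"


# Example: print(installed_cheroot_status())
```

This has to run with the same interpreter that ceph-mgr uses; otherwise it may report a different Cheroot installation.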