Bug #48973

mgr/dashboard: dashboard hangs when accessing it

Added by Ernesto Puerta 3 months ago. Updated 22 days ago.

Status: Resolved
Priority: Immediate
Category: General
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description of problem

On the first attempt to access the dashboard (typing https://<dashboard_url> and pressing ENTER), the browser shows no page and simply keeps waiting. Tested in Chrome Version 87.0.4280.88 (Official Build) (64-bit) and Firefox 84.0.1 (64-bit).

Environment

  • ceph version string:
    • Initially identified in master Jan 24th (97480142a69e7ff5bd2abaceb42cffe4b749d00c)
    • Reproduced in Pacific (16.1.0)
    • Also reproduced with a build from 1 month earlier, Dec 24th (6793756f45f669240de952edd92946541385d090). This rules out the latest changes to ceph-mgr C++ code related to GIL/locks.
    • Also reproduced with Dec 16 (63a5cd41c8b4e1ff5ee01854b4aa1425fe2da1bf). This rules out the CVE changes, including JWT and account lock-out.
  • Platform (OS/distro/release):
    • CentOS 8.3 / Fedora 32
    • python3-cherrypy-18.4.0-1.el8.noarch
    • NOT REPRODUCED in OpenSUSE Tumbleweed (Cherrypy 18.6.0-2.1 - Cheroot 8.3.0)
  • Cluster details (nodes, monitors, OSDs): minimal vstart cluster with 1 mon + 1 mgr + 3 OSDs. It happens as well in Cephadm deployments.
  • Browser used (e.g.: Version 86.0.4240.198 (Official Build) (64-bit)):
    • Chrome Version 87.0.4280.88 (Official Build) (64-bit)
    • Firefox 84.0.1 (64-bit)
  • Other:
    • Plain HTTP: originally noted as NOT REPRODUCED (HTTPS is required), but it happens there too. The likelihood of hitting the issue therefore seems to grow with the time needed to establish the connection (HTTP < static assets over HTTPS < HTTPS + Auth).

How reproducible

From a freshly launched dashboard (or an immediately restarted mgr), wait until the initialization finishes (curl -kv https://<dashboard_url> returns the index.html). Then switch to a browser (either Chrome or Firefox) and type the dashboard URL in the navigation bar and press ENTER. That's enough to trigger the issue.

Sporadic requests via `curl` don't trigger the issue. It happens when multiple requests are issued at the same time. It can be reproduced from the CLI with Apache benchmark:

> ab -c20 -n1000 "https://<dashboard_url>/docs" 

Benchmarking <dashboard_url> (be patient)
Completed 1000 requests
Completed 2000 requests
SSL handshake failed (5).
Completed 3000 requests
SSL handshake failed (5).
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
...

Complete requests:      10000
Failed requests:        2
   (Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
Total transferred:      13387322 bytes
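
Where Apache benchmark is not available, the same kind of concurrent load can be approximated from Python. This is a minimal sketch, not part of the original report: the names `fetch_status` and `hammer` are hypothetical, and the defaults only mirror the `-c20 -n1000` values from the command above. Certificate verification is disabled because the dashboard uses a self-signed certificate (the equivalent of `curl -k`).

```python
import concurrent.futures
import ssl
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or the exception class name on failure."""
    # Skip certificate verification: the dashboard certificate is self-signed.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx status.
        return exc.code
    except Exception as exc:
        # Timeouts, TLS handshake failures, refused connections, etc.
        return type(exc).__name__


def hammer(url, concurrency=20, total=1000, fetch=fetch_status):
    """Issue `total` requests using `concurrency` worker threads and tally the outcomes."""
    tally = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        for outcome in pool.map(fetch, [url] * total):
            tally[outcome] = tally.get(outcome, 0) + 1
    return tally
```

Calling `hammer("https://<dashboard_url>/docs")` returns a dictionary mapping each outcome (status code or exception name) to its count. A growing share of timeout entries would be consistent with the hang described here, although this sketch has not been validated against an affected dashboard.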

Actual results

The dashboard login page is not displayed and the browser keeps loading/waiting until manually stopped (minutes). After that, curl requests no longer work either; the TLS handshake completes but no HTTP response arrives:

curl -kv https://localhost:11000
* Rebuilt URL to: https://localhost:11000/
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 11000 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=IT; CN=ceph-dashboard
*  start date: Jan 20 16:42:29 2021 GMT
*  expire date: Jan 18 16:42:29 2031 GMT
*  issuer: O=IT; CN=ceph-dashboard
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/1.1
> Host: localhost:11000
> User-Agent: curl/7.61.1
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
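
The curl trace above shows a connection that is accepted and completes the TLS handshake but never returns an HTTP response. A small probe with a timeout can tell that state apart from a refused connection. This is a rough sketch under assumptions: `probe` and its return labels are hypothetical names, and the `use_tls` switch exists only so the same check can be run against plain HTTP.

```python
import socket
import ssl


def probe(host, port, use_tls=True, timeout=5.0):
    """Classify a server as 'ok', 'no-response', or 'connect-failed'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect-failed"
    try:
        if use_tls:
            # Self-signed certificate: skip verification, like curl -k.
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            sock = ctx.wrap_socket(sock, server_hostname=host)
        request = f"GET / HTTP/1.1\r\nHost: {host}:{port}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        data = sock.recv(1024)
        # Anything that is not the start of an HTTP status line counts as no response.
        return "ok" if data.startswith(b"HTTP/") else "no-response"
    except socket.timeout:
        # Connected (and possibly handshaked), but nothing came back in time.
        return "no-response"
    except OSError:
        # TLS errors and connection resets end up here.
        return "connect-failed"
    finally:
        sock.close()
```

For the situation in this report, `probe("localhost", 11000)` would be expected to return `"no-response"` once the dashboard has hung, and `"ok"` on a healthy one.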

Expected results

Dashboard login page loads normally.

Additional info

Current efforts are focused on finding where this issue comes from. Candidates:
  • Dashboard Python code
  • Ceph-mgr Python
  • Ceph-mgr C++
  • CherryPy: reproduced with both the builtin and PyOpenSSL transport wrappers.

History

#1 Updated by Ernesto Puerta 3 months ago

  • Description updated (diff)

#2 Updated by Ernesto Puerta 3 months ago

  • Description updated (diff)

#3 Updated by Ernesto Puerta 3 months ago

  • Description updated (diff)

#4 Updated by Ernesto Puerta 3 months ago

  • Description updated (diff)

#5 Updated by Ernesto Puerta 3 months ago

  • Description updated (diff)

#6 Updated by Ken Dreyer 3 months ago

Ernesto, I updated CherryPy to 18.6.0 today in Rawhide and I published el8 RPMs at https://fedorapeople.org/~ktdreyer/bz1777494/ , if it helps for testing.

#7 Updated by Ernesto Puerta 3 months ago

  • Status changed from Need More Info to In Progress

Thanks a lot, Ken!

In fact, it seems that the issue doesn't come from CherryPy but from Cheroot, the CherryPy web server engine. We've seen that EPEL 8 provides Cheroot v8.5.1 (a non-stable version from this Dec 16th, while the last one labeled as stable is v8.4.5 from August).

I manually applied this patch to 8.5.1 and the issue vanished... So, how can we ask the EPEL maintainers to update this package to 8.5.2 or keep it at the latest stable (8.4.5, which I already tested and which doesn't exhibit this issue)?
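
To check quickly whether a given host carries the Cheroot build discussed here, the installed version can be read from the package metadata. The following is an illustrative sketch only: `parse_version`, `cheroot_status` and the `AFFECTED` set are hypothetical names, and the set encodes nothing beyond what this comment states (8.5.1 exhibits the issue, 8.4.5 was tested and does not).

```python
from importlib import metadata

# Assumption taken from this report: only 8.5.1 is known to show the hang.
AFFECTED = {(8, 5, 1)}


def parse_version(text):
    """Turn '8.5.1' into (8, 5, 1); stops at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def cheroot_status(version=None):
    """Describe whether the given (or installed) Cheroot version matches this report."""
    if version is None:
        try:
            version = metadata.version("cheroot")
        except metadata.PackageNotFoundError:
            return "cheroot is not installed"
    if parse_version(version) in AFFECTED:
        return f"cheroot {version}: reported as affected"
    return f"cheroot {version}: not known to be affected"
```

Running `print(cheroot_status())` with the same Python interpreter that ceph-mgr uses reports on the installed package; passing a version string explicitly, as in `cheroot_status("8.5.1")`, skips the lookup.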

#9 Updated by Ernesto Puerta 3 months ago

  • Description updated (diff)

#10 Updated by Ken Dreyer 3 months ago

Yeah, we brought Cheroot v8.5.1 to EPEL 8 for #47875.

I built Cheroot v8.5.2 at https://fedorapeople.org/~ktdreyer/bz1920461/ , want to test it? It seems to fix this issue for me.

#11 Updated by Ken Dreyer 3 months ago

Justin pushed v8.5.2 to Bodhi at https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-848a87b9dc, so this will go to the epel-testing Yum repository in the next day or so when the Fedora admins push that to testing.

#12 Updated by Alfonso Martínez 3 months ago

Ken Dreyer wrote:

Yeah, we brought Cheroot v8.5.1 to EPEL 8 for #47875.

I built Cheroot v8.5.2 at https://fedorapeople.org/~ktdreyer/bz1920461/ , want to test it? It seems to fix this issue for me.

Hi Ken,

I tested v8.5.2 (adding that repo in https://github.com/rhcs-dashboard/ceph-dev/ centos8 container and upgrading the package): it fixes the problem.

#13 Updated by Ken Dreyer 3 months ago

Thank you for adding karma in Bodhi. This should go out to EPEL's stable repo this week.

#14 Updated by Ernesto Puerta 3 months ago

  • Status changed from In Progress to Resolved
  • Backport deleted (pacific)

No need to backport as this came from an external dependency and was fixed there (EPEL 8).

#15 Updated by Ernesto Puerta 22 days ago

  • Project changed from mgr to Dashboard
  • Category changed from dashboard/general to General
