Bug #58921

Mgr crashing with dashboard module enabled in 16.2.9

Added by Adrian Nicolae about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We noticed some issues with the orchestrator. We added new hosts with new drives which weren't automatically detected by the orchestrator. Checking the mgr logs, I noticed the mgr was crashing with the dashboard module enabled (maybe the path has an extra backslash in the code):

Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: Internal Server Error
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: Traceback (most recent call last):
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 58, in serve_file
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: st = os.stat(path)
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: FileNotFoundError: [Errno 2] No such file or directory: '/usr/share/ceph/mgr/dashboard/frontend/dist/en-US/prometheus_receiver'
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: During handling of the above exception, another exception occurred:
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: Traceback (most recent call last):
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in dashboard_exception_handler
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: return handler(*args, **kwargs)
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: return self.callable(*self.args, **self.kwargs)
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/usr/share/ceph/mgr/dashboard/controllers/home.py", line 135, in __call__
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: return serve_file(full_path)
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 65, in serve_file
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: raise cherrypy.NotFound()
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: cherrypy._cperror.NotFound: (404, "The path '/prometheus_receiver' was not found.")

After disabling the dashboard module, the new drives were detected and the new OSD containers (Docker) were deployed.
However, I've now noticed another orch issue even with the dashboard disabled:
- I have a failed drive (osd.92).
- The drive was marked as down and out, and the rebalancing went fine.
- I'm trying to purge the OSD now that rebalancing has completed, in order to ask for a replacement, with "ceph orch osd rm osd.92 --force" (full sequence below).
- The purge does nothing:
ceph orch osd rm status
OSD  HOST    STATE    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
92   node10  started  0    False    True   False
- The OSD daemons are not refreshed:
ceph orch ps --daemon_type osd --daemon_id 92
NAME    HOST    PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
osd.92  node10         error   3d ago     4w   -        4096M    <unknown>  <unknown>
- I don't see any other errors in the mgr logs, even with debug level 20 enabled.
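For reference, the replacement workflow I'm trying to follow (per the cephadm docs as I understand them; osd.92 and node10 are the failed OSD and its host):

ceph orch osd rm osd.92 --force     # schedule the removal (this is the command that does nothing here)
ceph orch osd rm status             # track the removal; it stays at "started" with 0 PGs
ceph orch device ls node10          # once the removal finishes and the drive is swapped, the new device should appear here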

Related issues 1 (0 open, 1 closed)

Is duplicate of Orchestrator - Bug #55638: alertmanager webhook urls may lead to 404 (Resolved) - Redouane Kachach Elhichou

Actions #1

Updated by Ernesto Puerta about 1 year ago

  • Is duplicate of Bug #55638: alertmanager webhook urls may lead to 404 added
Actions #2

Updated by Ernesto Puerta about 1 year ago

This might be a duplicate of https://tracker.ceph.com/issues/55638. It still needs to be checked whether that fix was backported to Pacific.

Actions #3

Updated by Adam King about 1 year ago

  • Project changed from Ceph to Orchestrator
  • Category deleted (OSD)

The fix for the dashboard module crashing should be in Pacific, but I think only in the 16.2.11 release. As for the OSD removal, if cephadm can't refresh the daemons or devices properly it doesn't go forward with other actions, so the first step is to get the refresh going again. Typically I'd suggest doing a mgr failover ("ceph mgr fail"), waiting a few minutes, then checking the REFRESHED column in "ceph orch ps" and "ceph orch device ls". Often when people hit these issues it's one specific host that can't be refreshed. Known past causes include a hanging mount point on the host causing the ceph-volume inventory process to get stuck in D state (fixed later, but not in 16.2.9), or the root partition on the host being full. Either way, once you see which host(s) are failing to refresh, you can investigate those hosts in particular.
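Roughly, that check sequence looks like this (the host-side commands at the end are only examples of where to look, not an exhaustive list):

ceph mgr fail            # fail over to a standby mgr so the cephadm refresh loop restarts
# wait a few minutes, then:
ceph orch ps             # REFRESHED column should show recent times for all daemons
ceph orch device ls      # same for devices; hosts stuck at e.g. "3d ago" are the suspects
# on a stale host, check for a stuck ceph-volume inventory process and a full root partition:
ps aux | grep ceph-volume
df -h /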

Actions #4

Updated by Adrian Nicolae about 1 year ago

Indeed, it was a single host causing all these orch issues. The node-exporter container went crazy; it was crashing constantly without the Docker container being properly stopped:
Mar 07 16:04:01 node05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 07 16:04:01 node05 systemd[1]: : Found left-over process 229584 (docker) in control group while starting unit. Ignoring.
Mar 07 16:04:01 node05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 07 16:04:01 node05 systemd[1]: : Found left-over process 230101 (bash) in control group while starting unit. Ignoring.
Mar 07 16:04:01 node05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 07 16:04:01 node05 systemd[1]: : Found left-over process 230119 (bash) in control group while starting unit. Ignoring.

We ended up rebooting the host; just stopping the node-exporter service didn't help. After the reboot, all the orch issues are gone.
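For anyone hitting the same thing, roughly what we checked on the affected host before giving up and rebooting (the unit glob and PIDs below are just from our own logs; in our case only the reboot actually cleared it):

systemctl list-units 'ceph-*node-exporter*'   # find the cephadm-managed node-exporter unit
systemctl status 'ceph-*node-exporter*'       # shows the restart loop and the left-over processes
ps -fp 229584,230101,230119                   # the left-over PIDs systemd complained about
systemctl stop 'ceph-*node-exporter*'         # stopping the unit alone did not help in our case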
