Bug #58921

Mgr crashing with dashboard module enabled in 16.2.9

Added by Adrian Nicolae about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We noticed some issues with the orchestrator. We added new hosts with new drives which weren't automatically detected by the orchestrator. Checking the mgr logs, I noticed the mgr was crashing with the dashboard module enabled (maybe the path has an extra backslash in the code):

Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: Internal Server Error
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: Traceback (most recent call last):
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 58, in serve_file
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: st = os.stat(path)
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: FileNotFoundError: [Errno 2] No such file or directory: '/usr/share/ceph/mgr/dashboard/frontend/dist/en-US/prometheus_receiver'
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: During handling of the above exception, another exception occurred:
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: Traceback (most recent call last):
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in dashboard_exception_handler
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: return handler(*args, **kwargs)
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: return self.callable(*self.args, **self.kwargs)
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/usr/share/ceph/mgr/dashboard/controllers/home.py", line 135, in __call__
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: return serve_file(full_path)
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 65, in serve_file
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: raise cherrypy.NotFound()
Feb 01 10:01:08 ds-ceph01-madrid bash[2829574]: cherrypy._cperror.NotFound: (404, "The path '/prometheus_receiver' was not found.")

After disabling the dashboard module, the new drives were detected and the new OSD containers (Docker) were deployed.
However, I've now noticed another orch issue even with the dashboard disabled:
- I have a failed drive (osd.92).
- The drive was marked as down and out, and the rebalancing went fine.
- I'm trying to purge the OSD now that rebalancing has completed, in order to ask for a replacement, with "ceph orch osd rm osd.92 --force" (full sequence below).
- The purge does nothing:
ceph orch osd rm status
OSD  HOST    STATE    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
92   node10  started  0    False    True   False
- The OSD daemons are not refreshed:
ceph orch ps --daemon_type osd --daemon_id 92
NAME    HOST    PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
osd.92  node10         error   3d ago     4w   -        4096M    <unknown>  <unknown>
- I don't see any other errors in the mgr logs, even with debug level 20 enabled.
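For reference, the replacement workflow I'm trying to follow (per the cephadm docs as I understand them; osd.92 and node10 are the failed OSD and its host):

ceph orch osd rm osd.92 --force     # schedule the removal (this is the command that does nothing here)
ceph orch osd rm status             # track the removal; it stays at "started" with 0 PGs
ceph orch device ls node10          # once the removal finishes and the drive is swapped, the new device should appear here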

Related issues 1 (0 open, 1 closed)

Is duplicate of Orchestrator - Bug #55638: alertmanager webhook urls may lead to 404 (Resolved) - Redouane Kachach Elhichou

Actions #1

Updated by Ernesto Puerta about 1 year ago

  • Is duplicate of Bug #55638: alertmanager webhook urls may lead to 404 added
Actions #2

Updated by Ernesto Puerta about 1 year ago

This might be a duplicate of https://tracker.ceph.com/issues/55638. It still needs to be checked whether that fix was backported to Pacific.

Actions #3

Updated by Adam King about 1 year ago

  • Project changed from Ceph to Orchestrator
  • Category deleted (OSD)

The fix for the dashboard module crashing should be in Pacific, but I think only in the 16.2.11 release. As for the OSD removal, if cephadm can't refresh the daemons or devices properly it doesn't go forward with other actions, so the first step is to get the refresh going again. Typically I'd suggest doing a mgr failover ("ceph mgr fail"), waiting a few minutes, then checking the REFRESHED column in "ceph orch ps" and "ceph orch device ls". Often when people hit these issues it's one specific host that can't be refreshed. Known past causes include a hanging mount point on the host causing the ceph-volume inventory process to get stuck in D state (fixed later, but not in 16.2.9), or the root partition on the host being full. Either way, once you see which host(s) are failing to refresh, you can investigate those hosts in particular.
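Roughly, that check sequence looks like this (the host-side commands at the end are only examples of where to look, not an exhaustive list):

ceph mgr fail            # fail over to a standby mgr so the cephadm refresh loop restarts
# wait a few minutes, then:
ceph orch ps             # REFRESHED column should show recent times for all daemons
ceph orch device ls      # same for devices; hosts stuck at e.g. "3d ago" are the suspects
# on a stale host, check for a stuck ceph-volume inventory process and a full root partition:
ps aux | grep ceph-volume
df -h /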

Actions #4

Updated by Adrian Nicolae about 1 year ago

Indeed, it was a single host causing all these orch issues. The node-exporter container went crazy; it was crashing constantly without the Docker container being properly stopped:
Mar 07 16:04:01 node05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 07 16:04:01 node05 systemd[1]: : Found left-over process 229584 (docker) in control group while starting unit. Ignoring.
Mar 07 16:04:01 node05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 07 16:04:01 node05 systemd[1]: : Found left-over process 230101 (bash) in control group while starting unit. Ignoring.
Mar 07 16:04:01 node05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 07 16:04:01 node05 systemd[1]: : Found left-over process 230119 (bash) in control group while starting unit. Ignoring.

We ended up rebooting the host; just stopping the node-exporter service didn't help. After the reboot, all the orch issues are gone.
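For anyone hitting the same thing, roughly what we checked on the affected host before giving up and rebooting (the unit glob and PIDs below are just from our own logs; in our case only the reboot actually cleared it):

systemctl list-units 'ceph-*node-exporter*'   # find the cephadm-managed node-exporter unit
systemctl status 'ceph-*node-exporter*'       # shows the restart loop and the left-over processes
ps -fp 229584,230101,230119                   # the left-over PIDs systemd complained about
systemctl stop 'ceph-*node-exporter*'         # stopping the unit alone did not help in our case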
