Bug #43008

mgr/dashboard: a failure in rbd-mirror makes other dashboard pages fail

Added by Ernesto Puerta 3 months ago. Updated 3 months ago.

Status: Duplicate
Priority: Normal
Category: dashboard/rbd-mirror
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

During QE testing of a build upgrade, the previous rbd-mirror daemon hung and a new one started running. While this situation is external to the dashboard, it caused failures not only in the RBD-related pages but also in the Pools and Hosts pages.

The cause is that the summary endpoint raises an Exception:

traceback: "Traceback (most recent call last):
  File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/lib/python3.6/site-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cptools.py", line 237, in wrap
    return self.newhandler(innerfunc, *args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 88, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line 649, in inner
    ret = func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/summary.py", line 86, in __call__
    result['rbd_mirroring'] = self._rbd_mirroring()
  File "/usr/share/ceph/mgr/dashboard/controllers/summary.py", line 22, in _rbd_mirroring
    _, data = get_daemons_and_pools()
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 244, in wrapper
    return rvc.run(fn, args, kwargs)
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 226, in run
    raise self.exception
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 147, in run
    val = self.fn(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/rbd_mirroring.py", line 185, in get_daemons_and_pools
    daemons = get_daemons()
  File "/usr/share/ceph/mgr/dashboard/controllers/rbd_mirroring.py", line 56, in get_daemons
    status = json.loads(status['json'])
TypeError: 'NoneType' object is not subscriptable

While the dashboard cannot and (IMHO) shouldn't handle all possible failures in core Ceph components, it should at least:
  • be resilient to those failures, or
  • if that is not possible, keep failures from impacting other components (fault confinement).
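
Fault confinement in an aggregating endpoint could look like the minimal sketch below. It assumes the summary is assembled from independent per-section callables; `collect_sections`, the section names and the `None` fallback are hypothetical and are not taken from the dashboard code.

```python
import logging

logger = logging.getLogger(__name__)


def collect_sections(providers, fallback=None):
    """Call every provider; a failing one yields `fallback` instead of aborting."""
    result = {}
    for name, provider in providers.items():
        try:
            result[name] = provider()
        except Exception:  # deliberately broad: confine any failure to its section
            logger.exception("summary section %r failed", name)
            result[name] = fallback
    return result


def broken_rbd_mirroring():
    # Stand-in for the failure shown in the traceback above.
    raise TypeError("'NoneType' object is not subscriptable")


summary = collect_sections({
    'health_status': lambda: 'HEALTH_OK',
    'rbd_mirroring': broken_rbd_mirroring,
})
```

With this shape, the failing section is logged and replaced by the fallback, while the remaining sections are still returned to the caller.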

The error described in this specific issue is easy to fix (catch the TypeError exception). However, that approach is hard to maintain across the whole dashboard codebase: it would result in defensive programming, with try-except blocks scattered around nearly every line of code.
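
For illustration only, the local fix might resemble the sketch below. `parse_daemon_status` is a hypothetical helper, not the actual change made in the dashboard; it only assumes, based on the traceback, that the status value can be `None` when a daemon is hung.

```python
import json


def parse_daemon_status(status):
    """Decode status['json']; return None when the status is missing or invalid."""
    try:
        return json.loads(status['json'])
    except (TypeError, KeyError, ValueError):
        # TypeError: status is None (the case in this report) or not a mapping
        # KeyError: the 'json' field is absent
        # ValueError: the payload is not valid JSON
        return None
```

Every caller that consumes a similar value would need its own guard of this kind, which is the maintenance cost described above.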

A possible solution could be to add a validation and data adaptation layer between the ceph-mgr API and the back-end. This layer would validate the expected inputs against a schema and provide a single place to encode the fallback behaviour for validation failures (instead of scattered handling logic).
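
A rough sketch of such a layer, using only the standard library, is shown below. The schema format (field name mapped to expected type), `DAEMON_STATUS_SCHEMA`, `matches_schema` and `adapt_daemon_status` are all assumptions made for illustration; a real implementation might use a dedicated schema library and different fallbacks.

```python
import json

# Hypothetical schema: required field name -> expected Python type.
DAEMON_STATUS_SCHEMA = {'json': str}


def matches_schema(data, schema):
    """Return True if `data` is a dict holding every field with the expected type."""
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in schema.items())


def adapt_daemon_status(raw, fallback=None):
    """Validate raw mgr data once and return decoded status, or the fallback."""
    if not matches_schema(raw, DAEMON_STATUS_SCHEMA):
        return fallback
    try:
        return json.loads(raw['json'])
    except ValueError:  # payload has the right type but is not valid JSON
        return fallback
```

Controllers would then consume only adapted data, so the decision about what a missing or malformed status means lives in one function rather than at each call site.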


Related issues

Duplicates mgr - Bug #43029: mgr/dashboard: RBD mirroring page results in "500 - internal server error" Resolved

History

#1 Updated by Ernesto Puerta 3 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 31881

#2 Updated by Ernesto Puerta 3 months ago

  • Assignee set to Ernesto Puerta

#3 Updated by Ernesto Puerta 3 months ago

  • Status changed from Fix Under Review to Duplicate

#4 Updated by Ricardo Marques 3 months ago

  • Duplicates Bug #43029: mgr/dashboard: RBD mirroring page results in "500 - internal server error" added
