Bug #43008: mgr/dashboard: a failure in rbd-mirror makes other dashboard pages fail - Dashboard - Ceph

Bug #43008

closed

mgr/dashboard: a failure in rbd-mirror makes other dashboard pages fail

Added by Ernesto Puerta over 4 years ago. Updated about 3 years ago.

Status:

Duplicate

Priority:

Normal

Assignee:

Ernesto Puerta

Category:

Component - RBD Mirroring

Target version:

Ceph - v14.2.6

% Done:

Source:

Q/A

Tags:

Backport:

nautilus

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

31881

Crash signature (v1):

Crash signature (v2):

Description

During QE testing of a build upgrade, the previous rbd-mirror daemon hung and a new one started running. Although this situation is external to the dashboard, it caused failures not only in the RBD-related pages but also in the Pools and Hosts pages.

The cause is that the summary endpoint raises an Exception:

traceback: "Traceback (most recent call last):
  File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/lib/python3.6/site-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cptools.py", line 237, in wrap
    return self.newhandler(innerfunc, *args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 88, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line 649, in inner
    ret = func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/summary.py", line 86, in __call__
    result['rbd_mirroring'] = self._rbd_mirroring()
  File "/usr/share/ceph/mgr/dashboard/controllers/summary.py", line 22, in _rbd_mirroring
    _, data = get_daemons_and_pools()
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 244, in wrapper
    return rvc.run(fn, args, kwargs)
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 226, in run
    raise self.exception
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 147, in run
    val = self.fn(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/rbd_mirroring.py", line 185, in get_daemons_and_pools
    daemons = get_daemons()
  File "/usr/share/ceph/mgr/dashboard/controllers/rbd_mirroring.py", line 56, in get_daemons
    status = json.loads(status['json'])
TypeError: 'NoneType' object is not subscriptable

While the dashboard cannot and (IMHO) shouldn't handle every possible failure in core Ceph components, it should at least:

be resilient to those failures, or,
if that is not possible, prevent them from impacting other components (fault confinement).
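
To illustrate fault confinement, here is a minimal sketch of an aggregating endpoint that collects each section independently, so a failure in one section cannot break the others. It is not the dashboard's actual summary controller: `build_summary`, the section names, and the use of `None` as an "unavailable" marker are assumptions made for illustration.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def build_summary(sections: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Collect every section on its own so one failure stays confined."""
    result: Dict[str, Any] = {}
    for name, collect in sections.items():
        try:
            result[name] = collect()
        except Exception:  # deliberately broad: confine any section failure
            logger.exception("summary section %r failed", name)
            result[name] = None  # hypothetical "section unavailable" marker
    return result


def broken_rbd_mirroring() -> Any:
    # Mimics the reported failure: the status lookup returned None.
    status = None
    return status['json']  # raises TypeError


summary = build_summary({
    "health_status": lambda: "HEALTH_OK",
    "rbd_mirroring": broken_rbd_mirroring,
})
```

With this shape, the front-end would have to treat a `None` section as "data unavailable" instead of failing the whole page.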

The error described in this specific issue is easy to fix (catch the TypeError exception). However, this approach is hard to maintain across the whole dashboard codebase: it would lead to defensive programming, with try-except blocks scattered around nearly every line of code.
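
For reference, the local fix could look roughly like the sketch below. The traceback shows `json.loads(status['json'])` failing because `status` is `None`; the helper name `parse_daemon_status` and the choice of an empty dict as the fallback are assumptions, not the code from the actual pull request.

```python
import json
from typing import Any, Dict, Optional


def parse_daemon_status(status: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the parsed daemon status, or {} if it is missing or malformed."""
    # A hung daemon may report no status at all (status is None).
    if not isinstance(status, dict) or not isinstance(status.get('json'), str):
        return {}
    try:
        parsed = json.loads(status['json'])
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {}
    return parsed if isinstance(parsed, dict) else {}
```

Every call site that reads daemon status would need a similar guard, which is the maintenance burden described above.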

A possible solution could be to add a validation and data adaptation layer between the ceph-mgr API and the dashboard back-end. This layer would validate the incoming data against a schema and provide a single place to encode the fallback behaviour when validation fails (instead of scattered handling logic).
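
One way such a layer might look, as a rough sketch using only the standard library: a schema maps each expected field to a type and a fallback value, and a single `adapt` function applies it. The schema format, the `adapt` function, and the field names in `DAEMON_STATUS_SCHEMA` are all hypothetical; a real implementation would more likely use a dedicated schema library.

```python
import json
from typing import Any, Dict, Tuple

# Hypothetical schema format: field name -> (expected type, fallback value).
Schema = Dict[str, Tuple[type, Any]]


def adapt(raw: Any, schema: Schema) -> Dict[str, Any]:
    """Validate raw data against the schema, substituting fallbacks as needed."""
    source = raw if isinstance(raw, dict) else {}
    adapted: Dict[str, Any] = {}
    for field, (expected_type, fallback) in schema.items():
        value = source.get(field)
        # Missing or mistyped fields get the fallback from the schema.
        adapted[field] = value if isinstance(value, expected_type) else fallback
    return adapted


# Hypothetical schema for an rbd-mirror daemon status record.
DAEMON_STATUS_SCHEMA: Schema = {
    "json": (str, "{}"),
    "hostname": (str, "unknown"),
}

hung = adapt(None, DAEMON_STATUS_SCHEMA)
healthy = adapt({"json": '{"state": "ok"}', "hostname": "node1"},
                DAEMON_STATUS_SCHEMA)
```

Controllers would then consume only adapted data, so `json.loads(hung["json"])` yields an empty dict instead of raising a TypeError.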

Related issues 1 (0 open — 1 closed)