mgr/dashboard: a failure in rbd-mirror makes other dashboard pages fail
During QE testing of a build upgrade, a previous rbd-mirror daemon hung and a new one started running. Although this situation is external to the dashboard, it caused failures not only in the RBD-related pages but also in the Pools and Hosts pages.
The cause is that the summary endpoint raises an exception:
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/lib/python3.6/site-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cptools.py", line 237, in wrap
    return self.newhandler(innerfunc, *args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 88, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line 649, in inner
    ret = func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/summary.py", line 86, in __call__
    result['rbd_mirroring'] = self._rbd_mirroring()
  File "/usr/share/ceph/mgr/dashboard/controllers/summary.py", line 22, in _rbd_mirroring
    _, data = get_daemons_and_pools()
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 244, in wrapper
    return rvc.run(fn, args, kwargs)
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 226, in run
    raise self.exception
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 147, in run
    val = self.fn(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/rbd_mirroring.py", line 185, in get_daemons_and_pools
    daemons = get_daemons()
  File "/usr/share/ceph/mgr/dashboard/controllers/rbd_mirroring.py", line 56, in get_daemons
    status = json.loads(status['json'])
TypeError: 'NoneType' object is not subscriptable

While dashboard cannot and (IMHO) shouldn't handle all possible failures in core Ceph components, it should be at least:
- resilient to those failures,
- if that is not possible, able to keep failures from impacting other components (fault confinement).
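As a rough illustration of fault confinement, the summary could collect each section independently, so that a failure in one section (here rbd_mirroring) degrades only that section. This is a minimal sketch, not the dashboard's actual code: `build_summary`, the `sections` mapping, the `errors` key and `_broken_rbd_mirroring` are all hypothetical names introduced for the example.

```python
import logging

logger = logging.getLogger(__name__)


def build_summary(sections):
    """Collect each section independently so one failure cannot break the rest.

    `sections` (hypothetical) maps a summary key to a zero-argument callable.
    """
    result = {}
    errors = {}
    for name, collect in sections.items():
        try:
            result[name] = collect()
        except Exception as exc:  # deliberately broad: confine any failure
            logger.exception("summary section %r failed", name)
            result[name] = None
            errors[name] = str(exc)
    if errors:
        # Report the failures instead of hiding them.
        result['errors'] = errors
    return result


def _broken_rbd_mirroring():
    # Stand-in for the failing call: the status is None, as in the traceback.
    status = None
    return status['json']


summary = build_summary({
    'health_status': lambda: 'HEALTH_OK',
    'rbd_mirroring': _broken_rbd_mirroring,
})
```

With this shape, the healthy section is still returned, and the failing one is reported as unavailable rather than turning the whole response into an error.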
The error described in this specific issue is easy to fix (catch the TypeError exception). However, this approach is hard to maintain across the whole dashboard codebase (it would result in defensive programming, with try-excepts scattered around every line of code).
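For reference, the local fix could look roughly like the sketch below. It wraps the failing expression from the traceback, `json.loads(status['json'])`, in a helper. `parse_daemon_status` and the choice of an empty dict as fallback are assumptions made for the example, not the actual patch.

```python
import json


def parse_daemon_status(status):
    """Return the decoded status, or an empty dict if it is missing or malformed."""
    try:
        return json.loads(status['json'])
    except (TypeError, KeyError, ValueError):
        # TypeError: status (or status['json']) is None
        # KeyError: the 'json' key is absent
        # ValueError: the payload is not valid JSON
        return {}
```

Every call site that reads a similar status would need the same guard, which is the maintenance problem described above.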
A possible solution could be to add a validation & data adaptation layer between ceph-mgr API and the back-end. This layer would validate the expected inputs against a schema, and provide a single place to encode the fallback behaviour in case of validation failures (vs. scattered handling logic).
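One possible shape for such a layer is sketched below, using only the standard library. All names here (`ValidationError`, `validate`, `adapt`, `SERVICE_STATUS_SCHEMA`, `get_service_status`) are hypothetical, and the schema is a deliberately simple key-to-type mapping; a real implementation might use a schema library instead.

```python
import copy
import functools
import logging

logger = logging.getLogger(__name__)


class ValidationError(Exception):
    """Raised when data does not match the expected schema."""


def validate(data, schema):
    """Check that `data` is a dict whose keys have the types named in `schema`."""
    if not isinstance(data, dict):
        raise ValidationError(
            'expected dict, got {}'.format(type(data).__name__))
    for key, expected in schema.items():
        if key not in data:
            raise ValidationError('missing key {!r}'.format(key))
        if not isinstance(data[key], expected):
            raise ValidationError('key {!r} is {}, expected {}'.format(
                key, type(data[key]).__name__, expected.__name__))
    return data


def adapt(schema, fallback):
    """Decorator: validate the wrapped call's result, or return a copy of `fallback`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return validate(fn(*args, **kwargs), schema)
            except ValidationError as exc:
                logger.warning('%s: invalid data (%s), using fallback',
                               fn.__name__, exc)
                return copy.deepcopy(fallback)
        return wrapper
    return decorator


# Assumed schema for the example: a dict carrying a JSON string under 'json'.
SERVICE_STATUS_SCHEMA = {'json': str}


@adapt(SERVICE_STATUS_SCHEMA, fallback={'json': '{}'})
def get_service_status(raw):
    # Stand-in for a call into the ceph-mgr API; it just returns its input.
    return raw
```

With this in place, `get_service_status(None)` returns the fallback `{'json': '{}'}` instead of propagating a `TypeError` to callers, and the fallback behaviour is declared once, next to the schema.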