Bug #43008
mgr/dashboard: a failure in rbd-mirror makes other dashboard pages fail
Description
During QE testing, in the course of a build upgrade, a previous rbd-mirror daemon hung and a new one started running. While this situation is external to the dashboard, it caused failures not only in the RBD-related pages, but also in Pools and Hosts.
The cause is that the summary endpoint raises an exception:
traceback: "Traceback (most recent call last):
  File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/lib/python3.6/site-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cptools.py", line 237, in wrap
    return self.newhandler(innerfunc, *args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 88, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line 649, in inner
    ret = func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/summary.py", line 86, in __call__
    result['rbd_mirroring'] = self._rbd_mirroring()
  File "/usr/share/ceph/mgr/dashboard/controllers/summary.py", line 22, in _rbd_mirroring
    _, data = get_daemons_and_pools()
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 244, in wrapper
    return rvc.run(fn, args, kwargs)
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 226, in run
    raise self.exception
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 147, in run
    val = self.fn(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/rbd_mirroring.py", line 185, in get_daemons_and_pools
    daemons = get_daemons()
  File "/usr/share/ceph/mgr/dashboard/controllers/rbd_mirroring.py", line 56, in get_daemons
    status = json.loads(status['json'])
TypeError: 'NoneType' object is not subscriptable
While the dashboard cannot and (IMHO) shouldn't handle all possible failures in core Ceph components, it should at least:
- be resilient to those failures,
- if that is not possible, keep failures from impacting other components (fault confinement).
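To illustrate the fault-confinement point, here is a minimal sketch of a summary builder in which a failing section cannot take down the others. This is not the dashboard's actual code: `build_summary`, the section names and the `{"error": ...}` placeholder are hypothetical, chosen only to show the idea.

```python
import logging

logger = logging.getLogger(__name__)


def build_summary(sections):
    """Build a summary dict, confining each section's failure to itself.

    `sections` maps a section name to a zero-argument callable.
    (Hypothetical helper; not part of the dashboard codebase.)
    """
    result = {}
    for name, collect in sections.items():
        try:
            result[name] = collect()
        except Exception as exc:  # deliberately broad: confine any failure
            logger.exception("summary section %r failed", name)
            result[name] = {"error": type(exc).__name__}
    return result


def broken_rbd_mirroring():
    status = None  # stand-in for what a hung daemon might report
    return status["json"]  # raises TypeError, as in the traceback


summary = build_summary({
    "health": lambda: {"status": "HEALTH_OK"},
    "rbd_mirroring": broken_rbd_mirroring,
})
```

With this shape, the `health` section is still returned while `rbd_mirroring` carries an error marker that the front-end could render as "unavailable".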
The error described in this specific issue is easy to fix (catch the TypeError exception). However, this approach is hard to maintain across the whole dashboard codebase (it would result in defensive programming and try-excepts scattered across every line of code).
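For reference, the narrow fix might look roughly like the sketch below. `parse_daemon_status` is a hypothetical helper; the only detail taken from the traceback is that `status` can be `None` when `json.loads(status['json'])` runs.

```python
import json


def parse_daemon_status(status):
    """Return the decoded status dict, or {} if nothing usable was reported."""
    try:
        return json.loads(status["json"])
    except (TypeError, KeyError, ValueError):
        # TypeError: status is None (the case in this issue);
        # KeyError: no 'json' field;
        # ValueError: payload is not valid JSON (JSONDecodeError subclasses it).
        return {}
```

Repeating this pattern at every call site that touches ceph-mgr data is exactly the maintenance burden described above.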
A possible solution could be to add a validation & data adaptation layer between ceph-mgr API and the back-end. This layer would validate the expected inputs against a schema, and provide a single place to encode the fallback behaviour in case of validation failures (vs. scattered handling logic).
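A minimal sketch of what such a layer could look like, using plain Python rather than any particular schema library. Everything here is an assumption for illustration: `Field`, `adapt` and `DAEMON_STATUS_SCHEMA` are hypothetical names, and the field list is invented rather than taken from the real daemon status format.

```python
import json
from typing import Any, Dict, NamedTuple, Optional


class Field(NamedTuple):
    type_: type   # expected Python type of the value
    default: Any  # fallback used when the value is missing or invalid


# Hypothetical schema; the real daemon status fields may differ.
DAEMON_STATUS_SCHEMA = {
    "json": Field(str, "{}"),
    "state": Field(str, "unknown"),
}


def adapt(raw: Optional[Dict[str, Any]],
          schema: Dict[str, Field]) -> Dict[str, Any]:
    """Validate `raw` against `schema`, substituting defaults on failure."""
    source = raw if isinstance(raw, dict) else {}
    adapted = {}
    for key, field in schema.items():
        value = source.get(key)
        adapted[key] = value if isinstance(value, field.type_) else field.default
    return adapted


# A missing status (None) no longer reaches the controller code as-is.
status = adapt(None, DAEMON_STATUS_SCHEMA)
details = json.loads(status["json"])
```

The fallback policy lives in the schema, so controllers can index into `status` without their own try-excepts.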
Related issues
History
#1 Updated by Ernesto Puerta about 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 31881
#2 Updated by Ernesto Puerta about 4 years ago
- Assignee set to Ernesto Puerta
#3 Updated by Ernesto Puerta about 4 years ago
- Status changed from Fix Under Review to Duplicate
Duplicate of: https://github.com/ceph/ceph/pull/31907
#4 Updated by Ricardo Marques about 4 years ago
- Duplicates Bug #43029: mgr/dashboard: RBD mirroring page results in "500 - internal server error" added
#5 Updated by Ernesto Puerta over 2 years ago
- Project changed from mgr to Dashboard
- Category changed from 140 to Component - RBD Mirroring