Feature #37500
open
ceph status/health hang when they could give helpful hints
0%
Description
Today I had an incident with my Ceph cluster that took down my infrastructure.
I am running Ceph(FS) 13.2.2 on Linux in triple-redundancy mode on 3 machines.
I had two uncorrelated failures on two different levels during the night:
- node 3 lost its network connection
- node 2 ran out of disk space some hours later
With two out of three nodes gone, all accesses to my CephFS mount hung indefinitely. The ceph-fuse process was still running, but `ceph status` and `ceph health` also hung and produced no output.
After investigating, I eventually found in the Ceph logs that Ceph had noticed that node 3 was no longer responding, and that it had noticed the "very low" disk space on node 2.
This allowed me to address the issue, but it still took me ~1.5 hours of downtime until I had analysed the situation and recovered.
If `ceph status` and `ceph health` had not hung, but had instead given me a hint that there were health problems (and which ones) before the other nodes stopped responding, I could have handled the situation much faster.
I am therefore requesting that `ceph status` and `ceph health` report problematic cluster health conditions already known to the local node, and print them before hanging while trying (and failing) to collect up-to-date health data from other nodes.
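
To illustrate the requested behaviour, here is a minimal Python sketch of a wrapper, not how Ceph implements anything. It prints whatever the local node already knows and then tries the cluster-wide query with a bounded wait instead of hanging forever. The function name `status_with_local_hints` and the `local_hints` parameter are hypothetical. How those hints would be collected (for example from the local logs) is left open here.

```python
import subprocess


def status_with_local_hints(cmd, timeout_s, local_hints):
    """Return locally known hints first, then the command output or a timeout note.

    cmd         -- command to run as a list, e.g. ["ceph", "status"]
    timeout_s   -- seconds to wait before giving up on the command
    local_hints -- health conditions already known locally (hypothetical input)
    """
    lines = ["locally known health conditions:"]
    if local_hints:
        lines += ["  - " + hint for hint in local_hints]
    else:
        lines.append("  (none)")

    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        lines.append("cluster response:")
        lines.append(result.stdout.rstrip("\n"))
    except subprocess.TimeoutExpired:
        lines.append("cluster query timed out after %s s" % timeout_s)

    return "\n".join(lines) + "\n"
```

In my incident, a call such as `status_with_local_hints(["ceph", "status"], 10, ["node 3 not responding", "node 2 disk space very low"])` would have shown both problems immediately, followed by the timeout note, rather than no output at all.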