Feature #37500

ceph status/health hang when they could give helpful hints

Added by Niklas Hambuechen over 5 years ago. Updated over 5 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Administration/Usability
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Today I had an incident with my Ceph cluster that took down my infrastructure.

I am running Ceph(FS) 13.2.2 on Linux in triple-redundancy mode on 3 machines.

I had two uncorrelated failures on two different levels during the night:

  • node 3 lost its network connection
  • node 2 ran out of disk space some hours later

With two out of three nodes gone, all accesses to my CephFS mount hung forever. The ceph-fuse process was still running, but `ceph status` and `ceph health` would hang forever and not produce any output.

After some investigation, I eventually found in the Ceph logs that Ceph had noticed node 3 no longer responding, and that it had noticed the "very low" disk space on node 2.

This allowed me to address the issue, but it still took me ~1.5 hours of downtime until I had analysed the situation and recovered.

If `ceph status` and `ceph health` had not hung, but instead given me a hint that there were health problems (and which) before the other nodes stopped responding, I could have handled the situation much faster.

Thus I'm feature-requesting here that `ceph status` and `ceph health` be able to point out problematic cluster health conditions that were known to the local node, and output them before hanging when trying (and failing) to collect up-to-date health data from other nodes.

Actions #1

Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS
  • Category set to Administration/Usability
  • Priority changed from Normal to Low

Hmm, perhaps we could fall back to showing the output of other commands when connections to the monitor seem to be hanging, as they might provide useful info?
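
Until something like this exists in the CLI itself, the fallback idea can be approximated from the outside. The following is only a rough sketch of a wrapper script, not how Ceph implements or would implement it: the function names `run_with_timeout` and `status_with_fallback` are made up for illustration, and which fallback commands are actually useful on a given node is an assumption the operator has to fill in.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, output).

    ok is False if the command timed out, could not be started,
    or exited with a non-zero status.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, ""
    except OSError:
        # e.g. the binary is not installed on this machine
        return False, ""
    return proc.returncode == 0, proc.stdout


def status_with_fallback(primary, fallbacks, timeout_s=10.0):
    """Try the primary command; if it hangs or fails, collect local hints.

    primary and each entry of fallbacks are argument lists such as
    ["ceph", "status"]. The fallbacks are whatever local commands the
    operator considers informative (hypothetical choice, not a Ceph default).
    """
    ok, out = run_with_timeout(primary, timeout_s)
    if ok:
        return out

    hints = []
    for cmd in fallbacks:
        ok, out = run_with_timeout(cmd, timeout_s)
        if ok and out.strip():
            hints.append("$ " + " ".join(cmd) + "\n" + out)

    header = "primary command failed or did not finish within {}s; local hints:\n".format(timeout_s)
    return header + "\n".join(hints)
```

For the incident described above, one might call `status_with_fallback(["ceph", "status"], [["ceph", "daemon", "mon.node1", "mon_status"], ["df", "-h", "/var/lib/ceph"]])`. Here `mon.node1` is a placeholder monitor name and `/var/lib/ceph` assumes the default data directory; the admin-socket query is chosen because it talks only to the local daemon and so should not depend on reaching the other nodes.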
