Feature #37500

open

ceph status/health hang when they could give helpful hints

Added by Niklas Hambuechen over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Low
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%


Description

Today I had an incident with my Ceph cluster that took down my infrastructure.

I am running Ceph(FS) 13.2.2 on Linux in triple-redundancy mode on 3 machines.

I had two uncorrelated failures on two different levels during the night:

  • node 3 lost its network connection
  • node 2 ran out of disk space some hours later

With two out of 3 nodes gone, all accesses to my CephFS mount hung forever. The ceph-fuse process was still running, but `ceph status` and `ceph health` would hang forever and not produce any output.

After investigating, I eventually found in the Ceph logs that Ceph had noticed node 3 no longer responding, and that it had noticed the "very low" disk space on node 2.

This allowed me to address the issue, but it still took me ~1.5 hours of downtime until I had analysed the situation and recovered.

If `ceph status` and `ceph health` had not hung, but instead given me a hint that there were health problems (and which) before the other nodes stopped responding, I could have handled the situation much faster.

Thus I'm feature-requesting here that `ceph status` and `ceph health` be able to point out problematic cluster health conditions that were known to the local node, and output them before hanging when trying (and failing) to collect up-to-date health data from other nodes.
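
To illustrate the requested behaviour, here is a minimal Python sketch of a wrapper, not Ceph's actual implementation. It runs a status command with a time limit and, if the command does not return in time, prints health hints that were already known locally. The function name `status_with_fallback`, the timeout value, and the `local_hints` list are all hypothetical; how Ceph would actually obtain locally known health conditions is not specified in this request.

```python
import subprocess
import sys


def status_with_fallback(cmd, timeout_s, local_hints):
    """Run `cmd`; if it does not finish within `timeout_s` seconds,
    return the locally known hints instead of hanging.

    Returns a (source, text) tuple, where source is "live" or "local".
    `local_hints` is a hypothetical list of strings; in a real fix these
    would come from whatever health state the local node already has.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return ("live", result.stdout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return ("local", "\n".join(local_hints))


# Hypothetical usage, with hint texts modelled on the incident described above:
#   source, text = status_with_fallback(
#       ["ceph", "status"], 10,
#       ["node 3 is not responding", "node 2 has very low disk space"],
#   )
#   print(text)
```

With such a wrapper, an administrator would see the last known problems after the timeout instead of an indefinitely blocked terminal.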
