Feature #37500: ceph status/health hang when they could give helpful hints - RADOS - Ceph

Feature #37500

open

ceph status/health hang when they could give helpful hints

Added by Niklas Hambuechen over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Low
Assignee:
Category: Administration/Usability
Target version:
% Done:
Source:
Tags:
Backport:
Reviewed:
Affected Versions: Ceph - v13.2.2
Component(RADOS):
Pull request ID:

Description

Today I had an incident with my Ceph cluster that took down my infrastructure.

I am running Ceph(FS) 13.2.2 on Linux in triple-redundancy mode on 3 machines.

I had two uncorrelated failures on two different levels during the night:

node 3 lost its network connection
node 2 ran out of disk space some hours later

With two out of 3 nodes gone, all accesses to my CephFS mount hung forever. The ceph-fuse process was still running, but `ceph status` and `ceph health` would hang forever and not produce any output.

After investigating, I eventually found in the Ceph logs that Ceph had noticed node 3 no longer responding, and that it had noticed the "very low" disk space on node 2.

This allowed me to address the issue, but it still took me ~1.5 hours of downtime until I had analysed the situation and recovered.

If `ceph status` and `ceph health` had not hung, but instead given me a hint that there were health problems (and which) before the other nodes stopped responding, I could have handled the situation much faster.

Thus I'm feature-requesting here that `ceph status` and `ceph health` be able to point out problematic cluster health conditions that were known to the local node, and output them before hanging when trying (and failing) to collect up-to-date health data from other nodes.
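To illustrate the behaviour I have in mind, here is a rough sketch of a wrapper, not how Ceph works today and not a proposed implementation. It runs a status command with a deadline and, if the deadline passes, prints whatever hints the local node already has instead of hanging silently. The function name `status_with_fallback`, the `local_hints` list, and the timeout value are all made up for illustration; where the hints would come from (for example, the local daemon's log) is left open.

```python
import subprocess


def status_with_fallback(cmd, timeout_s, local_hints):
    """Run cmd; if it does not finish within timeout_s seconds,
    return locally known hints instead of blocking forever.

    cmd         -- command to run as a list, e.g. ["ceph", "status"]
    timeout_s   -- hypothetical deadline in seconds
    local_hints -- hypothetical list of problems already known locally
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return done.stdout
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run on timeout.
        lines = [f"timed out after {timeout_s}s waiting for: {' '.join(cmd)}"]
        lines += [f"locally known: {hint}" for hint in local_hints]
        return "\n".join(lines)
```

In my incident, a call like `status_with_fallback(["ceph", "status"], 10, ["node 3 not responding", "node 2 disk space very low"])` would have pointed me at both failures right away.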

Updated by Greg Farnum over 5 years ago