Bug #38057: "ceph -s" hangs indefinitely when a machine running a monitor has failed storage. - RADOS - Ceph

Bug #38057

open

"ceph -s" hangs indefinitely when a machine running a monitor has failed storage.

Added by Michael Jones about 5 years ago. Updated about 5 years ago.

Status: New
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Source:
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

TL;DR: the bug is that "ceph -s" hangs indefinitely. It should eventually report failure.

I have a 3 node cluster, with 3 MONs, 3 MDSs, and 3 MGRs, and a handful of OSDs.

One of my nodes has a single SSD drive that stores the operating system, and then a handful of drives that are dedicated entirely to running OSDs.

That machine had a sudden SSD failure. It's still "online" in the sense that I can ping it, and attempting to ssh into it results in a password prompt (though I can't log in, sadly), but the backing storage for the operating system is hosed.

I haven't yet had an opportunity to get to the machine to perform maintenance, so it's still "online" as such.

What I've noticed, however, that's directly pertinent to Ceph, is that I'm completely unable to interact with my cluster.

As said in the subject, "ceph -s" hangs indefinitely.

mimir /var/log/ceph # ceph -s
2019-01-27 21:16:38.450 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
2019-01-27 21:21:38.451 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
2019-01-27 21:26:38.452 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
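
Until the CLI itself gives up, a hang like this can at least be bounded from the outside. The following is a minimal sketch, not part of the original report: `run_with_deadline` is a hypothetical helper name, and the deadline is whatever the caller chooses.

```python
import subprocess


def run_with_deadline(cmd, deadline_s):
    """Run cmd and return (exit_code, output).

    If the command does not finish within deadline_s seconds it is killed
    and (None, partial_output) is returned instead of blocking forever.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep monclient messages with the output
            text=True,
            timeout=deadline_s,
        )
        return done.returncode, done.stdout
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        # Depending on the Python version, partial output may arrive as bytes.
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return None, partial
```

For example, `run_with_deadline(["ceph", "-s"], 30)` would return `(None, ...)` after 30 seconds rather than sitting through repeated 300-second authenticate timeouts. The ceph CLI also documents a `--connect-timeout` option that may serve the same purpose, if the installed version supports it.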

mimir /var/log/ceph # cat ceph.log
2019-01-27 21:05:46.222549 mon.mimir mon.1 [fda8:941:2491:1699:60fa:e622:8345:2162]:6789/0 1 : cluster [INF] mon.mimir calling monitor election
2019-01-27 21:05:46.264136 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35971 : cluster [INF] mon.fenrir calling monitor election
2019-01-27 21:05:51.304989 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35972 : cluster [INF] mon.fenrir is new leader, mons fenrir,mimir in quorum (ranks 0,1)
2019-01-27 21:05:51.351559 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35977 : cluster [WRN] overall HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1 MDSs report slow metadata IOs; 7 osds down; Reduced data availability: 1349 pgs inactive, 67 pgs down; Degraded data redundancy: 655258/1360358 objects degraded (48.168%), 314 pgs degraded, 314 pgs undersized; 1/3 mons down, quorum fenrir,mimir

And I've attached ceph-mon.mimir.log

Please provide instructions on getting better debug information.

Files

ceph-mon.mimir.log (534 KB), added by Michael Jones, 01/28/2019 03:33 AM
ceph.conf (953 Bytes), added by Michael Jones, 01/30/2019 06:22 PM

Updated by Michael Jones about 5 years ago

I'll be performing maintenance on this machine soon.

This'll be the only chance anyone gets to get more debugging information out of it.

Updated by Greg Farnum about 5 years ago

Project changed from Ceph to RADOS
Component(RADOS) Monitor added

Is the dead node the one that isn't in quorum?
What's the ceph.conf on the client that can't complete "ceph -s"?

I think there are two possibilities here:
1) The client you're trying to query the cluster from only has the now-busted monitor listed in its config, so it can't fall back to trying the others and just hangs.
2) The busted monitor is still limping along enough to try to join the quorum, but its writes fail so it can't complete an election, and for some reason it is not timing out and suiciding.
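
Possibility 1 can be checked from the client side. A minimal sketch, assuming the monitors are listed under a single `mon host` (or `mon_host`) key as a comma-separated list of plain addresses, and that they listen on port 6789 (the port seen in the cluster log) unless a port is given. The function names are hypothetical, and neither per-monitor `[mon.X]` sections nor the bracketed `v1:`/`v2:` address syntax of newer releases is handled.

```python
import configparser
import re
import socket


def split_addr(entry, default_port=6789):
    """Split 'host', 'host:port', '[v6]' or '[v6]:port' into (host, port)."""
    entry = entry.strip()
    if entry.startswith("["):
        host, _, rest = entry[1:].partition("]")
        port = int(rest[1:]) if rest.startswith(":") else default_port
        return host, port
    if entry.count(":") == 1:
        host, _, port = entry.partition(":")
        return host, int(port)
    # Hostname, IPv4 address, or bare IPv6 address without a port.
    return entry, default_port


def monitors_from_conf(text):
    """Return the (host, port) pairs listed under mon host / mon_host."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(text)
    for section in parser.sections():
        for key, value in parser.items(section):
            # ceph.conf accepts spaces, dashes, or underscores in option names.
            if re.sub(r"[ \-]", "_", key) == "mon_host":
                return [split_addr(e) for e in re.split(r"[,;\s]+", value) if e]
    return []


def reachable(host, port, timeout_s=3.0):
    """True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Feeding the contents of the client's ceph.conf to `monitors_from_conf` and calling `reachable` on each pair shows how many monitors the client knows about and which of them accept connections. A successful connect only shows that something is listening; as with the ssh prompt on the failed node, it does not prove the monitor can make progress.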

Updated by Michael Jones about 5 years ago

File ceph.conf added

The node that had the failed SSD is "hoenir"
The node that I'm trying to use ceph commands from is "mimir".

I've attached mimir's ceph.conf. The ceph.conf is identical on all 3 nodes.

I imagine that the node with the failed SSD is the one that isn't in the quorum. If you have any commands that you'd like me to run to find out, I'd be happy to.

I believe that #2 is the more likely.
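
One way to confirm which monitor is missing from quorum without going through "ceph -s" is to ask a running monitor directly over its local admin socket with `ceph daemon mon.<id> mon_status`, which does not need a quorum. A minimal sketch, assuming the JSON output carries a `quorum` list of ranks and a `monmap.mons` list with a `rank` and `name` per monitor; the helper names and the 15-second deadline are hypothetical choices.

```python
import json
import subprocess


def out_of_quorum(mon_status):
    """Names of monitors in the monmap whose rank is not in the quorum list."""
    in_quorum = set(mon_status.get("quorum", []))
    mons = mon_status.get("monmap", {}).get("mons", [])
    return [m["name"] for m in mons if m["rank"] not in in_quorum]


def local_mon_status(mon_id, deadline_s=15):
    """Query the local monitor's admin socket; must run on the monitor's host."""
    done = subprocess.run(
        ["ceph", "daemon", f"mon.{mon_id}", "mon_status"],
        stdout=subprocess.PIPE,
        text=True,
        check=True,          # raise if the admin socket query fails
        timeout=deadline_s,  # never block indefinitely
    )
    return json.loads(done.stdout)
```

On mimir this would be used as `out_of_quorum(local_mon_status("mimir"))`.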
