Bug #38057

"ceph -s" hangs indefinitely when a machine running a monitor has failed storage.

Added by Michael Jones 9 months ago. Updated 9 months ago.


3 - minor


TL;DR -- the bug is that "ceph -s" hangs indefinitely. It should eventually report failure instead.

I have a 3 node cluster, with 3 MONs, 3 MDSs, and 3 MGRs, and a handful of OSDs.

One of my nodes has a single SSD drive that stores the operating system, and then a handful of drives that are dedicated entirely to running OSDs.

That machine had a sudden SSD failure. It's still "online" in the sense that I can ping it, and attempting to ssh into it results in a password prompt (but I can't log in, sadly), but the backing storage for the operating system is hosed.

I haven't yet had an opportunity to get to the machine to perform maintenance, so it's still "online" as such.

What I've noticed that's directly pertinent to Ceph, however, is that I'm completely unable to interact with my cluster.

As stated in the subject, "ceph -s" hangs indefinitely.

mimir /var/log/ceph # ceph -s
2019-01-27 21:16:38.450 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
2019-01-27 21:21:38.451 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
2019-01-27 21:26:38.452 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
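As a possible workaround while the cluster is in this state, the client can be pointed at a specific surviving monitor and the wait bounded, instead of letting it hunt through all configured mons. A sketch only; the address is mon.fenrir's, taken from the ceph.log excerpt below:

```shell
# Query a known-good monitor explicitly and cap the connection
# attempt at 30 seconds rather than hunting indefinitely.
ceph -s \
  -m '[fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789' \
  --connect-timeout 30
```

This requires a live Ceph cluster, so it is shown here only as an illustration.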

mimir /var/log/ceph # cat ceph.log
2019-01-27 21:05:46.222549 mon.mimir mon.1 [fda8:941:2491:1699:60fa:e622:8345:2162]:6789/0 1 : cluster [INF] mon.mimir calling monitor election
2019-01-27 21:05:46.264136 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35971 : cluster [INF] mon.fenrir calling monitor election
2019-01-27 21:05:51.304989 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35972 : cluster [INF] mon.fenrir is new leader, mons fenrir,mimir in quorum (ranks 0,1)
2019-01-27 21:05:51.351559 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35977 : cluster [WRN] overall HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1 MDSs report slow metadata IOs; 7 osds down; Reduced data availability: 1349 pgs inactive, 67 pgs down; Degraded data redundancy: 655258/1360358 objects degraded (48.168%), 314 pgs degraded, 314 pgs undersized; 1/3 mons down, quorum fenrir,mimir

I've attached ceph-mon.mimir.log.

Please provide instructions on getting better debug information.
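For anyone else hitting this: monitor debug levels can usually be raised through the mon's local admin socket, which works even when the monitor has no quorum. A sketch, assuming the default socket layout and the mon id "mimir":

```shell
# Raise verbosity on the local monitor via its admin socket
# (no quorum needed); output goes to the mon's log file.
ceph daemon mon.mimir config set debug_mon 20
ceph daemon mon.mimir config set debug_ms 1
ceph daemon mon.mimir config set debug_paxos 20
```

These commands need a running local mon daemon, so they are illustrative only.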

ceph-mon.mimir.log View (534 KB) Michael Jones, 01/28/2019 03:33 AM

ceph.conf View (953 Bytes) Michael Jones, 01/30/2019 06:22 PM


#1 Updated by Michael Jones 9 months ago

I'll be performing maintenance on this machine soon.

This will likely be the only chance anyone gets to collect more debugging information from it.

#2 Updated by Greg Farnum 9 months ago

  • Project changed from Ceph to RADOS
  • Component(RADOS) Monitor added

Is the dead node the one that isn't in quorum?
What's the ceph.conf on the client that can't complete "ceph -s"?

I think there are two possibilities here:
1) the client you're trying to query the cluster from only has the now-busted monitor listed in its config, so it can't fall back to trying the others and just hangs.
2) The busted monitor is still limping along enough to try and join the quorum, but fails its writes so can't complete an election, but is for some reason not timing out and suiciding.
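Both possibilities should be distinguishable without quorum via the monitors' local admin sockets. A sketch, assuming daemon ids match the hostnames in this thread:

```shell
# On a surviving node: report this mon's own state
# (probing/electing/peon/leader) and its view of the quorum.
# A mon stuck electing, with the busted one repeatedly trying to
# join, would point at possibility 2.
ceph daemon mon.mimir mon_status
ceph daemon mon.mimir quorum_status
```

Both commands require a running mon daemon on the node where they're executed.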

#3 Updated by Michael Jones 9 months ago

The node that had the failed SSD is "hoenir"
The node that I'm trying to use ceph commands from is "mimir".

I've attached mimir's ceph.conf. The ceph.conf is identical on all 3 nodes.
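For reference against possibility #1: a client can only fall back across monitors if its config actually lists more than the dead one. An illustrative fragment only (not the attached file); the hostnames are taken from this thread:

```ini
[global]
    # fsid and other settings elided. A client whose mon_host listed
    # only "hoenir" could not fall back to the surviving monitors and
    # would hang exactly as described.
    mon_host = fenrir, mimir, hoenir
```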

I imagine that the node with the failed SSD is the one that isn't in the quorum. If you have any commands that you'd like me to run to find out, I'd be happy to.

I believe that possibility #2 is the more likely.
