Bug #38057


"ceph -s" hangs indefinitely when a machine running a monitor has failed storage.

Added by Michael Jones over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

TL;DR: the bug is that "ceph -s" hangs indefinitely; it should eventually report failure instead.

I have a 3-node cluster with 3 MONs, 3 MDSs, 3 MGRs, and a handful of OSDs.

One of my nodes has a single SSD that stores the operating system, plus a handful of drives dedicated entirely to running OSDs.

That machine had a sudden SSD failure. It's still "online" in the sense that I can ping it, and attempting to ssh into it produces a password prompt (though I can't actually log in, sadly), but the backing storage for the operating system is hosed.

I haven't yet had an opportunity to get to the machine to perform maintenance, so it's still "online" as such.

What I've noticed that's directly pertinent to Ceph, however, is that I'm completely unable to interact with the cluster.

As stated in the subject, "ceph -s" hangs indefinitely.

mimir /var/log/ceph # ceph -s
2019-01-27 21:16:38.450 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
2019-01-27 21:21:38.451 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
2019-01-27 21:26:38.452 7f0d1249e700 0 monclient(hunting): authenticate timed out after 300
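
For what it's worth, I understand the hang can be bounded from the client side; a sketch, assuming the CLI's global --connect-timeout option (in seconds) makes the command fail after the timeout instead of retrying forever:

mimir /var/log/ceph # ceph -s --connect-timeout 10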

mimir /var/log/ceph # cat ceph.log
2019-01-27 21:05:46.222549 mon.mimir mon.1 [fda8:941:2491:1699:60fa:e622:8345:2162]:6789/0 1 : cluster [INF] mon.mimir calling monitor election
2019-01-27 21:05:46.264136 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35971 : cluster [INF] mon.fenrir calling monitor election
2019-01-27 21:05:51.304989 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35972 : cluster [INF] mon.fenrir is new leader, mons fenrir,mimir in quorum (ranks 0,1)
2019-01-27 21:05:51.351559 mon.fenrir mon.0 [fda8:941:2491:1699:b45:a2e6:1383:2b98]:6789/0 35977 : cluster [WRN] overall HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1 MDSs report slow metadata IOs; 7 osds down; Reduced data availability: 1349 pgs inactive, 67 pgs down; Degraded data redundancy: 655258/1360358 objects degraded (48.168%), 314 pgs degraded, 314 pgs undersized; 1/3 mons down, quorum fenrir,mimir
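
If it helps, the surviving mons should still answer over their local admin sockets, which (as I understand it) doesn't require quorum; a sketch, run on the node hosting the mon, using what I believe is the default socket path:

mimir /var/log/ceph # ceph daemon mon.mimir mon_status
mimir /var/log/ceph # ceph --admin-daemon /var/run/ceph/ceph-mon.mimir.asok quorum_status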

I've also attached ceph-mon.mimir.log.

Please provide instructions on getting better debug information.
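
In the meantime, here's a sketch of the extra client-side verbosity I could capture, assuming the usual debug_monc/debug_ms overrides apply on the command line (the debug output goes to stderr):

mimir /var/log/ceph # ceph -s --debug-monc 20 --debug-ms 1 2> ceph-client-debug.log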


Files

ceph-mon.mimir.log (534 KB) - Michael Jones, 01/28/2019 03:33 AM
ceph.conf (953 Bytes) - Michael Jones, 01/30/2019 06:22 PM
