Bug #2299
all MDS commit suicide on startup
Status: Closed
Description
My setup is: 1 MON, 2 MDS, and 4 OSD.
Ceph version is commit 1e76a8713feac6883c648512dcdc28c83f7ff69e.
After copying about 300 GB into the cluster, followed by some reboots, the MDS servers choke on startup.
"ceph -s":
2012-04-14 18:25:18.838752 pg v49910: 594 pgs: 594 active+clean; 294 GB data, 584 GB used, 2426 GB / 3052 GB avail
2012-04-14 18:25:18.844659 mds e9201: 1/1/1 up {0=1=up:reconnect(laggy or crashed)}
2012-04-14 18:25:18.845061 osd e302: 4 osds: 4 up, 4 in
2012-04-14 18:25:18.845514 log 2012-04-14 18:13:57.753223 osd.0 192.168.32.177:6801/1505 163 : [WRN] mds.0 192.168.32.185:6800/6108 misdirected mds.0.63:45 1.b9 to osd.0 not [1,0] in e302/302
2012-04-14 18:25:18.853380 mon e2: 1 mons at {0=192.168.32.177:6789/0}
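For anyone debugging a similar "misdirected" warning: it can help to compare the cluster's current mapping for the PG named in the warning against the acting set the warning reports. A minimal sketch using the PG id (1.b9) and epoch (e302) from the log above; exact output format varies by Ceph version, and this obviously needs a running cluster:

```shell
# Which OSDs does the cluster currently map PG 1.b9 to?
# The warning claims the op went to osd.0 while the acting set was [1,0].
ceph pg map 1.b9

# Check the current OSD map epoch against the e302 in the warning,
# to see whether the client was acting on a stale map.
ceph osd dump | head -n 5
```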
Attached is a (short) log from starting one of the MDS daemons.
If you need more detail, I have an exhaustive log at debug level 99999999, but it is >250 MB uncompressed (6 MB compressed).
My question is: how do I repair this?
Also, the MDS should be changed to cope with this error condition instead of bailing out.
Files
Updated by Martin Scheffler about 12 years ago
After I told osd.0 to get lost and reformatted it, the cluster started resyncing.
Then (magically) mds.0 started up OK.
But the underlying problem with the MDS server still needs to be fixed.
IMHO, the MDS could probe the other OSDs for the blob in question.
I am trying to understand the source code right now.
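For reference, the "tell osd.0 to get lost and reformat it" step above would look roughly like the following. This is a hedged sketch: the device path and mount point are examples, the service-management commands differ across init systems, and the ceph CLI of the 2012-era version in this report may not match exactly:

```shell
# Mark the suspect OSD out so its data rebalances onto the remaining OSDs.
ceph osd out 0

# Stop the daemon before touching its backing store
# (init-system dependent; this assumes the classic sysvinit script).
service ceph stop osd.0

# Reformat the OSD's backing filesystem.
# /dev/sdb1 and the mount point are placeholders for this example.
umount /var/lib/ceph/osd/ceph-0
mkfs.xfs -f /dev/sdb1
mount /dev/sdb1 /var/lib/ceph/osd/ceph-0

# Reinitialize the OSD's data directory and bring it back in.
ceph-osd -i 0 --mkfs
service ceph start osd.0
ceph osd in 0
```

Since the root cause here turned out to be filesystem corruption on osd.0, running a read-only check first (e.g. `xfs_repair -n /dev/sdb1` on an unmounted device) would confirm the diagnosis before destroying the data.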
Updated by Martin Scheffler about 12 years ago
This issue can be closed; there was an error in the underlying filesystem of osd.0. :)
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.