Bug #2299
all MDS commit suicide on startup
Status: Closed
Description
My setup is: 1 MON, 2 MDS, and 4 OSD.
Ceph version is commit 1e76a8713feac6883c648512dcdc28c83f7ff69e.
After copying about 300 GB into the cluster, followed by some reboots, the MDS servers choke on startup.
"ceph -s":
2012-04-14 18:25:18.838752 pg v49910: 594 pgs: 594 active+clean; 294 GB data, 584 GB used, 2426 GB / 3052 GB avail
2012-04-14 18:25:18.844659 mds e9201: 1/1/1 up {0=1=up:reconnect(laggy or crashed)}
2012-04-14 18:25:18.845061 osd e302: 4 osds: 4 up, 4 in
2012-04-14 18:25:18.845514 log 2012-04-14 18:13:57.753223 osd.0 192.168.32.177:6801/1505 163 : [WRN] mds.0 192.168.32.185:6800/6108 misdirected mds.0.63:45 1.b9 to osd.0 not [1,0] in e302/302
2012-04-14 18:25:18.853380 mon e2: 1 mons at {0=192.168.32.177:6789/0}
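For anyone debugging a similar "misdirected" warning: it can help to compare the cluster's current mapping for the PG named in the warning against the acting set the warning reports. A minimal sketch using the PG id (1.b9) and epoch (e302) from the log above; exact output format varies by Ceph version, and this obviously needs a running cluster:

```shell
# Which OSDs does the cluster currently map PG 1.b9 to?
# The warning claims the op went to osd.0 while the acting set was [1,0].
ceph pg map 1.b9

# Check the current OSD map epoch against the e302 in the warning,
# to see whether the client was acting on a stale map.
ceph osd dump | head -n 5
```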
Attached is a (short) log from starting one of the MDS daemons.
If you need more detail, I have an exhaustive log at debug level 99999999, but it is >250 MB uncompressed (6 MB compressed).
My question is: how do I repair this?
Also, the MDS should be changed to cope with this error condition instead of bailing out.
Files
Updated by Martin Scheffler about 12 years ago
After I told osd.0 to get lost and reformatted it, the cluster started resyncing.
Then (magically) mds.0 started up OK.
But the underlying problem with the MDS server still needs to be fixed.
IMHO, the MDS could probe the other OSDs for the blob in question.
I am trying to understand the source code right now.
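For reference, the "tell osd.0 to get lost and reformat it" step above would look roughly like the following. This is a hedged sketch: the device path and mount point are examples, the service-management commands differ across init systems, and the ceph CLI of the 2012-era version in this report may not match exactly:

```shell
# Mark the suspect OSD out so its data rebalances onto the remaining OSDs.
ceph osd out 0

# Stop the daemon before touching its backing store
# (init-system dependent; this assumes the classic sysvinit script).
service ceph stop osd.0

# Reformat the OSD's backing filesystem.
# /dev/sdb1 and the mount point are placeholders for this example.
umount /var/lib/ceph/osd/ceph-0
mkfs.xfs -f /dev/sdb1
mount /dev/sdb1 /var/lib/ceph/osd/ceph-0

# Reinitialize the OSD's data directory and bring it back in.
ceph-osd -i 0 --mkfs
service ceph start osd.0
ceph osd in 0
```

Since the root cause here turned out to be filesystem corruption on osd.0, running a read-only check first (e.g. `xfs_repair -n /dev/sdb1` on an unmounted device) would confirm the diagnosis before destroying the data.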
Updated by Martin Scheffler about 12 years ago
This issue can be closed; there was an error in the underlying filesystem of osd.0. :)
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.