Bug #1186
Status: Closed
Cluster won't recover, OSDs go up and down again (and stay down)
Description
Ok, the title might be somewhat confusing, but so is the issue :)
I'm still trying to get my 40 OSD cluster back into a healthy state, but the recovery never finishes.
After a lot of bugs my OSDs don't assert anymore (for now), but they won't come up either.
I started my cluster (d2b7e291f21928f9f0a3e23fb32c94c9cbbc8984) this morning and slowly the OSDs started to come up, one by one, until 26 were up/in. After that the count started to go down, until I reached:
2011-06-14 13:58:31.692401    pg v660657: 10608 pgs: 333 inactive, 153 active+clean, 107 active+degraded, 134 active+clean+degraded, 7269 crashed+down+peering, 2612 crashed+down+degraded+peering; 2108 GB data, 0 KB used, 0 KB / 0 KB avail; 245445/1626390 degraded (15.091%)
2011-06-14 13:58:31.692564   osd e43540: 40 osds: 0 up, 0 in
What I did notice is that the whole time, the state of the cluster stayed at:
2011-06-14 13:58:31.692401    pg v660657: 10608 pgs: 333 inactive, 153 active+clean, 107 active+degraded, 134 active+clean+degraded, 7269 crashed+down+peering, 2612 crashed+down+degraded+peering; 2108 GB data, 0 KB used, 0 KB / 0 KB avail; 245445/1626390 degraded (15.091%)
2011-06-14 13:58:31.692564   osd e43540: 40 osds: 0 up, 0 in
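Not from the report, but a quick sketch of how one can rank the PG states in a `ceph -s` style pg line by count, so the dominant (crashed+down) states stand out; the sample line is copied from the status output above:

```shell
# Rank PG states by count, largest first.
# The pg-state list is taken verbatim from the status line above.
pgline='333 inactive, 153 active+clean, 107 active+degraded, 134 active+clean+degraded, 7269 crashed+down+peering, 2612 crashed+down+degraded+peering'
echo "$pgline" | tr ',' '\n' | sort -rn
```

Here the top two lines would be the 7269 crashed+down+peering and 2612 crashed+down+degraded+peering PGs, i.e. over 90% of the 10608 PGs.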
The first thing I did was verify that all cosd processes were running, and yes, they are:
root@monitor:~# dsh -g osd-mdb "pidof cosd|wc -w"
4
4
4
4
4
4
4
4
3
3
root@monitor:~#
In the last two boxes I have two crashed disks, so I have 38 working OSDs.
At first I thought it was the Atom CPU, but the load on the machines isn't that high:
root@atom0:~# ps aux|grep cosd
root      3240 22.1 24.1 1493176  981080 ?     Ssl  11:54  29:02 /usr/bin/cosd -i 0 -c /etc/ceph/ceph.conf
root      3354 70.3  6.9 1336408  281720 ?     Ssl  11:54  91:58 /usr/bin/cosd -i 1 -c /etc/ceph/ceph.conf
root      3627 20.1 24.6 1523364 1002060 ?     Ssl  11:54  26:17 /usr/bin/cosd -i 2 -c /etc/ceph/ceph.conf
root      3900 20.0 27.0 1472972 1097488 ?     Ssl  11:54  26:07 /usr/bin/cosd -i 3 -c /etc/ceph/ceph.conf
root     10566  0.0  0.0    7676     828 pts/0 S+   14:04   0:00 grep --color=auto cosd
root@atom0:~# uptime
 14:04:53 up 2:46, 1 user, load average: 1.00, 1.04, 1.16
root@atom0:~#
'debug osd = 20' is set on all the OSDs, so I have enough log information. Checking out the logs, the OSDs all seem to be doing different stuff, but they actually are active and alive!
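For reference, the debug setting mentioned above would look roughly like this in ceph.conf (a sketch; the exact section placement is an assumption based on standard ceph.conf conventions):

```ini
; applies to all OSDs that read this ceph.conf
[osd]
	debug osd = 20
```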
My goal is still to recover this cluster, as it seems (imho) to be a pretty good test case for bringing a downed cluster back to life, isn't it?
The logs are growing pretty fast, about 15G per hour, so uploading them isn't really an option.
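One way around the 15G/hour problem could be to share only a time window around the failure instead of whole logs; a sketch, where the log path, timestamps, and line contents are made up for illustration (the real cosd logs obviously differ):

```shell
# Create a tiny stand-in for a cosd log (illustration only).
printf '2011-06-14 13:50:01 osd0 peering\n2011-06-14 14:10:02 osd0 boot\n' > /tmp/osd.sample.log

# Keep only the 13:50-13:59 window and compress it before uploading.
grep '^2011-06-14 13:5' /tmp/osd.sample.log | gzip > /tmp/osd.window.log.gz
```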