Bug #1186

Cluster won't recover, OSDs go up and down again (and stay down)

Added by Wido den Hollander almost 13 years ago. Updated almost 13 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%

Description

Ok, the title might be somewhat confusing, but so is the issue :)

I'm still trying to get my 40-OSD cluster back into a healthy state, but recovery never finishes.

After hitting a lot of bugs, my OSDs don't assert anymore (for now), but they won't come up either.

I started my cluster (commit d2b7e291f21928f9f0a3e23fb32c94c9cbbc8984) this morning and slowly the OSDs came up, one by one, until 26 were up/in. After that the count started to drop, until I reached:

2011-06-14 13:58:31.692401    pg v660657: 10608 pgs: 333 inactive, 153 active+clean, 107 active+degraded, 134 active+clean+degraded, 7269 crashed+down+peering, 2612 crashed+down+degraded+peering; 2108 GB data, 0 KB used, 0 KB / 0 KB avail; 245445/1626390 degraded (15.091%)
2011-06-14 13:58:31.692564   osd e43540: 40 osds: 0 up, 0 in

What I did notice is that the cluster state stayed stuck at exactly the status above the whole time.
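
For the record, a polling loop along these lines makes it easy to log how the up/in count moves over time (a minimal sketch; it assumes the ceph CLI can reach the monitors through the default /etc/ceph/ceph.conf):

# Log the cluster's OSD up/in line every 30 seconds
# (sketch; assumes 'ceph -s' works against the monitors)
while true; do
    date '+%Y-%m-%d %H:%M:%S'
    ceph -s 2>/dev/null | grep 'osds:'
    sleep 30
done | tee osd-up-down.log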

The first thing I did was verify that all cosd processes were running, and yes, they are:

root@monitor:~# dsh -g osd-mdb "pidof cosd|wc -w" 
4
4
4
4
4
4
4
4
3
3
root@monitor:~#

In the last two boxes I have two crashed disks, so I have 38 working OSDs.
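
To pin down which instance is the missing one on those two boxes, a loop like this would do (hypothetical; the id range has to be adjusted to whatever OSDs are configured on each host):

# Report which of this host's cosd instances are not running
# (hypothetical; replace the ids with the ones configured on this box)
for id in 0 1 2 3; do
    pgrep -f "cosd -i $id" > /dev/null || echo "osd.$id: not running"
done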

At first I thought it was the Atom CPUs, but the load on the machines isn't that high:

root@atom0:~# ps aux|grep cosd
root      3240 22.1 24.1 1493176 981080 ?      Ssl  11:54  29:02 /usr/bin/cosd -i 0 -c /etc/ceph/ceph.conf
root      3354 70.3  6.9 1336408 281720 ?      Ssl  11:54  91:58 /usr/bin/cosd -i 1 -c /etc/ceph/ceph.conf
root      3627 20.1 24.6 1523364 1002060 ?     Ssl  11:54  26:17 /usr/bin/cosd -i 2 -c /etc/ceph/ceph.conf
root      3900 20.0 27.0 1472972 1097488 ?     Ssl  11:54  26:07 /usr/bin/cosd -i 3 -c /etc/ceph/ceph.conf
root     10566  0.0  0.0   7676   828 pts/0    S+   14:04   0:00 grep --color=auto cosd
root@atom0:~# uptime
 14:04:53 up  2:46,  1 user,  load average: 1.00, 1.04, 1.16
root@atom0:~#
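
If the CPUs really were the bottleneck, the cosd processes would sit pegged near 100%; sampling them for a while makes that easy to rule out (a sketch; pidstat comes from the sysstat package):

# Sample per-process CPU of all cosd instances every 5 seconds, 12 times
# (requires the sysstat package for pidstat)
pidstat -p "$(pidof cosd | tr ' ' ',')" 5 12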

'debug osd = 20' is set on all the OSDs, so I have enough log information. Looking through the logs, the OSDs all seem to be doing different things, but they actually are active and alive!
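
For reference, that debug level is set in the [osd] section of ceph.conf (standard syntax; the 'debug ms' line is just an example of a related knob, not something from this cluster's config):

[osd]
    debug osd = 20
    ; 'debug ms = 1' would additionally log messenger (network) traffic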

My goal is still to recover this cluster; it seems (imho) to be a pretty good test case for bringing a downed cluster back to life, doesn't it?

The logs are growing pretty fast, about 15G per hour, so uploading them isn't really an option.
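
Pulling a narrow time window out of one OSD's log and compressing it would keep an upload manageable (a sketch; the log path is only an example, the real location depends on the 'log dir'/'log file' settings):

# Extract one hour of osd.0's log and compress it
# (example path; adjust to the actual 'log dir' setting)
grep '^2011-06-14 13:' /var/log/ceph/osd.0.log | gzip > osd.0-hour13.log.gz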
