Bug #461
Hanging OSD during recovery
Status:
Closed
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Description
While my cluster was recovering from a few OSD crashes, one of my OSDs hung.
root@node02:~# ps aux|grep cosd
root 14773 9.4 94.9 7929808 3849936 ? Dsl 20:38 2:04 /usr/bin/cosd -i 1 -c /etc/ceph/ceph.conf
root 15490 0.0 0.0 7672 820 pts/0 D+ 21:00 0:00 grep --color=auto cosd
root@node02:~#
As you can see, the OSD is using 94.9% of the memory (about 3.7 GiB resident) and is in uninterruptible sleep (state "D"), waiting for I/O.
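For readers less familiar with the ps columns, here is a minimal sketch of how the cosd line above can be decoded. The helper name parse_ps_aux_line and the STATE_MEANINGS table are made up for this illustration and are not part of Ceph; the sketch assumes the standard "ps aux" column order and that RSS is reported in KiB.

```python
# Hypothetical helper (not part of Ceph): decode one line of `ps aux` output.
# Assumes the usual column order:
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
STATE_MEANINGS = {
    "D": "uninterruptible sleep (usually waiting for I/O)",
    "R": "running or runnable",
    "S": "interruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}


def parse_ps_aux_line(line):
    # Split into at most 11 fields so the command keeps its own spaces.
    fields = line.split(None, 10)
    stat = fields[7]
    return {
        "pid": int(fields[1]),
        "mem_percent": float(fields[3]),
        "rss_kib": int(fields[5]),  # assumption: RSS is in KiB
        "stat": stat,
        # The first character of STAT is the process state; the rest are flags.
        "state": STATE_MEANINGS.get(stat[0], "unknown"),
        "command": fields[10],
    }


line = ("root 14773 9.4 94.9 7929808 3849936 ? Dsl 20:38 2:04 "
        "/usr/bin/cosd -i 1 -c /etc/ceph/ceph.conf")
info = parse_ps_aux_line(line)
rss_gib = round(info["rss_kib"] / 1024 ** 2, 2)
print(info["pid"], info["stat"], info["state"])
print(rss_gib, "GiB resident,", info["mem_percent"], "% of memory")
```

A process in state "D" does not act on signals until the blocking operation returns, which would be consistent with the OSD not reacting to being killed.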
The logs show:
root@node02:~# date
Mon Oct 4 21:01:19 CEST 2010
root@node02:~# tail /var/log/ceph/osd.1.log
2010-10-04 20:44:55.539151 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.595650 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.689574 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.799051 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.828022 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.859495 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.977324 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:56.007724 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:56.068909 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:56.126037 7f39ce8b1710 journal throttle: waited for ops
root@node02:~#
As you can see, the OSD has been hanging for more than 15 minutes now: the last log entry is from 20:44:56, and the current time is 21:01:19.
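The length of the stall can be worked out from the two timestamps in the output above. This is only a sketch: the function name stall_seconds is made up for this illustration, the "date" output is transcribed by hand into the log's timestamp layout, and it assumes the log timestamps are in the same local time zone (CEST) as the "date" output.

```python
from datetime import datetime


def stall_seconds(last_log_stamp, now_stamp):
    """Seconds between the last log entry and the current time.

    Assumption: both stamps are in the same local time zone.
    """
    last = datetime.strptime(last_log_stamp, "%Y-%m-%d %H:%M:%S.%f")
    now = datetime.strptime(now_stamp, "%Y-%m-%d %H:%M:%S")
    return (now - last).total_seconds()


# Last line of osd.1.log, and the `date` output rewritten in the same layout.
gap = stall_seconds("2010-10-04 20:44:56.126037", "2010-10-04 21:01:19")
minutes, seconds = divmod(int(gap), 60)
print(f"no log output for {minutes} min {seconds} s")
```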
Right now it is marked as "down" since it isn't responding to anything.
Killing the OSD doesn't work either; it just keeps hanging.