Bug #461

closed

Hanging OSD during recovery

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%


Description

While my cluster was recovering from a few OSD crashes, one of my OSDs started hanging.

root@node02:~# ps aux|grep cosd
root     14773  9.4 94.9 7929808 3849936 ?     Dsl  20:38   2:04 /usr/bin/cosd -i 1 -c /etc/ceph/ceph.conf
root     15490  0.0  0.0   7672   820 pts/0    D+   21:00   0:00 grep --color=auto cosd
root@node02:~#

As you can see, the OSD is using a lot of memory (almost 95% of RAM, roughly 3.8 GB RSS) and is stuck in uninterruptible I/O wait (state D).
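
A quick way to see where a D-state process like this is blocked in the kernel (assuming root access and a kernel new enough to expose /proc/<pid>/stack, i.e. 2.6.29 or later) would be something like:

cat /proc/14773/wchan    # kernel function the cosd task is sleeping in
cat /proc/14773/stack    # full kernel stack of the blocked task

(14773 is the cosd PID from the ps output above.)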

The logs show:

root@node02:~# date
Mon Oct  4 21:01:19 CEST 2010
root@node02:~# tail /var/log/ceph/osd.1.log
2010-10-04 20:44:55.539151 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.595650 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.689574 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.799051 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.828022 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.859495 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:55.977324 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:56.007724 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:56.068909 7f39ce8b1710 journal throttle: waited for ops
2010-10-04 20:44:56.126037 7f39ce8b1710 journal throttle: waited for ops
root@node02:~#

As you can see, the OSD has been hanging for about 20 minutes now.

Right now it is marked as "down" since it isn't responding to anything.

Killing the OSD doesn't work either; it just keeps hanging.
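
That is probably expected for a task stuck in uninterruptible sleep: a SIGKILL is queued but not delivered until the blocking I/O returns. A minimal illustration, using the PID from the ps output above:

kill -9 14773                   # signal is queued, but cannot interrupt a D-state task
ps -o pid,stat,wchan -p 14773   # state remains D until the I/O completes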

Actions #1

Updated by Wido den Hollander over 13 years ago

  • Status changed from New to Closed

The OSD shut down after about 3 hours, it seems, without logging anything, so we probably won't find whatever caused it to hang like this.

Actions #2

Updated by Wido den Hollander over 13 years ago

While testing #462, I restarted osd6 to see if the cephx problems went away.

During boot, osd6 started to hang too, just like osd1 did.

The last few log lines were:

2010-10-05 06:39:02.796850 7f8f6dd6e720 filestore(/srv/ceph/osd.6) mount btrfs SNAP_DESTROY is supported
2010-10-05 06:39:02.797005 7f8f6dd6e720 filestore(/srv/ceph/osd.6) mount fsid is 179385321
2010-10-05 06:39:02.797114 7f8f6dd6e720 filestore(/srv/ceph/osd.6) mount found snaps <>
2010-10-05 06:39:02.808408 7f8f6dd6e720 filestore(/srv/ceph/osd.6) mount op_seq is 3177306
2010-10-05 06:39:02.808423 7f8f6dd6e720 filestore(/srv/ceph/osd.6) open_journal at /dev/sda6
2010-10-05 06:39:03.087581 7f8f6dd6e720 journal read_entry 579010560 : seq 3177307 1048673 bytes
2010-10-05 06:39:03.089848 7f8f6dd6e720 journal read_entry 579010560 : seq 3177307 1048673 bytes
2010-10-05 06:39:03.089970 7f8f6dd6e720 filestore(/srv/ceph/osd.6) _do_transaction on 0x1066380
2010-10-05 06:39:03.090023 7f8f6dd6e720 filestore(/srv/ceph/osd.6) write /srv/ceph/osd.6/current/temp/rb.0.1.00000000094a_head 2097152~1048576

I then tried a "du -sh" on /srv/ceph/osd.6/current/temp/rb.0.1.00000000094a_head, but that kept hanging too. Might this be a btrfs bug where I/O access hangs?

There are no messages in dmesg at all, neither from btrfs nor from the kernel.
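
If this happens again, one way to check whether btrfs is where the I/O is hanging (assuming SysRq is enabled via kernel.sysrq) would be to dump the stacks of all blocked tasks into the kernel log:

echo w > /proc/sysrq-trigger   # log the kernel stacks of all uninterruptible (D state) tasks
dmesg | tail -n 100            # look for cosd or du stuck in btrfs code paths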
