Bug #802
closed
osd: failing to send heartbeats (btrfs hang?)
Added by Sage Weil about 13 years ago.
Updated about 13 years ago.
Description
mkcephfs -c /etc/ceph/ceph.conf --allhosts -k /etc/ceph/keyring.bin --mkbtrfs
service ceph -a start
watch osds flap and pgs (very) slowly create, peer, repeer, and otherwise creep toward active
- Assignee set to Sage Weil
- Subject changed from osds failing during first cluster startup on sepia to osd: failing to send heartbeats (btrfs hang?)
I think this is the same thing Jim is seeing.
Fixed a few different bugs in this area, although we haven't specifically figured out why Jim was seeing that weird hang in his workload.
- Target version changed from v0.25 to v0.25.1
- Target version changed from v0.25.1 to v0.25.2
- Position set to 2
- Story points set to 3
- Position deleted (4)
- Position set to 4
I noticed that btrfs tends to freeze quite often when I use a journal within the osd tree, or within a separate osd's filesystem. Not using a journal seems to alleviate the problem somewhat, but it still freezes every now and then. Kernels that exhibited this behavior, with ceph 0.24.3 and 0.25.1, were:
kernel-libre-2.6.35.10-74.fc14.1.x86_64
kernel-libre-2.6.35.11-83.fc14.x86_64
kernel-libre-2.6.38-0.rc8.git2.1.fc15.x86_64
A kernel that did NOT exhibit this problem (tested with 0.24.3 only) was:
kernel-libre-2.6.34.8-68.fc13.x86_64
I happen to maintain both the mon and an osd tree on a btrfs shared with other data, and other osds in stand-alone filesystems, and the problem seems to occur more often on the shared ones, particularly those most often modified by applications that take care of fsyncing files. That's all I've been able to gather so far.
- Target version changed from v0.25.2 to v0.25.3
- Target version changed from v0.25.3 to v0.27
- Position deleted (3)
- Position set to 323
- Story points changed from 3 to 5
- Position deleted (323)
- Position set to 323
- Status changed from New to Closed
We're chalking this up to insufficient CPU to handle all of the threads. There are some tricks we could play (bumping thread priorities and so forth) but they don't really address the core issue. Also, if we are too slow to handle IO, we are "failed" in some sense. Any response will need to be carefully thought out... :/
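To illustrate the point about being "failed" in some sense, here is a toy sketch of a heartbeat-based failure detector. It is not Ceph's actual code: the class name, the method names, and the 20-second grace value are all hypothetical, chosen only for illustration. The detector only sees how long ago a peer last checked in, so a peer that is merely starved for CPU looks the same as one that is down.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatMonitor:
    """Toy failure detector (hypothetical, not Ceph's implementation).

    A peer counts as failed once its most recent heartbeat is older
    than the grace period, whatever the reason for the delay.
    """
    grace: float                                   # seconds; illustrative value only
    clock: Callable[[], float] = time.monotonic    # injectable so it can be tested
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, peer: str) -> None:
        # Record the time at which this peer last checked in.
        self.last_seen[peer] = self.clock()

    def failed_peers(self) -> List[str]:
        # A slow peer and a dead peer are indistinguishable here:
        # only the age of the last heartbeat matters.
        now = self.clock()
        return sorted(p for p, t in self.last_seen.items() if now - t > self.grace)
```

In this model, raising thread priorities would only help the heartbeat arrive on time; it would not change the fact that the peer is too slow to serve IO, which is the concern raised above.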