Bug #802

osd: failing to send heartbeats (btrfs hang?)

Added by Sage Weil about 13 years ago. Updated about 13 years ago.

Status: Closed
Priority: High
Category: OSD
% Done: 0%


Description

mkcephfs -c /etc/ceph/ceph.conf --allhosts -k /etc/ceph/keyring.bin --mkbtrfs
service ceph -a start
watch osds flap and pgs (very) slowly create, peer, repeer, and otherwise creep toward active
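
For anyone trying to reproduce this, the flapping can be watched from another terminal with the ceph CLI; a minimal sketch, assuming a monitor is reachable and the admin keyring is in place (exact flags may have differed slightly in the v0.25-era tool, and the log path is the default, not something from this report):

ceph -w                                      # stream cluster status updates; watch osds get marked down and back up
ceph -s                                      # one-shot summary of osd up/in counts and pg states
grep -i heartbeat /var/log/ceph/osd.0.log    # look for missed-heartbeat messages on a suspect osd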

Actions #1

Updated by Sage Weil about 13 years ago

  • Assignee set to Sage Weil
Actions #2

Updated by Sage Weil about 13 years ago

  • Subject changed from "osds failing during first cluster startup on sepia" to "osd: failing to send heartbeats (btrfs hang?)"

I think this is the same thing Jim is seeing.

Actions #3

Updated by Sage Weil about 13 years ago

Fixed a few different bugs in this area, although we haven't specifically figured out why Jim was seeing that weird hang in his workload.

Actions #4

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.25 to v0.25.1
Actions #5

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.25.1 to v0.25.2
Actions #6

Updated by Sage Weil about 13 years ago

  • Position set to 2
Actions #7

Updated by Sage Weil about 13 years ago

  • Story points set to 3
  • Position deleted (4)
  • Position set to 4
Actions #8

Updated by Alexandre Oliva about 13 years ago

I noticed that btrfs tends to freeze quite often when I use a journal within the osd tree, or within a separate osd's filesystem. Not using a journal seems to alleviate the problem somewhat, but it still freezes every now and then. Kernels that exhibited this behavior, with ceph 0.24.3 and 0.25.1, were:

kernel-libre-2.6.35.10-74.fc14.1.x86_64
kernel-libre-2.6.35.11-83.fc14.x86_64
kernel-libre-2.6.38-0.rc8.git2.1.fc15.x86_64

A kernel that did NOT exhibit this problem (tested with 0.24.3 only) was:

kernel-libre-2.6.34.8-68.fc13.x86_64

I happen to maintain both the mon and an osd tree on a btrfs filesystem shared with other data, and other osds on stand-alone filesystems; the problem seems to occur more often on the shared ones, particularly those modified most often by applications that take care to fsync their files. That's all I've been able to gather so far.
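
For context, the journal placement described above is set in ceph.conf; a rough sketch of the two layouts, using the old-style option names from the 0.2x series (hosts, paths, and the journal size are placeholders, not values from this report):

; journal inside the osd data tree (the layout that seemed to hang most)
; osd journal size is in MB
[osd.0]
        host = node0
        osd data = /srv/osd.0
        osd journal = /srv/osd.0/journal
        osd journal size = 512

; journal on a separate device, outside the btrfs tree
[osd.1]
        host = node1
        osd data = /srv/osd.1
        osd journal = /dev/sdc1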

Actions #9

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.25.2 to v0.25.3
Actions #10

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.25.3 to v0.27
  • Position deleted (3)
  • Position set to 323
Actions #11

Updated by Sage Weil about 13 years ago

  • Story points changed from 3 to 5
  • Position deleted (323)
  • Position set to 323
Actions #12

Updated by Sage Weil about 13 years ago

  • Status changed from New to Closed

We're chalking this up to insufficient CPU to handle all of the threads. There are some tricks we could play (bumping thread priorities and so forth) but they don't really address the core issue. Also, if we are too slow to handle IO, we are "failed" in some sense. Any response will need to be carefully thought out... :/
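
For anyone hitting this on an underpowered node, the kind of tricks alluded to above can be approximated from outside the daemon; a rough sketch, with the caveats that the osd daemon in this release series was called cosd (later ceph-osd), the grace value below is arbitrary, and none of this addresses the underlying CPU shortage:

renice -10 -p $(pidof cosd)    # as root: raise the osd daemons' scheduling priority so heartbeat threads get CPU sooner

# or widen the window before a silent osd is treated as failed, in ceph.conf:
#   [osd]
#       osd heartbeat grace = 40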
