Bug #802

osd: failing to send heartbeats (btrfs hang?)

Added by Sage Weil about 13 years ago. Updated about 13 years ago.

Status: Closed
Priority: High
Category: OSD
% Done: 0%


Description

mkcephfs -c /etc/ceph/ceph.conf --allhosts -k /etc/ceph/keyring.bin --mkbtrfs
service ceph -a start
watch osds flap and pgs (very) slowly create, peer, repeer, and otherwise creep toward active
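
For anyone trying to reproduce this, the flapping can be watched from another terminal with the ceph CLI; a minimal sketch, assuming a monitor is reachable and the admin keyring is in place (exact flags may have differed slightly in the v0.25-era tool, and the log path is the default, not something from this report):

ceph -w                                      # stream cluster status updates; watch osds get marked down and back up
ceph -s                                      # one-shot summary of osd up/in counts and pg states
grep -i heartbeat /var/log/ceph/osd.0.log    # look for missed-heartbeat messages on a suspect osd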

Actions #1

Updated by Sage Weil about 13 years ago

  • Assignee set to Sage Weil
Actions #2

Updated by Sage Weil about 13 years ago

  • Subject changed from "osds failing during first cluster startup on sepia" to "osd: failing to send heartbeats (btrfs hang?)"

I think this is the same thing Jim is seeing.

Actions #3

Updated by Sage Weil about 13 years ago

Fixed a few different bugs in this area, although we haven't specifically figured out why Jim was seeing that weird hang in his workload.

Actions #4

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.25 to v0.25.1
Actions #5

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.25.1 to v0.25.2
Actions #6

Updated by Sage Weil about 13 years ago

  • Position set to 2
Actions #7

Updated by Sage Weil about 13 years ago

  • Story points set to 3
  • Position deleted (4)
  • Position set to 4
Actions #8

Updated by Alexandre Oliva about 13 years ago

I noticed that btrfs tends to freeze quite often when I use a journal within the osd tree, or within a separate osd's filesystem. Not using a journal seems to alleviate the problem somewhat, but it still freezes every now and then. Kernels that exhibited this behavior, with ceph 0.24.3 and 0.25.1, were:

kernel-libre-2.6.35.10-74.fc14.1.x86_64
kernel-libre-2.6.35.11-83.fc14.x86_64
kernel-libre-2.6.38-0.rc8.git2.1.fc15.x86_64

A kernel that did NOT exhibit this problem (tested with 0.24.3 only) was:

kernel-libre-2.6.34.8-68.fc13.x86_64

I happen to maintain both the mon and an osd tree on a btrfs filesystem shared with other data, and other osds on stand-alone filesystems; the problem seems to occur more often on the shared ones, particularly those modified most often by applications that take care to fsync their files. That's all I've been able to gather so far.
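
For context, the journal placement described above is set in ceph.conf; a rough sketch of the two layouts, using the old-style option names from the 0.2x series (hosts, paths, and the journal size are placeholders, not values from this report):

; journal inside the osd data tree (the layout that seemed to hang most)
; osd journal size is in MB
[osd.0]
        host = node0
        osd data = /srv/osd.0
        osd journal = /srv/osd.0/journal
        osd journal size = 512

; journal on a separate device, outside the btrfs tree
[osd.1]
        host = node1
        osd data = /srv/osd.1
        osd journal = /dev/sdc1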

Actions #9

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.25.2 to v0.25.3
Actions #10

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.25.3 to v0.27
  • Position deleted (3)
  • Position set to 323
Actions #11

Updated by Sage Weil about 13 years ago

  • Story points changed from 3 to 5
  • Position deleted (323)
  • Position set to 323
Actions #12

Updated by Sage Weil about 13 years ago

  • Status changed from New to Closed

We're chalking this up to insufficient CPU to handle all of the threads. There are some tricks we could play (bumping thread priorities and so forth) but they don't really address the core issue. Also, if we are too slow to handle IO, we are "failed" in some sense. Any response will need to be carefully thought out... :/
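
For anyone hitting this on an underpowered node, the kind of tricks alluded to above can be approximated from outside the daemon; a rough sketch, with the caveats that the osd daemon in this release series was called cosd (later ceph-osd), the grace value below is arbitrary, and none of this addresses the underlying CPU shortage:

renice -10 -p $(pidof cosd)    # as root: raise the osd daemons' scheduling priority so heartbeat threads get CPU sooner

# or widen the window before a silent osd is treated as failed, in ceph.conf:
#   [osd]
#       osd heartbeat grace = 40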
