Bug #735

Manual drive pull testing hangs filesystem

Added by Brian Chrisman over 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: High
Category: OSD
% Done: 0%

Description

It appears that drive failures in my configuration are not making their way up through the stack to kill off the OSDs.
Setup:
- journal and data were two partitions on the same drive (for error isolation)
- 3-node cluster
- 4 SATA disks per node, cosd-per-disk config
- one partition from each disk in an md (RAID 1) root filesystem
- I/O generated continuously throughout testing
- kernel client running alongside the daemons on all nodes
- running code is 0.24rc with a 2.6.37-rc8 kernel
Symptoms:
- the cosd for the pulled drive reported journal errors on its raw-device journal
- the md root filesystem recognized the failure and responded properly
- the cosd servicing the pulled drive did not die and began inflating its memory usage
- the ceph filesystem was unresponsive (waited well over 10 minutes for an ls response on a client)
- with the same setup, if the cosd is killed soon after the drive pull, there are no problems at all
My theories:
- the drive failure is not being converted to a cosd I/O error via btrfs, or the I/O error is being ignored by cosd (a sketch of the behavior I'd expect follows below)
- the cosd memory inflation doesn't really matter on its own, as cosd is expected to exit on error to allow re-peering

I can provide detailed hardware specs if it will help.
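
To illustrate the first theory, here is a minimal, hypothetical probe (not cosd code; the names and layout are my own) that issues synchronous writes to a scratch partition. If the drive is pulled while it runs, I would expect a write to fail with EIO or ENXIO, which is the error cosd should then act on by exiting rather than buffering in memory:

#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Keep writing until the device disappears; treat any non-EINTR failure as
// fatal, which is what I would expect cosd to do with its raw journal device.
static void journal_write(int fd, const void *buf, size_t len) {
  const char *p = static_cast<const char *>(buf);
  while (len > 0) {
    ssize_t r = write(fd, p, len);
    if (r < 0 && errno == EINTR)
      continue;                          // retry interrupted writes
    if (r <= 0) {
      // EIO (or ENXIO after a drive pull) must not be swallowed; reaching the
      // end of the device also stops the probe.
      std::fprintf(stderr, "write failed (ret=%zd): %s, exiting\n",
                   r, std::strerror(errno));
      std::exit(1);                      // die so the cluster can re-peer
    }
    p += r;
    len -= static_cast<size_t>(r);
  }
}

int main(int argc, char **argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s <scratch-partition>\n", argv[0]);
    return 1;
  }
  // O_DSYNC so each write actually reaches the device, like a raw journal.
  int fd = open(argv[1], O_WRONLY | O_DSYNC);
  if (fd < 0) {
    std::fprintf(stderr, "open %s: %s\n", argv[1], std::strerror(errno));
    return 1;
  }
  char buf[4096];
  std::memset(buf, 0, sizeof(buf));
  for (;;)
    journal_write(fd, buf, sizeof(buf)); // pull the drive while this runs
}

Compiled and pointed at an unused partition on the test drive, something like this would show whether the pull actually surfaces an I/O error to userspace, or whether btrfs/the block layer never reports one for cosd to act on.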
