Bug #735

Manual drive pull testing hangs filesystem

Added by Brian Chrisman over 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: High
Category: OSD
% Done: 0%

Description

It appears that drive failures in my configuration are not making their way up through the stack to kill off the OSDs.
Setup:
- journal and data were two partitions on the same drive (for error isolation)
- 3-node cluster
- 4 SATA disks per node, cosd-per-disk config
- one partition from each disk in an md (RAID 1) root filesystem
- I/O generated continuously throughout testing
- kernel client running alongside the daemons on all nodes
- running code is 0.24rc with a 2.6.37-rc8 kernel
Symptoms:
- the cosd for the pulled drive reported journal errors on its raw-device journal
- the md root filesystem recognized the failure and responded properly
- the cosd servicing the pulled drive did not die and began inflating its memory usage
- the ceph filesystem was unresponsive (waited well over 10 minutes for an ls response on a client)
- with the same setup, if the cosd is killed soon after the drive pull, there are no problems at all
My theories:
- the drive failure is not being converted to a cosd I/O error via btrfs, or the I/O error is being ignored by cosd (a sketch of the behavior I'd expect follows below)
- the cosd memory inflation doesn't really matter on its own, as cosd is expected to exit on error to allow re-peering

I can provide detailed hardware specs if it will help.
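
To illustrate the first theory, here is a minimal, hypothetical probe (not cosd code; the names and layout are my own) that issues synchronous writes to a scratch partition. If the drive is pulled while it runs, I would expect a write to fail with EIO or ENXIO, which is the error cosd should then act on by exiting rather than buffering in memory:

#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Keep writing until the device disappears; treat any non-EINTR failure as
// fatal, which is what I would expect cosd to do with its raw journal device.
static void journal_write(int fd, const void *buf, size_t len) {
  const char *p = static_cast<const char *>(buf);
  while (len > 0) {
    ssize_t r = write(fd, p, len);
    if (r < 0 && errno == EINTR)
      continue;                          // retry interrupted writes
    if (r <= 0) {
      // EIO (or ENXIO after a drive pull) must not be swallowed; reaching the
      // end of the device also stops the probe.
      std::fprintf(stderr, "write failed (ret=%zd): %s, exiting\n",
                   r, std::strerror(errno));
      std::exit(1);                      // die so the cluster can re-peer
    }
    p += r;
    len -= static_cast<size_t>(r);
  }
}

int main(int argc, char **argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s <scratch-partition>\n", argv[0]);
    return 1;
  }
  // O_DSYNC so each write actually reaches the device, like a raw journal.
  int fd = open(argv[1], O_WRONLY | O_DSYNC);
  if (fd < 0) {
    std::fprintf(stderr, "open %s: %s\n", argv[1], std::strerror(errno));
    return 1;
  }
  char buf[4096];
  std::memset(buf, 0, sizeof(buf));
  for (;;)
    journal_write(fd, buf, sizeof(buf)); // pull the drive while this runs
}

Compiled and pointed at an unused partition on the test drive, something like this would show whether the pull actually surfaces an I/O error to userspace, or whether btrfs/the block layer never reports one for cosd to act on.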
