Manual drive pull testing hangs filesystem
It appears that drive failure problems in my configuration are not making their way up through the stack to kill off OSDs.
- journal and data were two partitions on the same drive (for error isolation)
- 3 node cluster
- 4 SATA disks per node, cosd-per-disk config
- one partition from each disk in md (raid 1) root filesystem
- I/O generated continuously throughout testing
- kernel client running alongside daemons on all nodes
- running code is a 0.24rc with 2.6.37rc8 kernel
- cosd of pulled drive reported journal errors on raw device journal
- md root filesystem recognized failure and responded properly
- cosd servicing pulled drive did not die and began inflating memory usage
- ceph filesystem unresponsive (waited >> 10 minutes for ls response on client)
- with same setup, if cosd is killed soon after drive pull, no problems at all
- drive failure not being converted to a cosd I/O error by btrfs, or the I/O error ignored by cosd
- cosd memory inflation doesn't really matter, as cosd is expected to exit on error to allow re-peering
I can provide detailed hardware specs if it will help.
#1 Updated by Sage Weil almost 10 years ago
- Category set to OSD
- Priority changed from Normal to High
- Target version set to v0.25
Yep, this is a problem. The errors are causing btrfs operations to hang instead of return error codes.
What should the OSD do in this case? There should probably be a (long) timeout that will trigger a shutdown if the underlying fs becomes (very) unresponsive.
#3 Updated by Brian Chrisman almost 10 years ago
I have a Quarch box in the lab that I was just pointed to. It has an ssh interface to power cycle drives for failure testing (little shims that sit between the drive's SATA/power connectors and the system chassis), so I could automate some additional testing if we come up with a solution.
This seems difficult to solve well without mandating a particular underlying filesystem. If btrfs is required for drive-failure tolerance, then I imagine we can fix btrfs's handling of a drive pull.
The only other way I can think of is to have OSDs be informed of the actual devices they are using underneath, and then watch the sysfs entries for state changes on those drives. This would not detect or deal with filesystem-level problems that aren't the result of a disk problem. For an integrated system/appliance, sysfs monitoring would probably work, but for general use something else is needed.
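As a rough illustration of the sysfs approach (hypothetical helper; the exact sysfs layout varies by driver): poll the device's state file, e.g. /sys/block/sda/device/state, which the SCSI layer sets to something other than "running" when the drive goes away.

```cpp
#include <fstream>
#include <string>

// Hypothetical sketch, not actual Ceph code: check the kernel's view of
// a drive via its sysfs state file (e.g. /sys/block/sda/device/state).
// The path is a parameter so the caller can map an OSD's data device to
// the right sysfs entry; an unreadable file is treated as a pulled drive.
bool device_running(const std::string& state_path) {
    std::ifstream f(state_path);
    std::string state;
    if (!(f >> state))
        return false;          // entry unreadable: drive gone or sysfs absent
    return state == "running"; // SCSI layer reports e.g. "offline" on failure
}
```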
To keep the tradition of allowing any underlying filesystem, a load monitor/timeout should work. A large timeout would let the system become slow under load without triggering an OSD exit. But a better approach might be to monitor activity and exit the osd only if a request times out while there are no I/O completions at all. I didn't notice the cosds going into uninterruptible sleep (they were still killable), so this should work fairly cleanly.