Manual drive pull testing hangs filesystem
It appears that drive failure problems in my configuration are not making their way up through the stack to kill off OSDs.
- journal and data were two partitions on the same drive (for error isolation)
- 3 node cluster
- 4 SATA disks per node, cosd-per-disk config
- one partition from each disk in md (raid 1) root filesystem
- I/O generated continuously throughout testing
- kernel client running alongside daemons on all nodes
- running code is a 0.24rc with 2.6.37rc8 kernel
- cosd of pulled drive reported journal errors on raw device journal
- md root filesystem recognized failure and responded properly
- cosd servicing pulled drive did not die and began inflating memory usage
- ceph filesystem unresponsive (waited >> 10 minutes for ls response on client)
- with same setup, if cosd is killed soon after drive pull, no problems at all
- drive failure not being converted to a cosd I/O error by btrfs, or the I/O error ignored by cosd
- cosd memory inflation doesn't really matter, as cosd is expected to exit on error to allow re-peering
I can provide detailed hardware specs if it will help.
#1 Updated by Sage Weil almost 10 years ago
- Category set to OSD
- Priority changed from Normal to High
- Target version set to v0.25
Yep, this is a problem. The errors are causing btrfs operations to hang instead of return error codes.
What should the OSD do in this case? There should probably be a (long) timeout that will trigger a shutdown if the underlying fs becomes (very) unresponsive.
#3 Updated by Brian Chrisman almost 10 years ago
I have a Quarch box in the lab that I was just pointed to. It has an ssh interface to power cycle drives for failure testing (little shims that sit between the drive's SATA/power connectors and the system chassis), so I could automate some additional testing if we come up with a solution.
This seems difficult to solve well without mandating a particular underlying filesystem. If btrfs is required for drive-failure tolerance, then I imagine we can fix btrfs's handling of a drive pull.
The only other way I can think of is to have OSDs be informed of the actual devices they are using underneath, and then watch the sysfs entries for state changes on those drives. This would not detect or deal with filesystem-level problems that aren't the result of a disk problem. For an integrated system/appliance, sysfs monitoring would probably work, but for general use something else is needed.
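As a rough illustration of the sysfs approach (hypothetical helper; the exact sysfs layout varies by driver): poll the device's state file, e.g. /sys/block/sda/device/state, which the SCSI layer sets to something other than "running" when the drive goes away.

```cpp
#include <fstream>
#include <string>

// Hypothetical sketch, not actual Ceph code: check the kernel's view of
// a drive via its sysfs state file (e.g. /sys/block/sda/device/state).
// The path is a parameter so the caller can map an OSD's data device to
// the right sysfs entry; an unreadable file is treated as a pulled drive.
bool device_running(const std::string& state_path) {
    std::ifstream f(state_path);
    std::string state;
    if (!(f >> state))
        return false;          // entry unreadable: drive gone or sysfs absent
    return state == "running"; // SCSI layer reports e.g. "offline" on failure
}
```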
To keep the tradition of allowing any underlying filesystem, a load monitor/timeout should work. A large timeout would let the system become slow under load without triggering an OSD exit. But a better approach might be to monitor activity and exit the osd only if a request times out while there are no I/O completions at all. I didn't notice the cosds going into uninterruptible sleep (they were still killable), so this should work fairly cleanly.