Feature #231
Slow OSDs shouldn't destroy cluster performance
Status: Closed
Description
Wido was testing on an 8-OSD cluster and getting only ~25MB/s out of Ceph. Running OSD self-tests revealed that a number of them were slower than the others (3 @ ~55MB/s, 1 @ ~48MB/s, 3 @ ~35MB/s, 1 @ ~12MB/s), and by kicking out the slower half his throughput rose to 75MB/s.
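The numbers above already hint at why this happens: a replicated write only completes when the slowest replica acks, so effective throughput is dragged toward the slow tail. A rough back-of-envelope model (this is a toy calculation, not how CRUSH actually places data):

```python
import itertools
import statistics

# Self-test results from the 8-OSD cluster above (MB/s), fastest first
speeds = [55, 55, 55, 48, 35, 35, 35, 12]

def mean_pair_min(osds):
    """Average write speed if each 2x-replicated write is gated by
    the slower of two randomly chosen OSDs."""
    pairs = itertools.combinations(osds, 2)
    return statistics.mean(min(p) for p in pairs)

full = mean_pair_min(speeds)           # all 8 OSDs
fast_half = mean_pair_min(speeds[:4])  # after kicking out the slower half
print(f"all OSDs: {full:.1f} MB/s, fast half only: {fast_half:.1f} MB/s")
# -> all OSDs: 32.8 MB/s, fast half only: 51.5 MB/s
```

Even this crude model reproduces the shape of the observation: one 12MB/s OSD costs the whole cluster far more than one eighth of its throughput.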
It's probably not a good area to focus on now, but eventually we need to:
1) Profile how slow OSDs change cluster behavior and assess their impact.
2) Come up with some way of detecting and notifying about slow OSDs, even if it's only a manual command admins can run when they start seeing slow results.
3) Possibly develop routines for automatically working around these slower OSDs, or at least minimizing the impact they can have (maybe keep track of how long replicas take to reply, and send data to the slower replicas first instead of blindly?).
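Items 2 and 3 could share one mechanism: keep a smoothed per-OSD ack latency, sort the replica send order by it, and flag outliers. A minimal sketch, assuming hypothetical names (this class and its thresholds are not existing Ceph code):

```python
class ReplicaLatencyTracker:
    """Track per-OSD ack latency with an exponential moving average."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # weight given to the newest sample
        self.ewma = {}      # osd id -> smoothed latency (seconds)

    def record(self, osd, latency):
        prev = self.ewma.get(osd)
        if prev is None:
            self.ewma[osd] = latency
        else:
            self.ewma[osd] = self.alpha * latency + (1 - self.alpha) * prev

    def send_order(self, osds):
        """Slowest replicas first, so their writes start earliest."""
        return sorted(osds, key=lambda o: self.ewma.get(o, 0.0), reverse=True)

    def outliers(self, factor=3.0):
        """OSDs whose smoothed latency exceeds factor x the median --
        a candidate 'notify the admin' signal for item 2."""
        if not self.ewma:
            return []
        values = sorted(self.ewma.values())
        median = values[len(values) // 2]
        return [o for o, lat in self.ewma.items() if lat > factor * median]
```

With e.g. `record(1, 0.5)` against two fast peers, `send_order` would push OSD 1 to the front and `outliers()` would report it. The EWMA keeps the bookkeeping O(1) per ack, which matters in the OSD hot path.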
This might be something the LLNL failure profile people can help with.
Updated by Wido den Hollander over 13 years ago
Today I experienced a btrfs bug where [btrfs-transacti] went into state D, causing my OSD to hang (and go into state D as well).
Now, this seems like a btrfs bug (I was running 2.6.35 and just upgraded to 2.6.37 to see if it comes back), but it exposes a problem: I was installing a virtual machine running via qemu-kvm, and the install stalled due to a hanging write to that particular OSD.
There could be numerous scenarios where I/O on an OSD hangs, so this should be handled in some way.
The solution in this case was rebooting the OSD machine (echo b > /proc/sysrq-trigger), but imho the OSD should give up when it notices it can't do I/O.
MDSes give up right now when they notice their load is too high; can't we implement something like that in the OSD?
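The "give up instead of hanging" idea could be sketched as a watchdog thread: the I/O path stamps a timestamp after every completed disk op, and if the stamp goes stale the daemon marks itself down rather than wedging clients. A simplified illustration (the class name, `grace`, and `on_stall` callback are made up for this sketch):

```python
import threading
import time

class IoWatchdog:
    """Call on_stall if no I/O completes within `grace` seconds."""

    def __init__(self, grace, on_stall, poll=0.05):
        self.grace = grace          # max tolerated gap between completed ops
        self.on_stall = on_stall    # e.g. mark the OSD down, or abort
        self.poll = poll
        self._last_io = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def io_completed(self):
        """Called from the I/O path after every successful disk op."""
        self._last_io = time.monotonic()

    def _run(self):
        # Wake up every `poll` seconds; fire once if I/O has gone stale.
        while not self._stop.wait(self.poll):
            if time.monotonic() - self._last_io > self.grace:
                self.on_stall()
                return

    def stop(self):
        self._stop.set()
        self._thread.join()
```

The key property is that a write stuck in state D can't block the watchdog itself, because the watchdog only reads a timestamp; it never touches the disk.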