Feature #231
Slow OSDs shouldn't destroy cluster performance
Status: closed
% Done: 0%
Description
Wido was testing on an 8-OSD cluster and getting only ~25MB/s out of Ceph. Running the OSD self-tests revealed that some OSDs were much slower than others (3 at ~55MB/s, 1 at ~48MB/s, 3 at ~35MB/s, and 1 at ~12MB/s). After kicking out the slower half, his performance rose to 75MB/s.
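As a rough illustration of how self-test numbers like these could be turned into a slow-OSD report, here is a minimal Python sketch. It is not an existing Ceph command: the function name `find_slow_osds`, the OSD ids, and the "fraction of the median" rule are all assumptions made for the example; only the throughput figures come from this ticket.

```python
from statistics import median


def find_slow_osds(throughput_mb_s, ratio=0.75):
    """Return the ids of OSDs whose throughput is below ratio * median.

    throughput_mb_s maps an OSD id to its self-test throughput in MB/s.
    The ratio-of-median cutoff is an assumed heuristic, not Ceph behavior.
    """
    cutoff = ratio * median(throughput_mb_s.values())
    return sorted(osd for osd, mbps in throughput_mb_s.items() if mbps < cutoff)


# Throughput figures from the ticket; the OSD ids 0..7 are made up.
results = {0: 55, 1: 55, 2: 55, 3: 48, 4: 35, 5: 35, 6: 35, 7: 12}

# Median is 41.5 MB/s, so the default cutoff is 31.125 MB/s.
print(find_slow_osds(results))             # [7]
# A stricter cutoff (37.35 MB/s) flags the whole slower half.
print(find_slow_osds(results, ratio=0.9))  # [4, 5, 6, 7]
```

Using the median rather than the mean keeps a single very slow OSD from dragging the cutoff down with it; the right ratio would have to come out of actual profiling.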
This is probably not a good area to focus on right now, but eventually we need to:
1) Profile how slow OSDs change cluster behavior and assess their impact.
2) Come up with some way of detecting slow OSDs and notifying admins about them, even if it's just a manual command admins can run when they start seeing slow results.
3) Possibly develop routines for automatically working around slower OSDs, or at least minimizing their impact (maybe keep track of how long replicas take to reply and send data to the slower replicas first, rather than sending blindly?).
This might be something the LLNL failure profile people can help with.
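The idea in item 3 (track how long replicas take to reply, then send to the slower ones first) could look roughly like the Python sketch below. Everything in it is hypothetical: `ReplicaLatencyTracker`, its methods, and the choice of an exponentially weighted moving average are illustrative assumptions, not how the OSD code works.

```python
class ReplicaLatencyTracker:
    """Keep a moving average of reply latency per replica (hypothetical helper)."""

    def __init__(self, alpha=0.2):
        # alpha is the weight given to the newest sample (assumed value).
        self.alpha = alpha
        self.avg = {}

    def record(self, osd, seconds):
        """Fold one observed reply time into the replica's moving average."""
        prev = self.avg.get(osd)
        if prev is None:
            self.avg[osd] = seconds
        else:
            self.avg[osd] = self.alpha * seconds + (1 - self.alpha) * prev

    def send_order(self, replicas):
        """Order replicas slowest first; replicas with no history go last."""
        return sorted(replicas, key=lambda osd: self.avg.get(osd, 0.0), reverse=True)


tracker = ReplicaLatencyTracker()
tracker.record(1, 0.010)
tracker.record(2, 0.050)
tracker.record(3, 0.020)
print(tracker.send_order([1, 2, 3]))  # [2, 3, 1]
```

Sending to the slowest replica first gives it the longest head start, so the write completes closer to the time the fast replicas need. Whether that actually helps is exactly what the profiling in item 1 would have to show.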