Feature #231

closed

Slow OSDs shouldn't destroy cluster performance

Added by Greg Farnum almost 14 years ago. Updated over 12 years ago.

Status: Rejected
Priority: Low
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

Wido was testing on an 8-OSD cluster and getting only ~25 MB/s out of Ceph. Running the OSD self-tests revealed that half of the OSDs were noticeably slower than the rest (3 at ~55 MB/s, 1 at ~48 MB/s, 3 at ~35 MB/s, 1 at ~12 MB/s). After he kicked out the slower half, his performance rose to 75 MB/s.
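To make those numbers concrete, here is a rough sketch of how slow OSDs could be flagged from self-test throughput. The OSD ids, the `flag_slow_osds` helper, and the 0.9-of-median cutoff are all assumptions made for illustration; none of this is an existing Ceph command or API.

```python
from statistics import median

# Self-test throughput in MB/s, using the figures reported above.
# The OSD ids are hypothetical; the ticket does not say which OSD was which.
bench_mb_s = {0: 55, 1: 55, 2: 55, 3: 48, 4: 35, 5: 35, 6: 35, 7: 12}


def flag_slow_osds(throughput, cutoff=0.9):
    """Return the ids of OSDs whose throughput is below cutoff * median.

    The 0.9 cutoff is an arbitrary assumption, not a tuned value.
    """
    threshold = cutoff * median(throughput.values())
    return sorted(osd for osd, mb_s in throughput.items() if mb_s < threshold)


slow = flag_slow_osds(bench_mb_s)
print(slow)  # [4, 5, 6, 7]
```

With these figures the median is 41.5 MB/s, so the cutoff lands at about 37 MB/s and the four flagged OSDs are the same slower half that was kicked out by hand.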

It's probably not a good area to focus on now, but eventually we need to:
1) Profile how slow OSDs change cluster behavior and assess their impact.
2) Come up with some way of detecting slow OSDs and notifying admins about them, even if it's just a manual command they can run when they start seeing slow results.
3) Possibly develop routines for automatically working around these slower OSDs, or at least minimizing their impact (maybe keep track of how long replicas take to reply and send data to the slower replicas first, instead of sending blindly?).
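The idea in item 3 could look something like the sketch below: keep a smoothed reply time per replica and order sends so the slowest known replica goes first. `ReplyTracker`, its method names, the `alpha` smoothing weight, and the replica names are all hypothetical; this is not how the OSD replication code works today, only an illustration of the bookkeeping involved.

```python
class ReplyTracker:
    """Tracks a smoothed reply latency per replica (hypothetical helper)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # weight given to the newest sample (assumed value)
        self.latency = {}   # replica id -> smoothed reply latency in seconds

    def record(self, replica, seconds):
        """Fold one observed reply time into the replica's moving average."""
        old = self.latency.get(replica)
        if old is None:
            self.latency[replica] = seconds
        else:
            self.latency[replica] = (1 - self.alpha) * old + self.alpha * seconds

    def send_order(self, replicas):
        """Return replicas slowest-first; replicas with no samples go last."""
        return sorted(replicas, key=lambda r: self.latency.get(r, 0.0), reverse=True)


tracker = ReplyTracker()
tracker.record("osd.1", 0.010)
tracker.record("osd.2", 0.050)
tracker.record("osd.2", 0.030)  # smoothed value for osd.2 is now about 0.046 s

order = tracker.send_order(["osd.1", "osd.2", "osd.3"])
print(order)  # ['osd.2', 'osd.1', 'osd.3']
```

A moving average is used here so that one slow reply doesn't permanently mark a replica as slow; whether sending to slow replicas first actually helps is exactly what the profiling in item 1 would need to show.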

This might be something the LLNL failure profile people can help with.
