Feature #49505
Warn about extremely anomalous commit_latencies
Description
In an EC cluster with ~500 HDD OSDs, we suffered a drop in write performance from 30GiB/s down to 3GiB/s due to one sick drive:
# ceph osd perf | sort -n -k3 | tail
 78   50   50
427   50   50
 85   52   52
227   52   52
335   53   53
252   54   54
 30   57   57
455   59   59
186   64   64
256 2306 2306
Moments after stopping osd.256 the write performance returned to normal, 30GiB/s.
While we were debugging this, `ceph status` was reporting plenty of slow requests on tens to hundreds of OSDs, but there were no other warnings that might have pointed to osd.256 as the problem (e.g. the network ping maps all appeared to be OK). We thought to check `osd perf`, and voilà, we found the sick drive that way. (Incidentally, this HDD has a healthy SMART status, so in this case we suspect that the high latency is due to a poor SATA connection.)
To make this type of problem simpler to find, should we raise a health warning when an OSD's commit_latency exceeds N times the mean, e.g. 10x?
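To illustrate the proposed check, here is a minimal sketch in Python that works on the plain-text rows printed by `ceph osd perf`. It is not how Ceph implements anything. The function names, the `factor` default, and the `min_ms` floor are hypothetical choices made for illustration. One deliberate deviation from the wording above is that each OSD is compared against the mean of the *other* OSDs, because a single extreme value inflates the overall mean: in the ten-row sample, the mean including osd.256 is about 280 ms, so 10x that mean would not flag it.

```python
def parse_osd_perf(text):
    """Parse 'osd commit_latency(ms) apply_latency(ms)' rows into {osd_id: commit_ms}."""
    latencies = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not three integers.
        if len(fields) != 3 or not all(f.isdigit() for f in fields):
            continue
        osd_id, commit_ms, _apply_ms = map(int, fields)
        latencies[osd_id] = commit_ms
    return latencies


def anomalous_osds(latencies, factor=10.0, min_ms=100):
    """Return OSD ids whose commit latency exceeds `factor` times the mean of all other OSDs.

    `min_ms` is a hypothetical floor (an assumption, not from the ticket) so that
    an almost idle cluster with near-zero latencies does not produce warnings.
    """
    flagged = []
    n = len(latencies)
    if n < 2:
        return flagged
    total = sum(latencies.values())
    for osd_id, commit_ms in sorted(latencies.items()):
        # Leave-one-out mean: exclude the OSD under test from its own baseline.
        others_mean = (total - commit_ms) / (n - 1)
        if commit_ms >= min_ms and commit_ms > factor * others_mean:
            flagged.append(osd_id)
    return flagged


# The ten rows from the description above.
SAMPLE = """\
 78   50   50
427   50   50
 85   52   52
227   52   52
335   53   53
252   54   54
 30   57   57
455   59   59
186   64   64
256 2306 2306
"""

print(anomalous_osds(parse_osd_perf(SAMPLE)))  # prints [256]
```

On the sample, the other nine OSDs average about 55 ms, so osd.256 at 2306 ms is roughly 42x that baseline and is the only OSD flagged. A median-based baseline would be another reasonable option for the same reason.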
History
#1 Updated by Laura Flores 8 months ago
- Tags set to low-hanging-fruit
#2 Updated by Laura Flores 8 months ago
- Tags set to low-hanging-fruit
- Tags deleted (low-hanging-fruit)