Feature #49505: Warn about extremely anomalous commit_latencies - RADOS - Ceph

Feature #49505

open

Warn about extremely anomalous commit_latencies

Added by Dan van der Ster about 3 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Source:
Tags:
Backport:
Reviewed:
Affected Versions: Ceph - v15.2.8
Component(RADOS):
Pull request ID:

Description

In an EC cluster with ~500 HDD OSDs, we suffered a drop in write performance from 30GiB/s to 3GiB/s due to one sick drive:

# ceph osd perf | sort -n -k3 | tail
 78                  50                 50
427                  50                 50
 85                  52                 52
227                  52                 52
335                  53                 53
252                  54                 54
 30                  57                 57
455                  59                 59
186                  64                 64
256                2306               2306

Moments after stopping osd.256 the write performance returned to normal, 30GiB/s.

While debugging this, `ceph status` reported plenty of slow requests on tens to hundreds of OSDs, but no other warning pointed to osd.256 as the problem (e.g. the network ping maps all appeared OK). We thought to check `osd perf`, and that immediately revealed the sick drive. (By the way, this HDD has a healthy SMART status, so in this case we suspect the high latency is due to a poor SATA connection.)

To make this type of problem easier to find, should we raise a health warning when an OSD's commit_latency exceeds N times the mean, e.g. 10x?
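
As a rough illustration of the proposed check, here is a minimal Python sketch that runs outside Ceph on the plain-text output of `ceph osd perf`. The function names `parse_osd_perf` and `find_latency_outliers` and the `factor` parameter are hypothetical, not part of Ceph. The sketch assumes each data row has three integer columns (osd id, commit latency in ms, apply latency in ms), as in the output above.

```python
from statistics import mean


def parse_osd_perf(text):
    """Parse plain-text `ceph osd perf` rows into {osd_id: commit_latency_ms}."""
    latencies = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows are assumed to be three integers: osd, commit ms, apply ms.
        # Anything else (header, shell prompt, blank line) is skipped.
        if len(fields) != 3 or not all(f.isdigit() for f in fields):
            continue
        osd_id, commit_ms, _apply_ms = (int(f) for f in fields)
        latencies[osd_id] = commit_ms
    return latencies


def find_latency_outliers(latencies, factor=10.0):
    """Return sorted OSD ids whose commit latency exceeds factor * mean."""
    if not latencies:
        return []
    threshold = factor * mean(latencies.values())
    return sorted(osd for osd, ms in latencies.items() if ms > threshold)


# The ten rows shown in the description above.
sample = """
 78                  50                 50
427                  50                 50
 85                  52                 52
227                  52                 52
335                  53                 53
252                  54                 54
 30                  57                 57
455                  59                 59
186                  64                 64
256                2306               2306
"""

latencies = parse_osd_perf(sample)
print(find_latency_outliers(latencies, factor=5))   # [256]
print(find_latency_outliers(latencies, factor=10))  # []
```

On these ten rows alone the mean is 279.7 ms, because the outlier itself makes up most of the sum, so osd.256 is flagged at 5x but not at 10x. Computed over all OSDs in a large cluster, the mean should be dominated by the healthy drives and a 10x threshold becomes much easier to exceed. A real implementation might prefer a statistic less sensitive to the outlier, such as the median, but that is a design choice beyond what is proposed here.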