Feature #62500


Warn on possible disk failures or high load on disks (OSDs)

Added by Ponnuvel P 9 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

In a user environment, we have recently hit an issue in that they've had a bad disk, with the OSD logging the following:
```
2023-07-13T15:17:15.149+0000 7f0681050d80 -1 bdev(0x55d05eff8380 /var/lib/ceph/osd/ceph-175/block) read stalled read 0x29f40370000~100000 (buffered) since 63410177.290546s, timeout is 5.000000s
```

However, this went unnoticed for weeks because there is no discernible warning (e.g. a health warning or an entry in `ceph health detail`).
This led to degraded cluster performance before the bad disk was identified as the cause.

I think we could perhaps turn this into a health warning (i.e. one that appears in `ceph -s` and `ceph health detail`) so that users are alerted to such issues in time.
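As an illustration (not part of the proposal itself), the stalled-read lines above could be detected with a pattern like the following. This is a hypothetical external-monitoring sketch in Python; the regex is derived only from the sample log line above, not from BlueStore's actual log-format specification, and the function name is made up:

```python
import re

# Hypothetical pattern for the bdev stalled-read message shown above.
# Field names are illustrative; real BlueStore log formats may differ.
STALLED_READ_RE = re.compile(
    r"bdev\((?P<dev>[^)]+)\) read stalled read .* "
    r"since (?P<since>[\d.]+)s, timeout is (?P<timeout>[\d.]+)s"
)

def find_stalled_reads(log_lines):
    """Return (device, timeout_seconds) pairs for every stalled-read line."""
    hits = []
    for line in log_lines:
        m = STALLED_READ_RE.search(line)
        if m:
            hits.append((m.group("dev"), float(m.group("timeout"))))
    return hits
```

Something along these lines could feed a per-OSD counter that eventually raises the proposed health warning.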

There are also similar issues reported by BlueStore, such as:

(list isn't comprehensive)

```
2023-07-14T03:31:00.715+0000 7fd75bb70700 0 bluestore(/var/lib/ceph/osd/ceph-175) log_latency_fn slow operation observed for _txc_committed_kv, latency = 12.028621219s, txc = 0x55a107c30f00
```

```
[..] log_latency_fn slow operation observed for upper_bound, latency = 6.25955s [..]
```

```
[..] log_latency slow operation observed for submit_transact
```

which are also potential candidates for the same warning, as they could imply bad disks.

The BlueStore messages can also occur when the OSDs are heavily loaded, when rocksdb
compaction temporarily stalls the disk, or even due to random SCSI resets, which can affect
OSDs even when the disk being affected on the OSD node isn't part of the Ceph cluster at all.

So there's a possibility of reporting false positives which needs to be considered carefully.

We could apply a "threshold": only warn if an OSD hits these conditions, say, x times within a period y. That
would also be useful in itself, as it suggests there's a fundamental load issue needing the admin's attention
even if it's not a bad disk as such.
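The threshold idea could be sketched as a sliding window of events per OSD. A minimal illustration in Python, not the actual Ceph implementation; the class name and default values here are made up and would need tuning:

```python
import time
from collections import defaultdict, deque

class SlowOpThreshold:
    """Flag an OSD only after `count` events within `window` seconds,
    to reduce false positives from transient load spikes."""

    def __init__(self, count=5, window=600.0):
        self.count = count
        self.window = window
        self.events = defaultdict(deque)  # osd_id -> recent event timestamps

    def record(self, osd_id, now=None):
        """Record one slow-op event; return True if the threshold is crossed."""
        now = time.monotonic() if now is None else now
        q = self.events[osd_id]
        q.append(now)
        # Expire events that fell out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) >= self.count
```

With something like this, a single transient stall stays silent, while repeated stalls on the same OSD within the window would surface as a warning.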

But overall I see the benefit of having this info reported to the user in the health status, even if it means adding
two new types of warning, such as:
1. bad disk
2. OSD is slow/overloaded/node-underconfigured

I am looking to implement this. Raising this to track as well as solicit any suggestions/opinions. Thanks.

Actions #1

Updated by Ponnuvel P 9 months ago

  • Assignee set to Ponnuvel P
Actions #2

Updated by Ponnuvel P 3 months ago

  • Assignee deleted (Ponnuvel P)
Actions
