Bug #54182

open

OSD_TOO_MANY_REPAIRS cannot be cleared in >=Octopus

Added by Christian Rohmann about 2 years ago. Updated about 2 months ago.

Status: Fix Under Review
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version:
% Done: 0%


Description

The newly added warning OSD_TOO_MANY_REPAIRS (https://tracker.ceph.com/issues/41564) is raised on a certain count of scrub errors on an OSD.

While this is a very good and helpful feature, there is currently no way to reset this counter in Octopus or later versions of Ceph. According to the documentation, only muting is intended to be used to (temporarily) silence the warning.
This might make sense for scrub errors originating from hardware issues, but the same stat/counter of an OSD is also used for omap errors / inconsistencies.

Side note: I currently have two clusters handling RADOSGW multi-site with many random omap_digest_mismatch errors - https://tracker.ceph.com/issues/53663 - so some of my OSDs have already exceeded this count.

Other users have run into this as well, see e.g. https://www.mail-archive.com/ceph-users@ceph.io/msg06780.html

But regarding the general issue: while I could just mute this type of health warning or raise the warning threshold globally, this defeats the purpose and does not allow targeting an individual OSD in any way. By raising the threshold I would also "raise the bar" for other errors and for other OSDs having errors.
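For reference, the two cluster-wide workarounds mentioned above would look roughly like the sketch below. It assumes that `mon_osd_warn_num_repaired` is the option holding the warning threshold; the value 20 is an arbitrary example. The `run` helper is hypothetical: it only prints each command unless DRY_RUN=0 is set, so nothing touches a cluster by accident.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default,
# execute it only when DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Workaround 1: mute the health code. This hides the warning for every OSD.
run ceph health mute OSD_TOO_MANY_REPAIRS

# Workaround 2: raise the threshold (assumed option name, example value).
# This applies to all OSDs, not just the affected one.
run ceph config set global mon_osd_warn_num_repaired 20
```

Neither command takes an OSD id, which is exactly the limitation: both act on the whole cluster.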

Actually, there even was a command to clear, or explicitly set, the repair count of an OSD via:

 ceph tell osd.# clear_shards_repaired [count]

but that was only implemented for Nautilus (due to the mute feature not being available yet).
See https://github.com/ceph/ceph/commit/1a63a63d411457173670c230b1484a74fef9104a

Please kindly consider (re-)adding this.
