Bug #54182

open

OSD_TOO_MANY_REPAIRS cannot be cleared in >=Octopus

Added by Christian Rohmann about 2 years ago. Updated about 2 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
Administration/Usability
Target version:
% Done:
0%


Description

The newly added warning OSD_TOO_MANY_REPAIRS (https://tracker.ceph.com/issues/41564) is raised on a certain count of scrub errors on an OSD.

While this is a very good and helpful feature, there is currently no way to reset this counter in Octopus or later versions of Ceph. According to the documentation, only muting is intended to be used to (temporarily) silence the warning.
This might make sense for scrub errors originating from hardware issues, but the same per-OSD stat/counter is also used for omap errors / inconsistencies.

Side note: I currently have two clusters handling RADOSGW multi-site with many random omap_digest_mismatch errors - https://tracker.ceph.com/issues/53663 - so some of my OSDs have already exceeded this count.

Other users have also run into this, see e.g. https://www.mail-archive.com/ceph-users@ceph.io/msg06780.html

Regarding the general issue: while I could just mute this type of health warning or raise the warning threshold globally, this defeats the purpose and does not allow targeting an individual OSD in any way. By increasing the threshold I would also "raise the bar" for other errors or other OSDs having errors.
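For reference, the two cluster-wide workarounds mentioned above could be sketched roughly as follows. This is only an illustration: `ceph health mute` and the `mon_osd_warn_num_repaired` option exist in Octopus and later, but the helper names (`run`, `mute_repairs_warning`, `raise_repairs_threshold`), the `DRY_RUN` switch, the `4h` TTL and the threshold value used below are assumptions made for the example.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical helper so the sketch can be tried without a cluster.)
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Silence the warning for the whole cluster for a limited time.
# The default TTL of 4h is an arbitrary example value.
mute_repairs_warning() {
  run ceph health mute OSD_TOO_MANY_REPAIRS "${1:-4h}"
}

# Raise the warning threshold for ALL OSDs at once.
# $1 is the new threshold; there is no per-OSD variant of this setting.
raise_repairs_threshold() {
  run ceph config set global mon_osd_warn_num_repaired "$1"
}
```

With `DRY_RUN=1` set, `raise_repairs_threshold 20` only prints the command it would run. Both helpers act on the entire cluster, which is exactly the limitation described above.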

Actually, there even was a command to clear, or even set, the repair count of an OSD via:

 ceph tell osd.# clear_shards_repaired [count]

but that was only implemented for Nautilus (due to the mute feature not being available yet).
See https://github.com/ceph/ceph/commit/1a63a63d411457173670c230b1484a74fef9104a
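As a rough sketch of how the per-OSD reset would be used on a release that ships the command (invoked through `ceph tell`): the wrapper below resets the count when called with only an OSD id and sets it to a given value when a count is passed. The helper names, the `DRY_RUN` switch and the OSD id 12 are assumptions for illustration; on Octopus and later, as reported here, the command is not available.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical helper so the sketch can be tried without a cluster.)
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# clear_shards_repaired OSD_ID [COUNT]
# Without COUNT the OSD's repair counter is cleared;
# with COUNT the counter is set to that value.
clear_shards_repaired() {
  if [ -n "${2:-}" ]; then
    run ceph tell "osd.$1" clear_shards_repaired "$2"
  else
    run ceph tell "osd.$1" clear_shards_repaired
  fi
}
```

For example, `DRY_RUN=1 clear_shards_repaired 12` prints the command for osd.12 without contacting the cluster. Unlike muting or raising the threshold, this only touches the one OSD that is named.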

Please kindly consider (re-)adding this.

Actions #1

Updated by Neha Ojha about 2 years ago

  • Tags set to low-hanging-fruit
  • Backport set to quincy,pacific, octopus

We can include clear_shards_repaired in master and backport it.

Actions #2

Updated by Christian Rohmann almost 2 years ago

I just observed this issue once more and had forgotten to mention that restarting an OSD actually resets this counter. So this is NOT persistent information and not a counter over the whole lifetime of an OSD. This also somewhat defeats the purpose when OSDs are restarted quite regularly, e.g. when applying OS or kernel updates or when running cloud-natively via Rook on Kubernetes.

Actions #3

Updated by Laura Flores almost 2 years ago

  • Tags set to low-hanging-fruit
Actions #4

Updated by Daniel R 2 months ago

I published a PR to re-add this feature a couple of months ago: https://github.com/ceph/ceph/pull/54954

Actions #5

Updated by Konstantin Shalygin 2 months ago

  • Status changed from New to Fix Under Review
  • Target version set to v19.0.0
  • Source set to Community (user)
  • Pull request ID set to 54954
Actions #6

Updated by Radoslaw Zarzynski 2 months ago

Bump up.

Actions #7

Updated by Radoslaw Zarzynski about 2 months ago

note from bug scrub: reviewed, changes requested.

Actions #8

Updated by Konstantin Shalygin about 2 months ago

  • Backport changed from quincy,pacific, octopus to squid reef quincy
Actions #9

Updated by Radoslaw Zarzynski about 2 months ago

Review in progress.
