Bug #27988

closed

Warn if queue of scrubs ready to run exceeds some threshold

Added by David Zafman over 5 years ago. Updated over 3 years ago.

Status: Rejected
Priority: Normal
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The sched_scrub_pg set could be scanned during a new insert and the number of scrubs that are ready to be run could be counted and compared to some threshold. It would be nice if this triggered a monitor health warning.
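A minimal sketch of the idea, not Ceph code: it models the scrub schedule as a list kept sorted by scheduled time, scans it on every insert, and reports whether the number of ready scrubs exceeds a threshold. The names `ScrubQueue`, `warn_threshold`, and `count_ready` are hypothetical, and the health warning itself is reduced to a boolean.

```python
import bisect


class ScrubQueue:
    """Toy stand-in for an ordered scrub schedule (hypothetical, not Ceph's type)."""

    def __init__(self, warn_threshold):
        self.warn_threshold = warn_threshold  # assumed configurable limit
        self._jobs = []  # list of (sched_time, pgid), kept sorted

    def count_ready(self, now):
        """Count jobs whose scheduled time has already passed."""
        ready = 0
        for sched_time, _pgid in self._jobs:
            if sched_time > now:
                break  # list is sorted, so no later entry is ready either
            ready += 1
        return ready

    def insert(self, sched_time, pgid, now):
        """Insert a job, then scan and return (ready_count, should_warn)."""
        bisect.insort(self._jobs, (sched_time, pgid))
        ready = self.count_ready(now)
        return ready, ready > self.warn_threshold


queue = ScrubQueue(warn_threshold=2)
queue.insert(10, "1.a", now=100)
queue.insert(200, "1.b", now=100)   # not yet due, so not counted
queue.insert(20, "1.c", now=100)
ready, should_warn = queue.insert(30, "1.d", now=100)
```

Because the entries are ordered by scheduled time, the scan can stop at the first job that is not yet due, so its cost grows with the number of ready scrubs rather than with the size of the whole set.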


Related issues 3 (0 open, 3 closed)

Related to RADOS - Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair (Can't reproduce, David Zafman, 04/06/2018)
Related to RADOS - Bug #37269: Prioritize user specified scrubs (Resolved, David Zafman, 11/14/2018)
Related to RADOS - Bug #37264: scrub warning check incorrectly uses mon scrub interval (Resolved, David Zafman, 11/14/2018)

Actions #1

Updated by David Zafman over 5 years ago

  • Related to Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair added
Actions #2

Updated by David Zafman over 5 years ago

  • Subject changed from Warn if queue of scrubs exceeds some threshold to Warn if queue of scrubs ready to run exceeds some threshold
Actions #3

Updated by David Turner over 5 years ago

Talking with Sage, he believes there is already a warning status if you have scrubs that haven't run for more than 2x your interval. My experience in the related ticket was with a 30 day deep scrub interval and the repair was happening 3 weeks after it was issued. That indicates that I was within the existing 2x warning threshold but definitely beyond a healthy state.
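To make the arithmetic concrete, here is a small sketch under two assumptions: that the existing warning fires only once the time since the last deep scrub exceeds 2x the interval (as recalled above, not verified against the monitor code), and that the PG had just become due when the repair was requested. The function name `exceeds_warning` is hypothetical.

```python
def exceeds_warning(days_since_deep_scrub, interval_days, factor=2.0):
    """True if the PG is past the assumed factor-times-interval threshold."""
    return days_since_deep_scrub > factor * interval_days


interval_days = 30                        # deep scrub interval from the comment
wait_days = 21                            # repair ran 3 weeks after being issued
age_days = interval_days + wait_days      # assumption: PG was just due when requested
threshold_days = 2.0 * interval_days

warned = exceeds_warning(age_days, interval_days)
```

Under those assumptions the PG is 51 days past its last deep scrub against a 60 day threshold, so no warning is raised even though a user request has been waiting for three weeks.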

Another idea that would help is to prioritize user submitted operations higher than automatically scheduled ones due to exceeding intervals.

Actions #4

Updated by David Zafman over 5 years ago

I want to fix 3 things here. First, user-submitted scrubs are queued as due immediately, but overdue scrubs are still prioritized ahead of them. I want user-submitted scrubs to run before all others. Second, I'd like to get a warning when too many scrubs are overdue. This could occur because too many user-submitted scrubs are requested all at once, or because the system as configured cannot keep up with the scrub demands. This warning could be disabled by default. Finally, the code to warn about overdue scrubs in the monitor is broken. It confuses the monitor's own scrubbing interval with pg scrubbing. It shouldn't use mon_scrub_interval but rather osd_scrub_min_interval/osd_deep_scrub_interval when assessing how overdue scrubbing has gotten. Also, what about osd_scrub_max_interval? Also, should mon_warn_not_scrubbed and mon_warn_not_deep_scrubbed be renamed to mon_warn_pg_not_scrubbed and mon_warn_pg_not_deep_scrubbed respectively?
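The first point can be illustrated with a short sketch, which is not the actual OSD scheduling code. It compares ordering by scheduled time alone, where overdue periodic scrubs beat a user request queued as due now, with an ordering that puts user-requested scrubs first. `ScrubJob`, `user_requested`, and both ordering functions are hypothetical names.

```python
from typing import NamedTuple


class ScrubJob(NamedTuple):
    pgid: str
    sched_time: float
    user_requested: bool = False


def order_by_time(jobs):
    """Ordering by scheduled time only: older (overdue) jobs run first."""
    return sorted(jobs, key=lambda j: j.sched_time)


def order_user_first(jobs):
    """User-requested jobs first; within each group, earliest scheduled first."""
    return sorted(jobs, key=lambda j: (not j.user_requested, j.sched_time))


now = 100
jobs = [
    ScrubJob("1.a", sched_time=50),                        # overdue periodic scrub
    ScrubJob("1.b", sched_time=now, user_requested=True),  # user request, due now
    ScrubJob("1.c", sched_time=10),                        # even more overdue
]

by_time = [j.pgid for j in order_by_time(jobs)]
user_first = [j.pgid for j in order_user_first(jobs)]
```

With time-only ordering the user request for `1.b` runs last, behind both overdue scrubs. With the composite key it runs first, and the overdue scrubs keep their relative order.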

Actions #5

Updated by David Zafman over 5 years ago

  • Status changed from New to In Progress
Actions #6

Updated by David Zafman over 5 years ago

  • Related to Bug #37269: Prioritize user specified scrubs added
Actions #7

Updated by David Zafman over 5 years ago

  • Related to Bug #37264: scrub warning check incorrectly uses mon scrub interval added
Actions #8

Updated by David Zafman about 5 years ago

  • Status changed from In Progress to Need More Info

This is on the back burner until we decide what to do next.

Actions #9

Updated by David Zafman about 5 years ago

  • Pull request ID set to 23848
Actions #10

Updated by David Zafman over 3 years ago

  • Status changed from Need More Info to Rejected
  • Pull request ID deleted (23848)

This was already handled in a different but reasonable way by https://github.com/ceph/ceph/pull/15643 and refined by other changes.
