Feature #10796


Schedule scrubbing by considering PG's last_scrub_timestamp globally

Added by Guang Yang about 9 years ago. Updated about 9 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Copied from an email thread on ceph-devel to open a tracker:

Hi Sage,
Another potential problem with scrub scheduling, observed in our deployment (a 2 PB cluster, 70% full), is that some PGs had not been scrubbed for 1.5 months, even though we are configured to deep scrub weekly.

Given our deployment size and how full the cluster is, together with the conservative scrub setting (osd_max_scrubs = 1), one round of scrubbing cannot finish within one week, so we should probably schedule deep scrubbing monthly (with weekly shallow scrubbing).
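A schedule along those lines could be expressed with the standard Ceph scrub options; the interval values below are illustrative, not the values used in the deployment described above:

```ini
[osd]
osd max scrubs = 1                 ; at most one active scrub per OSD
osd scrub min interval = 604800    ; shallow scrub roughly weekly (seconds)
osd deep scrub interval = 2592000  ; deep scrub roughly monthly (seconds)
```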

Another problem is that scrub scheduling is currently optimized locally at each OSD: among the PGs for which this OSD is the primary, it selects the one that has gone longest without being scrubbed, makes it the candidate, and requests scrub reservations from all replicas. Since each OSD can run only one active scrub, that slot can be perpetually occupied by a replica-side scrub; as a result, the PGs whose primary is this OSD fail to schedule and are left behind.
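The starvation can be seen in a minimal model of that local scheduling loop (illustrative names, not the actual Ceph OSD code): each tick, the OSD picks its most-overdue primary PG, but only if the single scrub slot is free, and a replica-side scrub counts against that slot.

```cpp
#include <map>
#include <string>

// Toy model of per-OSD scrub scheduling (hypothetical sketch, not Ceph code).
struct OsdModel {
  int max_scrubs = 1;      // osd_max_scrubs
  int active_scrubs = 0;   // includes scrubs where we are only a replica
  std::map<std::string, double> primary_pg_last_scrub;  // pgid -> timestamp

  // One scheduling tick: start scrubbing the primary PG with the oldest
  // last-scrub timestamp, unless the slot is already occupied.
  // Returns the PG started, or "" if nothing could be scheduled.
  std::string tick() {
    if (active_scrubs >= max_scrubs || primary_pg_last_scrub.empty())
      return "";  // slot held (possibly by a replica scrub): local PG starved
    auto candidate = primary_pg_last_scrub.begin();
    for (auto it = primary_pg_last_scrub.begin();
         it != primary_pg_last_scrub.end(); ++it)
      if (it->second < candidate->second)
        candidate = it;
    ++active_scrubs;
    return candidate->first;
  }
};
```

If a replica scrub holds the slot on every tick, `tick()` keeps returning nothing and this OSD's own primary PGs never get scheduled, which is the behavior reported above.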

Is this issue worth an enhancement?
[sage] Good point. Yeah, I think it's definitely worth fixing!


Related issues (1 open, 0 closed)

Related to RADOS - Feature #10931: osd: better scrub scheduling (New, 02/23/2015)

Actions #1

Updated by Sage Weil about 9 years ago

  • Tracker changed from Bug to Feature
  • Target version set to v0.94
Actions #2

Updated by Guang Yang about 9 years ago

The easiest way I can think of is to add a timestamp field to the reservation request. The OSD handling the request would then check that timestamp against its own primary PGs, and reject the reservation if the requester has a later (less overdue) timestamp, so that the local tick gets a chance to schedule its own scrub.
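A minimal sketch of that check (the function and variable names here are illustrative, not the actual Ceph reservation-handling code):

```cpp
#include <algorithm>
#include <vector>

using timestamp_t = double;  // stand-in for Ceph's utime_t

// Hypothetical replica-side check for a scrub reservation request.
// Grant the reservation only if the requesting PG is at least as overdue
// as the most-overdue PG for which *this* OSD is primary; otherwise
// reject, keeping the scrub slot free for the local tick to use.
bool grant_scrub_reservation(
    timestamp_t requester_last_scrub,
    const std::vector<timestamp_t>& my_primary_last_scrubs) {
  if (my_primary_last_scrubs.empty())
    return true;  // no local primary PGs competing for the slot
  timestamp_t oldest_local = *std::min_element(
      my_primary_last_scrubs.begin(), my_primary_last_scrubs.end());
  return requester_last_scrub <= oldest_local;
}
```

With this check, the single scrub slot tends to go to whichever PG has waited longest cluster-wide, rather than to whichever primary happens to win the reservation race.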

Does that look correct?

Actions #3

Updated by Samuel Just about 9 years ago

  • Assignee set to Guang Yang
Actions #4

Updated by Samuel Just about 9 years ago

That sounds both simple and effective!

Actions #5

Updated by Guang Yang about 9 years ago

Samuel Just wrote:

That sounds both simple and effective!

https://github.com/ceph/ceph/pull/3733

Actions #6

Updated by Guang Yang about 9 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Samuel Just about 9 years ago

  • Target version changed from v0.94 to v0.95
Actions #8

Updated by Samuel Just about 9 years ago

  • Target version changed from v0.95 to v9.0.2
Actions #9

Updated by Samuel Just about 9 years ago

  • Target version deleted (v9.0.2)