Feature #52609

New PG states for pending scrubs / repairs

Added by Michael Kidd over 2 years ago. Updated over 2 years ago.

Status:
Fix Under Review
Priority:
Normal
Category:
Scrub/Repair
Target version:
-
% Done:

0%

Source:
Support
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
pgmap
Pull request ID:

Description

Request to add new PG states that give the admin feedback when a PG scrub/repair is scheduled (via the command line, or a deep-scrub triggered after an inconsistency is identified during a scrub). At present it is difficult (impossible without reading debug logs) to tell when, and for which PG, a scrub/repair is scheduled, which can leave an admin unsure whether their issued command is being honored by the cluster.

Possible new states:
deepscrub_wait
scrub_wait
repair_wait
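
If states like these were added, an admin could spot pending operations directly from the state strings reported by `ceph pg dump`. A minimal sketch, assuming the hypothetical state names above would appear as components of a PG's `state` field (the `pg_stats` list below only mimics the shape of `ceph pg dump --format json` output; none of these states exist in Ceph today):

```python
# Proposed (hypothetical) pending states from this ticket.
PENDING_STATES = {"deepscrub_wait", "scrub_wait", "repair_wait"}

def pending_pgs(pg_stats):
    """Return (pgid, state) pairs for PGs in a proposed pending state.

    PG state strings are '+'-joined components, e.g. 'active+clean'.
    """
    return [(pg["pgid"], pg["state"]) for pg in pg_stats
            if PENDING_STATES & set(pg["state"].split("+"))]

# Mock of the per-PG records a dump would contain.
pg_stats = [
    {"pgid": "2.4", "state": "active+clean+scrub_wait"},
    {"pgid": "2.5", "state": "active+clean"},
]
print(pending_pgs(pg_stats))  # [('2.4', 'active+clean+scrub_wait')]
```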

History

#1 Updated by Ronen Friedman over 2 years ago

Is the following a good enough solution?

Neha Ojha, Josh Durgin, Sam Just - what do you think?

I have drafted a modification to the (post-PR#40984-merge) code:
the '.scrubber' section of the 'pg xx query' output changes
to include scheduling data. Specifically:

(https://github.com/ronen-fr/ceph/commit/354ae990b489eeb717fccfdfde56bd06a403721d )

When the scrubber is not active:
bin/ceph pg 2.4 query | jq '.scrubber'

  {
    "active": false,
    "must_scrub": false,
    "must_deep_scrub": false,
    "must_repair": false,
    "need_auto": false,
    "scrub_reg_stamp": "2021-09-26T03:24:51.429331+0000",
    "schedule": "scrub scheduled @ 2021-09-26T03:24:51.429331+0000" 
  }

Note - there is no 'is-time-for-deep' field, as this flag is only ever set when scrubbing
starts.

After a deep-scrub request (but before the 'scrub' command):

{
  "active": false,
  "must_scrub": false,
  "must_deep_scrub": false,
  "must_repair": false,
  "need_auto": false,
  "scrub_reg_stamp": "2021-09-26T12:41:21.856538+0000",
  "schedule": "deep scrub scheduled @ 2021-09-26T12:41:21.856538+0000" 
}

When ripe for scrubbing (either because of one of the 'must' flags, or
when reaching the scheduled time):

{
  "active": false,
  "must_scrub": false,
  "must_deep_scrub": false,
  "must_repair": false,
  "need_auto": false,
  "scrub_reg_stamp": "2021-09-19T19:51:47.491221+0000",
  "schedule": "queued for scrub" 
}

or

{
  "active": false,
  "must_scrub": false,
  "must_deep_scrub": false,
  "must_repair": false,
  "need_auto": false,
  "scrub_reg_stamp": "2021-09-19T12:10:00.013855+0000",
  "schedule": "queued for deep scrub" 
}

When deep-scrubbing:

{
  "active": true,
  "epoch_start": "9",
  "start": "2:a0000000::::0",
  "end": "2:a0000000::::0",
  "m_max_end": "MIN",
  "subset_last_update": "0'0",
  "deep": true,
  "req_scrub": false,
  "auto_repair": false,
  "check_repair": false,
  "deep_scrub_on_error": false,
  "priority": 5,
  "shallow_errors": 0,
  "deep_errors": 0,
  "fixed": 0,
  "waiting_on_whom": []
}
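
The examples above can be collapsed into a one-line status per PG. A sketch, assuming the field names shown in the draft output (the `schedule` string when idle, `active`/`deep` when scrubbing):

```python
import json

def scrub_status(scrubber):
    """One-line summary of a PG's '.scrubber' section, per the draft
    'pg xx query' output illustrated above."""
    if scrubber.get("active"):
        return "deep-scrubbing" if scrubber.get("deep") else "scrubbing"
    # When idle, the draft exposes a human-readable 'schedule' string.
    return scrubber.get("schedule", "unknown")

# The second sample from the comment above (deep scrub requested, idle).
idle = json.loads('''{
  "active": false, "must_scrub": false, "must_deep_scrub": false,
  "must_repair": false, "need_auto": false,
  "scrub_reg_stamp": "2021-09-26T12:41:21.856538+0000",
  "schedule": "deep scrub scheduled @ 2021-09-26T12:41:21.856538+0000"
}''')
print(scrub_status(idle))
# deep scrub scheduled @ 2021-09-26T12:41:21.856538+0000
```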

#2 Updated by Samuel Just over 2 years ago

That schedule element seems like a pretty reasonable human-readable summary.

#3 Updated by Michael Kidd over 2 years ago

While I agree the format is readable, it's a bit narrow in application.

Would it be a significant undertaking to:
a: Add a new '_wait' or '_scheduled' state to the PG status so it's visible in `ceph -s`?
b: Add a new column to `ceph pg dump` which would provide the scheduled state?

Either of these would greatly increase visibility for an admin. As it stands, the admin would potentially need to 'query' every PG to build a summary for reporting / alerting purposes.
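
The per-PG roll-up described above could look like the following sketch. It assumes the `.scrubber` sections have already been gathered (e.g. one `ceph pg <pgid> query` per PG) and only aggregates them; the input dicts here are hand-written stand-ins for that output:

```python
from collections import Counter

def schedule_summary(scrubber_sections):
    """Count PGs by scrub-schedule kind, the roll-up an admin would
    otherwise assemble by hand from per-PG 'query' output."""
    def kind(s):
        if s.get("active"):
            return "deep-scrubbing" if s.get("deep") else "scrubbing"
        # Collapse timestamped strings ("scrub scheduled @ ...") to a kind.
        return s.get("schedule", "unknown").split(" @ ")[0]
    return Counter(kind(s) for s in scrubber_sections)

# Stand-in data shaped like the '.scrubber' examples in comment #1.
sections = [
    {"active": False, "schedule": "scrub scheduled @ 2021-09-26T03:24:51+0000"},
    {"active": False, "schedule": "queued for deep scrub"},
    {"active": True, "deep": True},
]
print(schedule_summary(sections))
```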

Thanks!

#4 Updated by Neha Ojha over 2 years ago

  • Status changed from New to In Progress
  • Pull request ID set to 43403

#5 Updated by Ronen Friedman over 2 years ago

  • Status changed from In Progress to Fix Under Review

Please see the updated proposed change (https://github.com/ceph/ceph/pull/43403 - new comment from today).
I hope it answers the request.

That PR is ready for review.

#6 Updated by Michael Kidd over 2 years ago

Looks good to me! Thanks Ronen
