Feature #9328

osd: generalize the scrub workflow

Added by Loïc Dachary over 9 years ago. Updated over 4 years ago.

Status: New
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS): OSD
Pull request ID:

Description

The scrub workflow collects information and uses it. It starts when the PG enters the scrubbing state and ends when the PG leaves it. It would be convenient to generalize the workflow so that it can be called from every part of the code that could contribute to it.

For instance (this is what happened in #8914), when get_omap_iterator returns an empty pointer because the underlying file no longer exists in the FileStore, the caller should skip the object and tell scrubbing about this inconsistency.
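The idea can be illustrated with a toy model. The sketch below is not Ceph code: ScrubSink, InconsistencyReport, collect_omap and this get_omap_iterator are hypothetical Python stand-ins, and the store is a plain dictionary. It only shows the shape of the proposal, where a code path that finds a missing omap iterator skips the object and reports it to a shared collector instead of asserting.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class InconsistencyReport:
    """Hypothetical record of one problem noticed outside a scrub."""
    object_id: str
    source: str  # name of the code path that noticed the problem
    reason: str


@dataclass
class ScrubSink:
    """Hypothetical collector that any code path can report into."""
    reports: List[InconsistencyReport] = field(default_factory=list)

    def note_inconsistent(self, object_id: str, source: str, reason: str) -> None:
        self.reports.append(InconsistencyReport(object_id, source, reason))


def get_omap_iterator(
    store: Dict[str, Dict[str, bytes]], object_id: str
) -> Optional[Iterator[Tuple[str, bytes]]]:
    """Toy stand-in: returns None when the backing file no longer exists."""
    omap = store.get(object_id)
    return None if omap is None else iter(sorted(omap.items()))


def collect_omap(
    store: Dict[str, Dict[str, bytes]], object_ids: List[str], sink: ScrubSink
) -> Dict[str, Dict[str, bytes]]:
    """Read omap entries, skipping objects whose iterator is missing."""
    result: Dict[str, Dict[str, bytes]] = {}
    for oid in object_ids:
        it = get_omap_iterator(store, oid)
        if it is None:
            # Skip the object and report it instead of asserting.
            sink.note_inconsistent(oid, "collect_omap", "omap iterator is empty")
            continue
        result[oid] = dict(it)
    return result
```

In this model, calling collect_omap on a store that lacks one of the requested objects returns the entries for the objects that exist and leaves one report in the sink for the one that does not.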


Related issues

Related to Ceph - Fix #8914: osd crashed at assert ReplicatedBackend::build_push_op Resolved 07/24/2014
Related to RADOS - Feature #4604: osd: read path should detect EIO and initiate repair New 04/01/2013
Related to Ceph - Bug #8588: In the erasure-coded pool, primary OSD will crash at decoding if any data chunk's size is changed Duplicate 06/11/2014

History

#1 Updated by Loïc Dachary over 9 years ago

  • Priority changed from Normal to High
<sjusthm> adapt build_push_op to do what scrub does when it discovers an inconsistent object
<sjusthm> ideally, you want to make that path fairly general so we can hook any other places where the osd sees something inconsistent into the same thing
<sjusthm> the basic concept is you will add the object to the primary's missing set
<sjusthm> and then run recovery on it
<sjusthm> making sure to update the missing_loc machinery
<sjusthm> you can look at the repair code for how that currently works
<sjusthm> after that, we will want to give the replicas a way to propagate "oh crap, I don't have this object I'm supposed to have" back to the primary where the primary can then mark its local copy of the replica missing set to be missing that object and run recovery
<sjusthm> the long term goal is for the primary and replicas whenever they find something obviously wrong with an object to be able to behave as if the inconsistency was discovered through scrub/repair
<sjusthm> instead of just crashing
<sjusthm> the path should also spam the central log since it might be an indication of a flaky disk
<sjusthm> the bug here is that build_push_op can't deal with an object which, according to the missing set, should be there but isn't
<sjusthm> the missing set does not contain that object
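
The steps outlined in the chat above (add the object to the missing set, update missing_loc, log centrally, then run recovery) can be sketched as a toy model. Everything here is an assumption made for illustration: ToyPG, note_inconsistent and recover are hypothetical names, OSDs are plain strings, and recovery is reduced to copying an object name between sets. It does not reflect Ceph's actual PG, missing set, or missing_loc classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ToyPG:
    """Toy model of the primary's view of a placement group (not Ceph's PG)."""
    # osd name -> objects that osd holds a good copy of
    copies: Dict[str, Set[str]]
    # osd name -> objects known to be missing on that osd
    missing: Dict[str, Set[str]] = field(default_factory=dict)
    # object -> osds that still hold a usable copy
    missing_loc: Dict[str, Set[str]] = field(default_factory=dict)
    # stand-in for the central (cluster) log
    cluster_log: List[str] = field(default_factory=list)

    def note_inconsistent(self, osd: str, oid: str) -> None:
        """Record that `osd` found `oid` absent or damaged.

        The same call models the primary noticing a local problem and a
        replica reporting one back to the primary.
        """
        self.copies[osd].discard(oid)
        self.missing.setdefault(osd, set()).add(oid)
        # Recompute where a good copy can still be found.
        self.missing_loc[oid] = {
            o for o, objs in self.copies.items() if oid in objs
        }
        # Log loudly: this may indicate a flaky disk.
        self.cluster_log.append(f"{osd}: {oid} inconsistent, queued for recovery")

    def recover(self) -> List[Tuple[str, str]]:
        """Restore every missing object that has at least one good source."""
        recovered: List[Tuple[str, str]] = []
        for osd, objs in self.missing.items():
            for oid in sorted(objs):
                if self.missing_loc.get(oid):
                    self.copies[osd].add(oid)
                    recovered.append((osd, oid))
        for osd, oid in recovered:
            self.missing[osd].discard(oid)
        return recovered
```

With two OSDs that both hold obj1, calling note_inconsistent("osd.0", "obj1") puts obj1 in the missing set of osd.0 and records osd.1 as its only source in missing_loc. A later recover() then restores the copy on osd.0. An object with no remaining good copy stays in the missing set, which is the case a real implementation would have to surface as unfound.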

#2 Updated by Patrick Donnelly over 4 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from generalize the scrub workflow to osd: generalize the scrub workflow
  • Category deleted (OSD)
  • Start date deleted (09/03/2014)
  • Component(RADOS) OSD added
