Osd - Scrub and Repair


The current scrub and repair implementation is fairly primitive. Several improvements are needed:
1) There needs to be a way to query the results of the most recent scrub on a pg.
2) The user should be able to query the contents of the replica objects in the event of an inconsistency (including data payload, xattrs, and omap). This could probably co-opt the existing replica read machinery.
3) The user should be able to specify which replica to use for repair using the above information.

Owners

  • Samuel Just (Red Hat)

Interested Parties

  • Guang Yang (Yahoo!)
  • Loic Dachary (Red Hat)
  • Danny Al-Gaaf (Deutsche Telekom)
  • Shinobu Kinjo (Red Hat)

Current Status

Scrub and repair mechanisms already exist; this blueprint aims to expand and improve them.

Detailed Description

On the OSD side, the first change is that the primary needs to track inconsistency information as the scrub progresses. Because this could involve a large number of objects (though it probably will not), we do not want to keep it in memory. I suggest storing it in a per-pg scratch object that is cleared during peering reset. The machinery used for the SnapMapper object can be reused to maintain a cache of unstable keys.
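As a rough illustration of the storage scheme above, the sketch below models the scratch object's omap as a std::map and keeps a second map of "unstable" keys: records that have been queued but whose write has not yet committed. Reads consult the unstable cache first, and a peering reset discards everything. The class name ScrubResultStore, its methods, and the use of an opaque encoded string for each record are all hypothetical; this is not the SnapMapper code, only a sketch of the idea under those assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-ins: objects are keyed by (locator, object name), as in
// the proposed librados calls, and each record is an opaque encoded blob.
using ObjectKey = std::pair<std::string, std::string>;
using EncodedInfo = std::string;

// Sketch of a per-pg scratch store for scrub inconsistencies.
// durable_ models the scratch object's omap; unstable_ models the cache of
// keys whose writes are queued but not yet committed.
class ScrubResultStore {
 public:
  // Record an inconsistency found during scrub. It is visible to readers
  // immediately but only becomes durable once commit() runs.
  void record(const ObjectKey& key, EncodedInfo info) {
    unstable_[key] = std::move(info);
  }

  // Called when the transaction carrying the queued writes commits.
  void commit() {
    for (auto& [key, info] : unstable_) durable_[key] = std::move(info);
    unstable_.clear();
  }

  // Check the unstable cache first so a queued write is never hidden by
  // older durable state.
  std::optional<EncodedInfo> lookup(const ObjectKey& key) const {
    if (auto it = unstable_.find(key); it != unstable_.end()) return it->second;
    if (auto it = durable_.find(key); it != durable_.end()) return it->second;
    return std::nullopt;
  }

  // Peering reset: results describe an old interval, so drop everything.
  void on_peering_reset() {
    unstable_.clear();
    durable_.clear();
  }

  // Number of distinct keys across both layers.
  std::size_t size() const {
    std::size_t n = durable_.size();
    for (const auto& kv : unstable_)
      if (!durable_.count(kv.first)) ++n;
    return n;
  }

 private:
  std::map<ObjectKey, EncodedInfo> durable_;
  std::map<ObjectKey, EncodedInfo> unstable_;
};
```

Keeping only the small unstable set in memory, rather than the full result set, is what lets the bulk of the scrub results live on disk.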

Next, I suggest adding a librados interface to ferry that information out to the user:

/// get currently inconsistent pgs
void get_inconsistent_pgs(
  pg_t last,          ///< [in] list pgs > last
  list<pg_t> *out     ///< [out] listed pgs
  );

/// get information about inconsistent objects in a pg
bool query_inconsistent_pg(
  pg_t to_query,                  ///< [in] pg to query
  pair<string, string> last,      ///< [in] begin listing objects > last, keyed by (locator, object)
  epoch_t *activation_epoch,      ///< [out] activation epoch for the interval in which this query was serviced
  list<pair<pair<string, string>, inconsistent_info_t>> *out  ///< [out] listed inconsistency information
  ); ///< @return true iff primary has a populated scrub info structure
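The `last` parameter implies cursor-style paging. The sketch below shows how a client might drain one pg's results with it, using a fake primary in place of the proposed librados call. The page size, the use of an empty (locator, object) pair as the starting cursor, and the policy of failing (so the caller can retry) when activation_epoch changes between pages are assumptions for illustration, not part of the proposal; pg_t, epoch_t and inconsistent_info_t are simplified stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the types named in the proposal.
using pg_t = uint32_t;
using epoch_t = uint32_t;
using obj_key_t = std::pair<std::string, std::string>;  // (locator, object)
struct inconsistent_info_t { std::string summary; };
using entry_t = std::pair<obj_key_t, inconsistent_info_t>;

// Fake primary: returns at most page_size entries with key > last.
struct FakePrimary {
  std::vector<entry_t> results;  // kept sorted by key
  epoch_t activation_epoch = 10;
  bool has_scrub_info = true;
  std::size_t page_size = 2;

  bool query_inconsistent_pg(pg_t /*to_query*/, obj_key_t last,
                             epoch_t* epoch_out, std::list<entry_t>* out) const {
    *epoch_out = activation_epoch;
    if (!has_scrub_info) return false;
    for (const auto& e : results)
      if (e.first > last && out->size() < page_size) out->push_back(e);
    return true;
  }
};

// Page through every inconsistent object in one pg. Returns false if the
// primary has no scrub info or if the interval changed mid-listing
// (assumed policy: the caller restarts the listing).
bool list_all(const FakePrimary& primary, pg_t pg, epoch_t* epoch_out,
              std::vector<entry_t>* all) {
  all->clear();
  obj_key_t last;  // ("", "") sorts before any real object name
  bool started = false;
  while (true) {
    std::list<entry_t> page;
    epoch_t epoch = 0;
    if (!primary.query_inconsistent_pg(pg, last, &epoch, &page)) return false;
    if (started && epoch != *epoch_out) return false;  // new interval
    *epoch_out = epoch;
    started = true;
    if (page.empty()) return true;  // listing complete
    last = page.back().first;       // advance the cursor
    all->insert(all->end(), page.begin(), page.end());
  }
}
```

The epoch reported alongside the listing is the value a client would later pass to repair_inconsistent_object.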

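inconsistent_info_t will include all relevant information about each inconsistent object, such as what each replica reported for the data payload, xattrs, and omap, so the user can see which replicas disagree.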
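The proposal leaves the contents of inconsistent_info_t open. One possible shape, sketched below, stores what each replica reported during scrub and offers a helper that lists the replicas differing from a chosen reference. Every field name, the use of a plain int32_t as replica_t, and the 32-bit digests are hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical replica identifier (for example, an OSD id).
using replica_t = int32_t;

// What one replica reported for the object during scrub (illustrative fields).
struct replica_scrub_info_t {
  bool missing = false;
  uint64_t size = 0;
  uint32_t data_digest = 0;
  uint32_t omap_digest = 0;
  std::map<std::string, std::string> xattrs;
};

// One possible inconsistent_info_t: per-replica observations plus a helper.
struct inconsistent_info_t {
  std::map<replica_t, replica_scrub_info_t> replicas;

  // Replicas that differ from `reference` in data, omap, or xattrs.
  std::vector<replica_t> replicas_differing_from(replica_t reference) const {
    std::vector<replica_t> out;
    const replica_scrub_info_t& ref = replicas.at(reference);
    for (const auto& [id, info] : replicas) {
      if (id == reference) continue;
      bool differs = info.missing != ref.missing || info.size != ref.size ||
                     info.data_digest != ref.data_digest ||
                     info.omap_digest != ref.omap_digest ||
                     info.xattrs != ref.xattrs;
      if (differs) out.push_back(id);  // map order: ascending replica id
    }
    return out;
  }
};
```

A user picking replica_to_use for a directed repair would look for the replica that agrees with the majority, or the one known to be good.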

/// allows directed repair of an object
void repair_inconsistent_object(
  pg_t pg_to_repair,               ///< [in] pg in which we want to perform a repair
  pair<string, string> to_repair,  ///< [in] object to repair
  replica_t replica_to_use,        ///< [in] replica to use for repair
  epoch_t activation_epoch,        ///< [in] from query_inconsistent_pg
  ...                              ///< stuff to support async?
  );

The repair_inconsistent_object result machinery will report -EAGAIN (or some other error) if activation_epoch predates the activation_epoch of the primary's current interval. This ensures that replica_to_use, chosen from the query results, is not out of date.
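The guard described above is a single comparison on the primary, paired with a re-query and retry on the client. The sketch below shows both; PrimaryState, check_repair_epoch and repair_with_retry are hypothetical names, and the retry loop is an assumed client policy rather than part of the proposal.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

using epoch_t = uint32_t;

// Hypothetical primary-side state: activation epoch of the current interval.
struct PrimaryState {
  epoch_t current_activation_epoch;
};

// Reject a directed repair whose activation_epoch predates the current
// interval: replica_to_use was chosen from stale scrub results.
int check_repair_epoch(const PrimaryState& pg, epoch_t activation_epoch) {
  if (activation_epoch < pg.current_activation_epoch) return -EAGAIN;
  return 0;
}

// Client-side sketch: re-query (which refreshes the epoch and, in practice,
// the choice of replica) and retry while the primary answers -EAGAIN.
template <typename QueryFn, typename RepairFn>
int repair_with_retry(QueryFn query, RepairFn repair, int max_attempts) {
  int r = -EAGAIN;
  for (int i = 0; i < max_attempts && r == -EAGAIN; ++i) {
    epoch_t epoch = query();
    r = repair(epoch);
  }
  return r;
}
```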

We will also need interfaces for reading from a specific replica_t, but we do not want to duplicate all of the existing read interfaces. Perhaps an ioctx method could set replica_to_use, pg_to_repair, and activation_epoch for subsequent reads?
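To make the ioctx idea concrete, the sketch below uses a toy FakeIoCtx (not librados::IoCtx) with a hypothetical set_replica_read_target/clear_replica_read_target pair. Existing read paths stay unchanged and merely consult the override when building a request. The scoped guard is an assumed design choice that keeps a directed read from leaking into later, unrelated reads on the same ioctx.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

using pg_t = uint32_t;
using epoch_t = uint32_t;
using replica_t = int32_t;

// Hypothetical per-ioctx override for directed reads.
struct replica_read_target_t {
  pg_t pg;
  replica_t replica;
  epoch_t activation_epoch;
};

// Toy ioctx exposing only the proposed knob.
class FakeIoCtx {
 public:
  void set_replica_read_target(const replica_read_target_t& t) { target_ = t; }
  void clear_replica_read_target() { target_.reset(); }

  // Stands in for any existing read call: it only checks the override.
  std::string describe_read(const std::string& oid) const {
    if (!target_) return "read " + oid + " from primary";
    return "read " + oid + " from replica " + std::to_string(target_->replica);
  }

 private:
  std::optional<replica_read_target_t> target_;
};

// RAII guard: the override is cleared when the guard goes out of scope.
class ScopedReplicaTarget {
 public:
  ScopedReplicaTarget(FakeIoCtx& io, const replica_read_target_t& t) : io_(io) {
    io_.set_replica_read_target(t);
  }
  ~ScopedReplicaTarget() { io_.clear_replica_read_target(); }
  ScopedReplicaTarget(const ScopedReplicaTarget&) = delete;
  ScopedReplicaTarget& operator=(const ScopedReplicaTarget&) = delete;

 private:
  FakeIoCtx& io_;
};
```

Because the override carries activation_epoch, a real implementation could reject directed reads from a stale interval in the same way as directed repairs.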