Osd - prepopulate pg temp


Pre-populate the pg_temp mapping in the OSDMap when there are large changes in the CRUSH map.


  • Sage Weil (Inktank)

Interested Parties

  • Guang Yang (Yahoo!)

Current Status

Normally, when there is a major change (such as a CRUSH rule change or the reweighting of an entire rack), many PG primaries get remapped to devices that do not have the content. Each such primary sends a request to the monitor to add a pg_temp exception that remaps the PG to its previous location. This delays availability, especially when many PGs are affected and the monitors have to process a large number of messages to add the remappings.

Detailed Description

Instead of waiting for the OSDs to request an exception, the monitor could (optionally) prepopulate pg_temp after a CRUSH map change. This minimizes (or eliminates) any lapse in availability (no I/O stalls) at the expense of monitor CPU time spent calculating the mappings.
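
A minimal sketch of the idea, using a toy model rather than the real OSDMap: a mapping is assumed to be a plain dict from PG id to a list of OSD ids, and the hypothetical helper `prepopulate_pg_temp` pins every PG whose OSD set changed to its previous OSDs. Whether to pin on any change or only on a primary change is left open here; this sketch assumes any change.

```python
# Toy model (assumption): a mapping is {pg_id: [osd ids]}, first entry is the primary.
def prepopulate_pg_temp(old_mapping, new_mapping, pg_temp=None):
    """Return pg_temp entries that pin remapped PGs to their previous OSDs."""
    pg_temp = dict(pg_temp or {})  # copy so existing entries are preserved
    for pg, new_osds in new_mapping.items():
        old_osds = old_mapping.get(pg)
        # Only pin PGs that existed before, actually moved, and are not already pinned.
        if old_osds and old_osds != new_osds and pg not in pg_temp:
            pg_temp[pg] = list(old_osds)
    return pg_temp


old = {"1.0": [0, 1, 2], "1.1": [3, 4, 5], "1.2": [1, 2, 3]}
new = {"1.0": [6, 1, 2], "1.1": [3, 4, 5], "1.2": [7, 8, 3]}
pg_temp = prepopulate_pg_temp(old, new)
print(pg_temp)  # {'1.0': [0, 1, 2], '1.2': [1, 2, 3]}
```

PG 1.1 is untouched because its mapping did not change, so it needs no exception.
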
Key considerations:
  • what triggers the mon to calculate pg mappings? pg_pool_t property change? CRUSH map change?
  • how do we prevent that work from disrupting ongoing mon work?
    • async worker thread that may/may not come back with useful work before the paxos round gets proposed?
  • ensure that the test suite (e.g. teuthology thrashosds) is making changes that trigger said remapping
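
The async worker consideration above could look roughly like the following sketch. It is not monitor code: `compute_pg_temp` and `propose` are hypothetical stand-ins, the sleep simulates the mapping calculation, and the wait budget is an assumed parameter. The point is that the proposal takes the worker's result only if it is ready in time, and otherwise goes ahead without it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def compute_pg_temp(delay):
    """Stand-in for the expensive mapping calculation (hypothetical)."""
    time.sleep(delay)
    return {"1.0": [0, 1, 2]}


def propose(pending, future, wait):
    """Fold pg_temp results into the pending update only if ready in time."""
    try:
        pending["pg_temp"] = future.result(timeout=wait)
    except FutureTimeout:
        # Worker too slow: propose without priming; OSDs would fall back
        # to requesting pg_temp themselves, as they do today.
        pending["pg_temp"] = {}
    return pending


with ThreadPoolExecutor(max_workers=1) as pool:
    fast = propose({"epoch": 10}, pool.submit(compute_pg_temp, 0.0), wait=1.0)
    slow = propose({"epoch": 11}, pool.submit(compute_pg_temp, 0.5), wait=0.05)

print(fast["pg_temp"], slow["pg_temp"])
```

A design like this never blocks the proposal for longer than the wait budget, at the cost of sometimes discarding work the thread has already started.
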

Work items

Coding tasks

  1. mon: build predicate to determine when to calculate mappings
    1. add config options controlling this as appropriate
  2. mon: calculate mappings and pre-populate pg_temp
  3. mon: push calculation onto an async worker thread that can run in parallel with real work
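
As a rough illustration of coding task 1, the predicate could be a small function driven by config options. The option names below (`enabled`, `on_crush_change`, `on_pool_change`) are invented for this sketch and do not refer to existing Ceph settings.

```python
from dataclasses import dataclass


@dataclass
class PrimeConfig:
    # All option names here are hypothetical.
    enabled: bool = True          # master switch for pg_temp priming
    on_crush_change: bool = True  # prime when the CRUSH map changed
    on_pool_change: bool = True   # prime when a pg_pool_t property changed


def should_prime(cfg, crush_changed, pool_changed):
    """Decide whether the monitor should calculate mappings for this update."""
    if not cfg.enabled:
        return False
    return (cfg.on_crush_change and crush_changed) or (
        cfg.on_pool_change and pool_changed
    )


print(should_prime(PrimeConfig(), crush_changed=True, pool_changed=False))
print(should_prime(PrimeConfig(enabled=False), crush_changed=True, pool_changed=True))
```
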

Build / release tasks

  1. teuthology: ensure thrashosds exercises new feature