Bug #64802

Updated by Samuel Just 2 months ago

PeeringState::calc_replicated_acting_stretch encodes special behavior for stretch clusters which prohibits the primary from selecting a pg_temp which packs too many OSDs into the same crush bucket. 

An example of the scenario we're worried about here would be the following sequence of events on a cluster with 6 replicas spread across 3 DCs and a min_size of 3 intended to prevent the pg from going active with only a single dc available (the gap this leaves is illustrated just after the list): 
1. pg is originally mapped to osds [a1, a2, b1, b2, c1, c2], where a1 and a2 are in dc a, b1 and b2 are in dc b, and c1 and c2 are in dc c. 
 2. dcs a and b become unavailable (let's say power failed completely) 
 3. the pg acting set becomes [c1, c2] 
 4. the raw mapping (for whatever reason) shifts to [a1, a2, b1, b2, c1, c3] resulting in an acting set of [c1, c3] 
 5. c1 requests a temp mapping for [c1, c2] while backfilling [c3] 
6. upon backfill completion, c1 requests a temp mapping of [c1, c3, c2]. Because the acting set now has 3 members, the pg goes active and accepts writes 
 7. dc c loses power, but dcs a and b come back online 
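
To make the danger in step 6 concrete, here is a minimal illustration (plain standalone C++, not Ceph code; the osd-to-dc table is just the scenario above) of why a min_size check on its own is satisfied by an acting set that lives entirely inside dc c:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

int main() {
  // osd -> datacenter placement for the osds left in the scenario (step 6).
  std::map<std::string, std::string> dc_of = {
    {"c1", "c"}, {"c2", "c"}, {"c3", "c"},
  };
  std::vector<std::string> acting = {"c1", "c3", "c2"};  // acting set at step 6
  const std::size_t min_size = 3;

  // The only thing a min_size check looks at: the number of acting members.
  assert(acting.size() >= min_size);  // passes, so the pg goes active

  // What it never looks at: how those members spread across datacenters.
  std::map<std::string, std::size_t> per_dc;
  for (const auto& osd : acting)
    ++per_dc[dc_of[osd]];
  assert(per_dc["c"] == acting.size());  // every replica is behind dc c's power
}
```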

The user expectation is that with dcs a and b back, the cluster should be fully available, though degraded. The actual outcome is that the above pg would be unable to complete peering, because there would be an interval that might have gone active from which no osds can be contacted. 

Stretch mode has a solution for this problem. PeeringState::calc_replicated_acting_stretch considers a few additional parameters to avoid the above situation: 
 - pg_pool_t::peering_crush_bucket_barrier 
 - pg_pool_t::peering_crush_bucket_target 

pg_pool_t::peering_crush_bucket_target is the intended number of buckets of type pg_pool_t::peering_crush_bucket_barrier over which the acting set should be split. calc_replicated_acting_stretch uses these two values and pg_pool_t::size to derive bucket_max, the maximum allowable number of acting set members from any single bucket of type pg_pool_t::peering_crush_bucket_barrier (note to reader: please confirm everything I've indicated here independently :). 
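
In the spirit of that note, the following is only a sketch of my reading of the derivation, to be confirmed against calc_replicated_acting_stretch: bucket_max looks like the ceiling of size divided by peering_crush_bucket_target, and derive_bucket_max below is a made-up name for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a Ceph function: ceil(size / bucket_target).
static std::uint32_t derive_bucket_max(std::uint32_t size,
                                       std::uint32_t bucket_target) {
  return size / bucket_target + (size % bucket_target != 0);
}

int main() {
  // Scenario above: size=6, peering_crush_bucket_target=3 (datacenters).
  const std::uint32_t bucket_max = derive_bucket_max(6, 3);
  assert(bucket_max == 2);
  // The step-6 acting set [c1, c3, c2] would place 3 members in dc c,
  // exceeding bucket_max, so that pg_temp would not be selected.
}
```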

So far so good. The problem is that there is no way to set these values independently of the larger stretch_mode feature. OSDMonitor::prepare_new_pool and OSDMonitor::try_enable_stretch_mode together ensure that these values are set uniformly on all pools if the cluster as a whole has stretch mode enabled (OSDMap::stretch_mode_enabled). 
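
A minimal sketch of that coupling, assuming the cluster-wide barrier type and bucket count are carried on the OSDMap next to OSDMap::stretch_mode_enabled; everything below other than the identifiers quoted from the code above is invented for illustration:

```cpp
#include <cstdint>

struct PoolLike {  // stand-in for the relevant pg_pool_t fields
  std::uint32_t peering_crush_bucket_barrier = 0;
  std::uint32_t peering_crush_bucket_target = 0;
};

struct OSDMapLike {  // stand-in for the relevant OSDMap state
  bool stretch_mode_enabled = false;        // OSDMap::stretch_mode_enabled
  std::uint32_t stretch_bucket_type = 0;    // assumed cluster-wide barrier type
  std::uint32_t stretch_bucket_count = 0;   // assumed cluster-wide bucket count
};

// Roughly what prepare_new_pool / try_enable_stretch_mode amount to for these
// two fields: they are filled in uniformly from the cluster-wide stretch
// state, and there is no path that sets them on an individual pool alone.
void apply_peering_crush_values(const OSDMapLike& osdmap, PoolLike& pool) {
  if (osdmap.stretch_mode_enabled) {
    pool.peering_crush_bucket_barrier = osdmap.stretch_bucket_type;
    pool.peering_crush_bucket_target  = osdmap.stretch_bucket_count;
  }
  // else: both fields stay 0 and the bucket_max cap never comes into play.
}

int main() {
  OSDMapLike osdmap;  // stretch mode off
  PoolLike pool;
  apply_peering_crush_values(osdmap, pool);
  // pool.peering_crush_bucket_* remain 0: a pool cannot opt in by itself.
}
```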

 I think the solution is likely as simple as allowing the user to specify pg_pool_t::peering_crush_bucket_barrier and pg_pool_t::peering_crush_bucket_target upon pool creation or afterwards without enabling stretch mode, but testing will need to be done (and added to teuthology!!) to ensure that there aren't hidden assumptions linking these to stretch mode. 

 Steps: 
 - audit code to ensure that the above assumptions are correct 
 - list areas that need to be changed 
 - create plan for teuthology test 
 - create test, confirm that it fails without changes 
 - create code changes
