Bug #65371

Updated by Kamoltat (Junior) Sirivadhna about 1 month ago

I noticed that in the final stage of the function PeeringState::calc_replicated_acting_stretch we populate the acting set before checking whether the next OSD we are about to add will exceed the bucket_max we impose on the number of OSDs that share the same ancestor. This leads to a scenario where we end up with 3 OSDs from the same data center. Here is the evidence from a local test I ran where we enter the stretch pool and 2 DCs are down:

 <pre> 
 calc_replicated_acting_stretch 
 bucket_max: 2 
  osd 8 primary accepted 15.6( v 971'2 (0'0,971'2] local-lis/les=982/983 n=0 ec=964/964 lis/c=982/969 les/c/f=983/971/0 sis=984) 
  osd 8 (up) accepted 15.6( v 971'2 (0'0,971'2] local-lis/les=982/983 n=0 ec=964/964 lis/c=982/969 les/c/f=983/971/0 sis=984) 
  osd 6 (up) accepted 15.6( v 971'2 (0'0,971'2] local-lis/les=982/983 n=0 ec=964/964 lis/c=982/969 les/c/f=983/971/0 sis=984) 
 want: [8,6] 
 acting: [8,6,7] 
 ancestors: {-9=candidates[<>]} 
  up set insufficient, considering remaining osds 
  acting candidate 7 15.6( v 971'2 (0'0,971'2] local-lis/les=982/983 n=0 ec=964/964 lis/c=982/969 les/c/f=983/971/0 sis=984) 
  next: candidates[<0,971'2,7>] 
 pop_ancestor accepting candidate 7 
 want is now: [8,6,7] 
 acting_backfill is now: 6,7,8 
  num_selected: 3 
 </pre> 

 Now, we actually get away with this because of the check in:

 <pre> 
  bool acting_set_writeable() { 
    return (actingset.size() >= pool.info.min_size) && 
      (pool.info.stretch_set_can_peer(acting, *get_osdmap(), NULL)); 
  } 
 </pre> 

 actingset.size() is definitely >= pool.info.min_size (assuming min_size=3).
 However, we only go active if `stretch_set_can_peer` also returns true, and in this case it returns false.

 Therefore, instead of this: 

 <pre> 
   while (!aheap.is_empty() && want->size() < pool.info.size) {
     auto next = aheap.pop();
     pop_ancestor(next.get());
     if (next.get().get_num_selected() < bucket_max) {
       aheap.push_if_nonempty(next);
     }
   }
 </pre> 

 we should do this:

 <pre> 
   while (!aheap.is_empty() && want->size() < pool.info.size) { 
     auto next = aheap.pop(); 
     if (next.get().get_num_selected() < bucket_max) { 
       pop_ancestor(next.get()); 
       aheap.push_if_nonempty(next); 
     } 
   } 
 </pre>
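
 To make the difference between the two loops concrete, here is a minimal standalone sketch. It is not the Ceph code: `Bucket`, `take_one`, `fill_check_after` and `fill_check_before` are hypothetical names, a plain deque stands in for `aheap`, and the pool size of 4 used in the usage note is an assumption. It only models the order of the bucket_max check relative to the selection.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for one ancestor (e.g. a data center).
struct Bucket {
  std::size_t num_selected;    // OSDs already selected from this ancestor
  std::deque<int> candidates;  // remaining candidate OSD ids
};

// Moves one candidate into `want`, loosely modelling pop_ancestor.
inline void take_one(Bucket& b, std::vector<int>& want) {
  assert(!b.candidates.empty());
  want.push_back(b.candidates.front());
  b.candidates.pop_front();
  ++b.num_selected;
}

// Ordering of the current loop: select first, compare with bucket_max after.
inline std::vector<int> fill_check_after(std::vector<int> want,
                                         std::deque<Bucket> pending,
                                         std::size_t pool_size,
                                         std::size_t bucket_max) {
  while (!pending.empty() && want.size() < pool_size) {
    Bucket next = pending.front();
    pending.pop_front();
    take_one(next, want);
    if (next.num_selected < bucket_max && !next.candidates.empty()) {
      pending.push_back(next);
    }
  }
  return want;
}

// Ordering of the proposed loop: compare with bucket_max before selecting.
inline std::vector<int> fill_check_before(std::vector<int> want,
                                          std::deque<Bucket> pending,
                                          std::size_t pool_size,
                                          std::size_t bucket_max) {
  while (!pending.empty() && want.size() < pool_size) {
    Bucket next = pending.front();
    pending.pop_front();
    if (next.num_selected < bucket_max) {
      take_one(next, want);
      if (!next.candidates.empty()) {
        pending.push_back(next);
      }
    }
    // A bucket that already reached bucket_max is simply dropped.
  }
  return want;
}
```

 Fed with the situation from the log above (want already holds 8 and 6 from one ancestor, OSD 7 is the only remaining candidate in that same ancestor, bucket_max is 2), the first variant returns [8,6,7] while the second leaves want at [8,6].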
