Feature #2944

Updated by Sage Weil over 11 years ago

Basically: 
 1) When an OSD boots, record whether it reports itself as fresh or as 
 wrongly marked down. From that data, maintain the probability that the 
 OSD is laggy rather than actually down, using an exponential decay 
 (more recent reports matter more), and track how long the OSD was 
 laggy in those cases. 
 2) When a sufficient number of failure reports come in to mark an OSD 
 down, additionally compute the laggy probability and laggy interval 
 for the reporters in aggregate. 
 3) Adjust the "heartbeat grace" locally on the monitor according to 
 the following formula: 
     adjusted_heartbeat_grace = heartbeat_grace 
         + laggy_interval * (1 / laggy_probability) 
         + group_laggy_interval * (1 / group_laggy_probability) 
 4) If we reach the end of that adjusted heartbeat grace and have not 
 received failure cancellations, then mark the OSD down. (Cancellations 
 already exist: when an OSD gets a heartbeat from a peer that it has 
 reported down but that has not been marked down, it sends one.) 
 5) When running the out check, adjust the "down to out interval" by 
 the same ratio we've adjusted the heartbeat grace by. 
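
A minimal Python sketch of steps 1, 3 and 5 above, for illustration only. The names LaggyStats, record_boot, laggy_term and the smoothing weight are hypothetical and are not taken from the Ceph code. The exponential decay is approximated here with an exponentially weighted moving average, and a term whose probability is zero is skipped to avoid dividing by zero; both are assumptions. The grace formula itself is implemented as written in step 3.

```python
from dataclasses import dataclass


@dataclass
class LaggyStats:
    """Per-OSD (or per-group) laggy history. Hypothetical structure."""
    laggy_probability: float = 0.0  # decayed share of boots that were "wrongly marked down"
    laggy_interval: float = 0.0     # decayed length of the laggy periods, in seconds

    def record_boot(self, was_laggy: bool, down_seconds: float,
                    weight: float = 0.3) -> None:
        """Step 1: fold one boot report into the decayed statistics.

        `weight` is an assumed smoothing factor; a larger value makes
        recent reports matter more.
        """
        sample = 1.0 if was_laggy else 0.0
        self.laggy_probability = (1 - weight) * self.laggy_probability + weight * sample
        if was_laggy:
            # Only laggy boots say anything about how long lag lasts.
            self.laggy_interval = (1 - weight) * self.laggy_interval + weight * down_seconds


def laggy_term(interval: float, probability: float) -> float:
    """One term of the formula: interval * (1 / probability).

    Assumption: with no laggy history (probability 0) the term adds nothing.
    """
    if probability <= 0.0:
        return 0.0
    return interval * (1.0 / probability)


def adjusted_heartbeat_grace(heartbeat_grace: float,
                             osd: LaggyStats, group: LaggyStats) -> float:
    """Step 3: base grace plus the target's term and the reporters' aggregate term."""
    return (heartbeat_grace
            + laggy_term(osd.laggy_interval, osd.laggy_probability)
            + laggy_term(group.laggy_interval, group.laggy_probability))


def adjusted_down_out_interval(down_out_interval: float,
                               heartbeat_grace: float,
                               adjusted_grace: float) -> float:
    """Step 5: scale the down-to-out interval by the same ratio as the grace."""
    return down_out_interval * (adjusted_grace / heartbeat_grace)


if __name__ == "__main__":
    osd = LaggyStats()
    osd.record_boot(was_laggy=True, down_seconds=10.0, weight=0.5)
    reporters = LaggyStats(laggy_probability=0.25, laggy_interval=2.0)

    grace = adjusted_heartbeat_grace(20.0, osd, reporters)
    print("adjusted grace:", grace)
    print("adjusted down-to-out:", adjusted_down_out_interval(300.0, 20.0, grace))
```

Step 4 (waiting out the adjusted grace and honouring cancellations) is left out because it depends on monitor state that the sketch does not model.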
