Feature #2944
mon: dynamically adjust heartbeat grace
Status: closed
Description
Basically:
1) When an OSD boots, record whether it reports itself as fresh or as
wrongly marked down. From that data, maintain the probability that the
OSD is actually down versus laggy, using an exponential decay so that
more recent reports matter more, and maintain the length of time the
OSD was laggy in those cases.
2) When enough failure reports come in to mark an OSD down,
additionally compute the laggy probability and laggy interval for the
reporters in aggregate.
3) Adjust the "heartbeat grace" locally on the monitor according to
the following formula:
adjusted_heartbeat_grace = heartbeat_grace
    + laggy_interval * (1 / laggy_probability)
    + group_laggy_interval * (1 / group_laggy_probability)
4) If we reach the end of that adjusted heartbeat grace without having
received a failure cancellation, mark the OSD down. (Cancellations
already exist: when an OSD gets a heartbeat from a node it has reported
down but which isn't marked down, it sends a cancellation.)
5) When running the out check, adjust the "down to out interval" by
the same ratio by which we adjusted the heartbeat grace.
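The steps above can be sketched roughly as follows. This is only an illustration in Python, not the monitor's actual code: the LaggyStats class, the weight parameter, the use of a plain mean for the reporters "in aggregate", and the choice to add nothing when there is no laggy history (which also avoids dividing by a zero probability) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LaggyStats:
    """Per-OSD laggy history (hypothetical structure for this sketch)."""
    laggy_probability: float = 0.0
    laggy_interval: float = 0.0

    def record_boot(self, was_laggy: bool, down_seconds: float,
                    weight: float = 0.3) -> None:
        # Step 1: exponentially weighted average, so recent boots count more.
        # 'weight' is an assumed tuning knob, not a value from the ticket.
        sample = 1.0 if was_laggy else 0.0
        self.laggy_probability = ((1 - weight) * self.laggy_probability
                                  + weight * sample)
        if was_laggy:
            self.laggy_interval = ((1 - weight) * self.laggy_interval
                                   + weight * down_seconds)


def aggregate(reporters: list[LaggyStats]) -> LaggyStats:
    # Step 2: the ticket does not define "in aggregate"; a mean is assumed.
    if not reporters:
        return LaggyStats()
    n = len(reporters)
    return LaggyStats(
        laggy_probability=sum(r.laggy_probability for r in reporters) / n,
        laggy_interval=sum(r.laggy_interval for r in reporters) / n,
    )


def laggy_term(stats: LaggyStats) -> float:
    # One term of the formula: interval * (1 / probability).
    # Assumption: with no laggy history the term contributes nothing.
    if stats.laggy_interval <= 0 or stats.laggy_probability <= 0:
        return 0.0
    return stats.laggy_interval * (1.0 / stats.laggy_probability)


def adjusted_heartbeat_grace(heartbeat_grace: float, target: LaggyStats,
                             group: LaggyStats) -> float:
    # Step 3: base grace plus the target's term plus the reporters' term.
    return heartbeat_grace + laggy_term(target) + laggy_term(group)


def adjusted_down_out_interval(down_out_interval: float,
                               heartbeat_grace: float,
                               adjusted_grace: float) -> float:
    # Step 5: scale by the same ratio applied to the heartbeat grace.
    return down_out_interval * (adjusted_grace / heartbeat_grace)
```

For example, an OSD that was wrongly marked down once for 10 seconds, recorded with weight=0.5, ends up with a laggy probability of 0.5 and a laggy interval of 5 seconds. With a base grace of 20 seconds and reporters that have no laggy history, the adjusted grace is 30 seconds, and a 300 second down-to-out interval scales to 450 seconds.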
Updated by Sage Weil over 11 years ago
- Story points set to 21
Updated by Sage Weil over 11 years ago
- Story points changed from 21 to 0