Bug #57852
osd: unhealthy osd cannot be marked down in time
Description
Before an unhealthy OSD is marked down by the mon, other OSDs may choose it as
a heartbeat peer and then report an incorrect failure time (first_tx) to the mon.
Steps to reproduce:
Shut down the cluster_network and public_network of an OSD node several times.
History
#1 Updated by Radoslaw Zarzynski 4 months ago
- Status changed from New to Need More Info
Could you please clarify a bit? Do you mean there are some extra, unnecessary (from the POV of judging whether an OSD is down or not) messages that just update the markdown timestamp?
#2 Updated by wencong wan 4 months ago
Radoslaw Zarzynski wrote:
Could you please clarify a bit? Do you mean there are some extra, unnecessary (from the POV of judging whether an OSD is down or not) messages that just update the markdown timestamp?
Whether an OSD is down or not is determined by the mon. If either of the following two conditions is met, the mon will mark an OSD as down.
1. The mon has not received an osd_beacon message from the OSD for more than 900s (mon_osd_report_timeout).
2. The mon has received failure_report messages from 2 (mon_osd_min_down_reporters) OSDs on different hosts (mon_osd_reporter_subtree_level) and the fault has lasted for a period of time (now - fi.get_failed_since() > grace).
get_failed_since returns the maximum (latest) failure time among all reporters. If some OSDs choose the unhealthy OSD as a heartbeat peer, they will never receive a heartbeat reply from it, so they report the time they first sent a heartbeat (first_tx) as the failure time of the unhealthy OSD. Each such report pushes the maximum forward, so the condition "now - fi.get_failed_since() > grace" cannot be met.
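To make condition 2 concrete, here is a minimal Python sketch of the check as described above. It is a toy model, not the actual mon code: the names FailureInfo, report and should_mark_down are hypothetical, the grace is assumed to be a fixed 20 seconds, and the reporter subtree level is simplified to counting distinct host names.

```python
from dataclasses import dataclass, field

GRACE = 20.0            # assumption: a fixed grace period, in seconds
MIN_DOWN_REPORTERS = 2  # mon_osd_min_down_reporters, as quoted above


@dataclass
class FailureInfo:
    """Hypothetical stand-in for the mon's per-OSD failure record."""
    # reporter name -> (host, failed_since)
    reporters: dict = field(default_factory=dict)

    def report(self, reporter, host, failed_since):
        # Store (or overwrite) the failure time claimed by this reporter.
        self.reporters[reporter] = (host, failed_since)

    def get_failed_since(self):
        # The latest failure time among all reporters.
        return max(t for _, t in self.reporters.values())

    def reporter_hosts(self):
        # Distinct hosts that have reported the failure.
        return {h for h, _ in self.reporters.values()}


def should_mark_down(fi, now):
    """Condition 2: enough reporters on different hosts, and the fault is old enough."""
    if len(fi.reporter_hosts()) < MIN_DOWN_REPORTERS:
        return False
    return now - fi.get_failed_since() > GRACE


fi = FailureInfo()
fi.report("osd.1", "host-a", 0.0)
fi.report("osd.2", "host-b", 0.0)
before = should_mark_down(fi, now=25.0)  # 25 - 0 > 20

# A new peer picks the unhealthy OSD and reports its first_tx (10.0).
fi.report("osd.3", "host-c", 10.0)
after = should_mark_down(fi, now=25.0)   # 25 - 10 is not > 20
```

In this model, `before` is True and `after` is False: the late report from osd.3 moves the maximum failure time forward and postpones the markdown, which is the behaviour described in this comment.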
#3 Updated by Radoslaw Zarzynski 3 months ago
- Status changed from Need More Info to New
Thanks for the detailed explanation!
#4 Updated by Radoslaw Zarzynski 3 months ago
- Assignee set to Prashant D
Not something we introduced recently, but still worth taking a look at if nothing urgent is on the plate.
#5 Updated by Prashant D 26 days ago
Sure Radek. Let me have a look at this.
#6 Updated by Radoslaw Zarzynski 22 days ago
- Status changed from New to In Progress