Bug #45943: Ceph Monitor heartbeat grace period does not reset.
Status: Closed
% Done: 0%
Source: Development
Tags:
Backport: nautilus, octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component (RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
The heartbeat grace timer does not reset even after the cluster network has been stable for multiple days. Implement a mechanism to reset it.
See https://bugzilla.redhat.com/show_bug.cgi?id=1814057 for more details.
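The behaviour under discussion can be summarised as follows: the effective grace the monitor applies before marking an OSD down is the base grace plus an adjustment derived from that OSD's "laggy" history, and the fix clears that history once the OSD has gone a threshold interval without failures. A minimal sketch (function and variable names here are hypothetical, not the actual Ceph code; the real adjustment formula in OSDMonitor is more involved):

```python
# Hypothetical sketch of the adaptive heartbeat grace and its reset.
# Not the actual Ceph implementation; names and the simple
# probability * interval adjustment are illustrative assumptions.

OSD_HEARTBEAT_GRACE = 20.0            # base grace in seconds
GRACE_INTERVAL_THRESHOLD_SECS = 900   # failure-free interval that triggers a reset

def effective_grace(base, laggy_probability, laggy_interval):
    """Effective grace = base + adjustment from the OSD's laggy history."""
    return base + laggy_probability * laggy_interval

def maybe_reset_laggy(xi, last_failed_interval,
                      threshold=GRACE_INTERVAL_THRESHOLD_SECS):
    """If the OSD has been failure-free longer than the threshold,
    clear its laggy parameters so the grace decays back to the base."""
    if last_failed_interval > threshold:
        xi["laggy_probability"] = 0.0
        xi["laggy_interval"] = 0
    return xi
```

With the laggy parameters cleared, the next failure is judged against the plain base grace (the `20.000000 grace (20.000000 + 0 + 0)` lines in the test logs below), instead of an adjustment that only ever grows.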
Updated by Sridhar Seshasayee almost 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 35490
Updated by Sridhar Seshasayee almost 4 years ago
Updates from testing the fix:
OSD failure before being marked down:
2020-06-22T10:16:28.046+0530 7fae6eb1e700 10 mon.a@0(leader).osd e77 osd.0 has 1 reporters, 20.747132 grace (20.000000 + 0.747132 + 0), max_failed_since 2020-06-22T10:16:06.045063+0530
2020-06-22T10:16:28.169+0530 7fae71323700 10 mon.a@0(leader).osd e77 osd.0 has 1 reporters, 20.747061 grace (20.000000 + 0.747061 + 0), max_failed_since 2020-06-22T10:16:06.045063+0530
2020-06-22T10:16:30.476+0530 7fae6eb1e700 10 mon.a@0(leader).osd e77 osd.0 has 2 reporters, 20.745981 grace (20.000000 + 0.745981 + 0), max_failed_since 2020-06-22T10:16:06.473724+0530
2020-06-22T10:16:30.476+0530 7fae6eb1e700 0 log_channel(cluster) log [INF] : osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 24.003151 >= grace 20.745981)
2020-06-22T10:16:30.479+0530 7fae6eb1e700 10 mon.a@0(leader).log v1243 logging 2020-06-22T10:16:30.477798+0530 mon.a (mon.0) 340 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 24.003151 >= grace 20.745981)
2020-06-22T10:16:31.353+0530 7fae71323700 10 mon.a@0(leader).osd e77 encode_pending pending_inc.modified: 2020-06-22T10:16:31.354203+0530
2020-06-22T10:16:31.359+0530 7fae71323700 10 mon.a@0(leader).osd e77 grace_interval_threshold_exceeded last_failed_interval = 28 grace_interval_threshold_secs = 900
2020-06-22T10:16:31.546+0530 7fae6d31b700 7 mon.a@0(leader).log v1244 update_from_paxos applying incremental log 1244 2020-06-22T10:16:30.477798+0530 mon.a (mon.0) 340 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 24.003151 >= grace 20.745981)
2020-06-22T10:16:32.385+0530 7fae71323700 10 mon.a@0(leader).osd e78 encode_pending pending_inc.modified: 2020-06-22T10:16:32.386958+0530
2020-06-22T10:16:32.906+0530 7fae6eb1e700 10 mon.a@0(leader).osd e79 laggy, now xi down_stamp 2020-06-22T10:16:31.354203+0530 laggy_probability 0.83193 laggy_interval 1 old_weight 0 last_purged_snaps_scrub 2020-06-22T09:36:26.557735+0530 dead_epoch 78
----NO FAILURES FOR 900 secs (Threshold time was reduced to enable faster testing)-----
OSD Failures introduced again after 900 secs:
2020-06-22T11:10:07.794+0530 7fae6eb1e700 10 mon.a@0(leader).osd e81 osd.0 has 1 reporters, 20.817951 grace (20.000000 + 0.817952 + 0), max_failed_since 2020-06-22T11:09:45.793483+0530
2020-06-22T11:10:09.508+0530 7fae71323700 10 mon.a@0(leader).osd e81 osd.0 has 1 reporters, 20.816872 grace (20.000000 + 0.816873 + 0), max_failed_since 2020-06-22T11:09:45.793483+0530
2020-06-22T11:10:12.515+0530 7fae6eb1e700 10 mon.a@0(leader).osd e81 osd.0 has 2 reporters, 20.816063 grace (20.000000 + 0.816064 + 0), max_failed_since 2020-06-22T11:09:47.512356+0530
2020-06-22T11:10:12.515+0530 7fae6eb1e700 0 log_channel(cluster) log [INF] : osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 25.002302 >= grace 20.816063)
2020-06-22T11:10:12.518+0530 7fae6eb1e700 10 mon.a@0(leader).log v2801 logging 2020-06-22T11:10:12.516583+0530 mon.a (mon.0) 366 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 25.002302 >= grace 20.816063)
2020-06-22T11:10:12.763+0530 7fae71323700 10 mon.a@0(leader).osd e81 encode_pending pending_inc.modified: 2020-06-22T11:10:12.764331+0530
Grace threshold exceeded and laggy params are reset:
2020-06-22T11:10:12.776+0530 7fae71323700 10 mon.a@0(leader).osd e81 grace_interval_threshold_exceeded last_failed_interval = 3221 grace_interval_threshold_secs = 900
2020-06-22T11:10:12.776+0530 7fae71323700 20 mon.a@0(leader).osd e81 reset laggy, now xi down_stamp 2020-06-22T11:10:12.764331+0530 laggy_probability 0 laggy_interval 0 old_weight 0 last_purged_snaps_scrub 2020-06-22T09:36:26.557735+0530 dead_epoch 78
2020-06-22T11:10:12.794+0530 7fae6d31b700 7 mon.a@0(leader).log v2802 update_from_paxos applying incremental log 2802 2020-06-22T11:10:12.516583+0530 mon.a (mon.0) 366 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 25.002302 >= grace 20.816063)
2020-06-22T11:10:13.630+0530 7fae6eb1e700 10 mon.a@0(leader).osd e82 laggy, now xi down_stamp 2020-06-22T11:10:12.764331+0530 laggy_probability 0.3 laggy_interval 0 old_weight 0 last_purged_snaps_scrub 2020-06-22T09:36:26.557735+0530 dead_epoch 82
2020-06-22T11:10:13.864+0530 7fae71323700 10 mon.a@0(leader).osd e82 encode_pending pending_inc.modified: 2020-06-22T11:10:13.865337+0530
2020-06-22T11:10:35.469+0530 7fae6eb1e700 10 mon.a@0(leader).osd e84 osd.0 has 1 reporters, 20.000000 grace (20.000000 + 0 + 0), max_failed_since 2020-06-22T11:10:14.468282+0530
2020-06-22T11:10:36.015+0530 7fae6eb1e700 10 mon.a@0(leader).osd e84 osd.0 has 2 reporters, 20.000000 grace (20.000000 + 0 + 0), max_failed_since 2020-06-22T11:10:15.015638+0530
2020-06-22T11:10:36.015+0530 7fae6eb1e700 0 log_channel(cluster) log [INF] : osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 21.000461 >= grace 20.000000)
2020-06-22T11:10:36.016+0530 7fae6eb1e700 10 mon.a@0(leader).log v2813 logging 2020-06-22T11:10:36.016457+0530 mon.a (mon.0) 379 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 21.000461 >= grace 20.000000)
2020-06-22T11:10:36.627+0530 7fae71323700 10 mon.a@0(leader).osd e84 encode_pending pending_inc.modified: 2020-06-22T11:10:36.628141+0530
2020-06-22T11:10:36.694+0530 7fae6d31b700 7 mon.a@0(leader).log v2814 update_from_paxos applying incremental log 2814 2020-06-22T11:10:36.016457+0530 mon.a (mon.0) 379 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 21.000461 >= grace 20.000000)
2020-06-22T11:10:38.720+0530 7fae6eb1e700 10 mon.a@0(leader).osd e86 laggy, now xi down_stamp 2020-06-22T11:10:36.628141+0530 laggy_probability 0.51 laggy_interval 0 old_weight 0 last_purged_snaps_scrub 2020-06-22T09:36:26.557735+0530 dead_epoch 86
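The `last_failed_interval = 3221` in the threshold-exceeded log line above can be cross-checked against the timestamps in the logs: it is the gap between the previous failure's `pending_inc.modified` stamp and the new one. A quick reconstruction (assuming that is indeed how the interval is measured; timezone offsets dropped since both stamps share +0530):

```python
from datetime import datetime

# Timestamps taken from the pasted mon logs (naive, same +0530 offset).
prev_failure = datetime.fromisoformat("2020-06-22T10:16:31.354203")
new_failure = datetime.fromisoformat("2020-06-22T11:10:12.764331")

interval = int((new_failure - prev_failure).total_seconds())
print(interval)          # 3221, matching last_failed_interval in the log
print(interval > 900)    # True: grace_interval_threshold_secs exceeded
```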
Updated by Sridhar Seshasayee almost 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Sridhar Seshasayee almost 4 years ago
- Copied to Backport #46229: octopus: Ceph Monitor heartbeat grace period does not reset. added
Updated by Sridhar Seshasayee almost 4 years ago
- Copied to Backport #46228: nautilus: Ceph Monitor heartbeat grace period does not reset. added
Updated by Nathan Cutler almost 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".