Bug #45943: Ceph Monitor heartbeat grace period does not reset.

Added by Sridhar Seshasayee almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: nautilus,octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The heartbeat grace timer does not reset even after the cluster network has been stable for multiple days.
Implement a mechanism to reset it.

See https://bugzilla.redhat.com/show_bug.cgi?id=1814057 for more details.
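
The mechanism validated in the updates below clears an OSD's accumulated laggy state once it has gone long enough without a reported failure, so the heartbeat grace falls back to its base value. The following is a minimal standalone sketch of that idea, not the actual monitor code: the names grace_interval_threshold_exceeded, last_failed_interval, grace_interval_threshold_secs, laggy_probability and laggy_interval come from the test logs further down, while the OsdXInfo struct and the surrounding plumbing are assumed for illustration.

// Minimal sketch of the reset mechanism (illustrative, not the Ceph code):
// clear an OSD's laggy history once it has been failure-free for longer
// than a configurable threshold.
#include <chrono>
#include <iostream>

struct OsdXInfo {            // simplified stand-in for the monitor's per-OSD state ("xi" in the logs)
  double laggy_probability;  // accumulated laggy score; inflates the effective grace
  int    laggy_interval;
};

using Clock = std::chrono::steady_clock;

// True when the interval since the last recorded failure exceeds the threshold;
// compare the "grace_interval_threshold_exceeded last_failed_interval = 3221
// grace_interval_threshold_secs = 900" line in the test logs.
bool grace_interval_threshold_exceeded(Clock::time_point last_failed,
                                       Clock::time_point now,
                                       long threshold_secs) {
  auto last_failed_interval =
      std::chrono::duration_cast<std::chrono::seconds>(now - last_failed).count();
  return last_failed_interval > threshold_secs;
}

int main() {
  OsdXInfo xi{0.83193, 1};  // values of the same order as the first log block below
  auto last_failed = Clock::now() - std::chrono::seconds(3221);
  if (grace_interval_threshold_exceeded(last_failed, Clock::now(), 900)) {
    xi = OsdXInfo{0.0, 0};  // "reset laggy": both parameters drop back to zero
  }
  std::cout << "laggy_probability=" << xi.laggy_probability
            << " laggy_interval=" << xi.laggy_interval << "\n";
}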


Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #46229: octopus: Ceph Monitor heartbeat grace period does not reset. (Resolved, Sridhar Seshasayee)
Copied to RADOS - Backport #46228: nautilus: Ceph Monitor heartbeat grace period does not reset. (Resolved, Sridhar Seshasayee)
#1

Updated by Sridhar Seshasayee almost 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 35490
#2

Updated by Sridhar Seshasayee almost 4 years ago

Updates from testing the fix:

OSD failure before being marked down:


2020-06-22T10:16:28.046+0530 7fae6eb1e700 10 mon.a@0(leader).osd e77  osd.0 has 1 reporters, 20.747132 grace (20.000000 + 0.747132 + 0), max_failed_since 2020-06-22T10:16:06.045063+0530
2020-06-22T10:16:28.169+0530 7fae71323700 10 mon.a@0(leader).osd e77  osd.0 has 1 reporters, 20.747061 grace (20.000000 + 0.747061 + 0), max_failed_since 2020-06-22T10:16:06.045063+0530
2020-06-22T10:16:30.476+0530 7fae6eb1e700 10 mon.a@0(leader).osd e77  osd.0 has 2 reporters, 20.745981 grace (20.000000 + 0.745981 + 0), max_failed_since 2020-06-22T10:16:06.473724+0530
2020-06-22T10:16:30.476+0530 7fae6eb1e700  0 log_channel(cluster) log [INF] : osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 24.003151 >= grace 20.745981)
2020-06-22T10:16:30.479+0530 7fae6eb1e700 10 mon.a@0(leader).log v1243  logging 2020-06-22T10:16:30.477798+0530 mon.a (mon.0) 340 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 24.003151 >= grace 20.745981)
2020-06-22T10:16:31.353+0530 7fae71323700 10 mon.a@0(leader).osd e77 encode_pending pending_inc.modified: 2020-06-22T10:16:31.354203+0530

2020-06-22T10:16:31.359+0530 7fae71323700 10 mon.a@0(leader).osd e77 grace_interval_threshold_exceeded last_failed_interval = 28 grace_interval_threshold_secs = 900

2020-06-22T10:16:31.546+0530 7fae6d31b700  7 mon.a@0(leader).log v1244 update_from_paxos applying incremental log 1244 2020-06-22T10:16:30.477798+0530 mon.a (mon.0) 340 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 24.003151 >= grace 20.745981)
2020-06-22T10:16:32.385+0530 7fae71323700 10 mon.a@0(leader).osd e78 encode_pending pending_inc.modified: 2020-06-22T10:16:32.386958+0530

2020-06-22T10:16:32.906+0530 7fae6eb1e700 10 mon.a@0(leader).osd e79  laggy, now xi down_stamp 2020-06-22T10:16:31.354203+0530 laggy_probability 0.83193 laggy_interval 1 old_weight 0 last_purged_snaps_scrub 2020-06-22T09:36:26.557735+0530 dead_epoch 78
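
In the lines above, the effective grace is logged as a sum, e.g. "20.747132 grace (20.000000 + 0.747132 + 0)": the base grace of 20 s plus an adjustment built from the OSD's laggy history (the third term stays zero in these runs). The snippet below only models that shape; the exact weighting Ceph applies is not visible in these logs, so treat the formula as an illustrative assumption.

// Illustrative model (an assumption, not Ceph's exact formula) of the grace
// breakdown printed in the log above: base grace plus a contribution from
// the OSD's accumulated laggy history.
#include <cstdio>

double effective_grace(double base_grace,         // 20.000000 in the log
                       double laggy_probability,  // e.g. 0.83193
                       double laggy_interval) {   // e.g. 1
  return base_grace + laggy_probability * laggy_interval;
}

int main() {
  // Before the reset, the laggy history inflates the grace above the base...
  std::printf("inflated grace ~= %.6f s\n", effective_grace(20.0, 0.83193, 1.0));
  // ...after the reset (laggy_probability = 0, laggy_interval = 0) it falls
  // back to exactly the base, matching the "20.000000 grace (20.000000 + 0 + 0)"
  // lines later in this update.
  std::printf("reset grace    = %.6f s\n", effective_grace(20.0, 0.0, 0.0));
}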

----NO FAILURES FOR 900 secs (Threshold time was reduced to enable faster testing)-----

OSD Failures introduced again after 900 secs:


2020-06-22T11:10:07.794+0530 7fae6eb1e700 10 mon.a@0(leader).osd e81  osd.0 has 1 reporters, 20.817951 grace (20.000000 + 0.817952 + 0), max_failed_since 2020-06-22T11:09:45.793483+0530
2020-06-22T11:10:09.508+0530 7fae71323700 10 mon.a@0(leader).osd e81  osd.0 has 1 reporters, 20.816872 grace (20.000000 + 0.816873 + 0), max_failed_since 2020-06-22T11:09:45.793483+0530
2020-06-22T11:10:12.515+0530 7fae6eb1e700 10 mon.a@0(leader).osd e81  osd.0 has 2 reporters, 20.816063 grace (20.000000 + 0.816064 + 0), max_failed_since 2020-06-22T11:09:47.512356+0530
2020-06-22T11:10:12.515+0530 7fae6eb1e700  0 log_channel(cluster) log [INF] : osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 25.002302 >= grace 20.816063)
2020-06-22T11:10:12.518+0530 7fae6eb1e700 10 mon.a@0(leader).log v2801  logging 2020-06-22T11:10:12.516583+0530 mon.a (mon.0) 366 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 25.002302 >= grace 20.816063)
2020-06-22T11:10:12.763+0530 7fae71323700 10 mon.a@0(leader).osd e81 encode_pending pending_inc.modified: 2020-06-22T11:10:12.764331+0530

Grace threshold exceeded and laggy params are reset:

2020-06-22T11:10:12.776+0530 7fae71323700 10 mon.a@0(leader).osd e81 grace_interval_threshold_exceeded last_failed_interval = 3221 grace_interval_threshold_secs = 900
2020-06-22T11:10:12.776+0530 7fae71323700 20 mon.a@0(leader).osd e81  reset laggy, now xi down_stamp 2020-06-22T11:10:12.764331+0530 laggy_probability 0 laggy_interval 0 old_weight 0 last_purged_snaps_scrub 2020-06-22T09:36:26.557735+0530 dead_epoch 78
2020-06-22T11:10:12.794+0530 7fae6d31b700  7 mon.a@0(leader).log v2802 update_from_paxos applying incremental log 2802 2020-06-22T11:10:12.516583+0530 mon.a (mon.0) 366 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 25.002302 >= grace 20.816063)

2020-06-22T11:10:13.630+0530 7fae6eb1e700 10 mon.a@0(leader).osd e82  laggy, now xi down_stamp 2020-06-22T11:10:12.764331+0530 laggy_probability 0.3 laggy_interval 0 old_weight 0 last_purged_snaps_scrub 2020-06-22T09:36:26.557735+0530 dead_epoch 82
2020-06-22T11:10:13.864+0530 7fae71323700 10 mon.a@0(leader).osd e82 encode_pending pending_inc.modified: 2020-06-22T11:10:13.865337+0530

2020-06-22T11:10:35.469+0530 7fae6eb1e700 10 mon.a@0(leader).osd e84  osd.0 has 1 reporters, 20.000000 grace (20.000000 + 0 + 0), max_failed_since 2020-06-22T11:10:14.468282+0530
2020-06-22T11:10:36.015+0530 7fae6eb1e700 10 mon.a@0(leader).osd e84  osd.0 has 2 reporters, 20.000000 grace (20.000000 + 0 + 0), max_failed_since 2020-06-22T11:10:15.015638+0530
2020-06-22T11:10:36.015+0530 7fae6eb1e700  0 log_channel(cluster) log [INF] : osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 21.000461 >= grace 20.000000)
2020-06-22T11:10:36.016+0530 7fae6eb1e700 10 mon.a@0(leader).log v2813  logging 2020-06-22T11:10:36.016457+0530 mon.a (mon.0) 379 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 21.000461 >= grace 20.000000)
2020-06-22T11:10:36.627+0530 7fae71323700 10 mon.a@0(leader).osd e84 encode_pending pending_inc.modified: 2020-06-22T11:10:36.628141+0530

2020-06-22T11:10:36.694+0530 7fae6d31b700  7 mon.a@0(leader).log v2814 update_from_paxos applying incremental log 2814 2020-06-22T11:10:36.016457+0530 mon.a (mon.0) 379 : cluster [INF] osd.0 failed (root=default,host=localhost) (2 reporters from different osd after 21.000461 >= grace 20.000000)

2020-06-22T11:10:38.720+0530 7fae6eb1e700 10 mon.a@0(leader).osd e86  laggy, now xi down_stamp 2020-06-22T11:10:36.628141+0530 laggy_probability 0.51 laggy_interval 0 old_weight 0 last_purged_snaps_scrub 2020-06-22T09:36:26.557735+0530 dead_epoch 86
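
Reading the last block: after the reset the reporters see the base grace of exactly 20.000000 s (20.000000 + 0 + 0), and laggy_probability is rebuilt from zero (0.3, then 0.51) only as the newly injected failures are recorded, which indicates the reset took effect as intended.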

#3

Updated by Neha Ojha almost 4 years ago

  • Backport set to nautilus,octopus
#4

Updated by Sridhar Seshasayee almost 4 years ago

  • Status changed from Fix Under Review to Pending Backport
#5

Updated by Sridhar Seshasayee almost 4 years ago

  • Copied to Backport #46229: octopus: Ceph Monitor heartbeat grace period does not reset. added
#6

Updated by Sridhar Seshasayee almost 4 years ago

  • Copied to Backport #46228: nautilus: Ceph Monitor heartbeat grace period does not reset. added
#7

Updated by Nathan Cutler almost 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
