Bug #58049
mon:stretch-cluster: mishandled removed_ranks -> inconsistent peer_tracker leading to unable to form quorum
Status:
Pending Backport
Priority:
Urgent
Assignee:
Category:
Stretch Clusters
Target version:
-
% Done:
0%
Source:
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
First encountered in the downstream: https://bugzilla.redhat.com/show_bug.cgi?id=2142674
When monitors in a stretch cluster fail over many times, Ceph sometimes becomes
unresponsive because the monitors cannot form a quorum.
We investigated and concluded that the cause is mishandling of `removed_ranks` in the MonMap. This
leaves peer_tracker inconsistent, which in turn deadlocks the monitors' election state, so they
cannot form a quorum and Ceph becomes unresponsive.
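For illustration only, here is a toy model of how rank-indexed state can drift out of sync with the monitor map. It is not Ceph's code: the `remove_rank` helper, the dict-based tracker, and the scores are hypothetical, and replaying a removal is just one assumed way such mishandling could show up.

```python
def remove_rank(peer_scores, removed):
    """Drop `removed` and shift every higher rank down by one."""
    shifted = {}
    for rank, score in peer_scores.items():
        if rank == removed:
            continue
        shifted[rank - 1 if rank > removed else rank] = score
    return shifted


# Three monitors, ranks 0..2, each with a made-up connectivity score.
tracker = {0: 0.9, 1: 0.8, 2: 0.7}

# Rank 1 leaves the map; the old rank 2 becomes the new rank 1.
once = remove_rank(tracker, 1)

# Hypothetical mishandling: the same removal record is applied again,
# which deletes the monitor that now legitimately holds rank 1.
twice = remove_rank(once, 1)

print(once)   # {0: 0.9, 1: 0.7}
print(twice)  # {0: 0.9}
```

In this sketch two monitors remain in the map, but the tracker that replayed the removal only knows about one of them. Peers whose trackers disagree in this way would score each other differently during an election.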
Related issues
History
#1 Updated by Radoslaw Zarzynski 2 months ago
- Pull request ID set to 48991
#2 Updated by Kamoltat (Junior) Sirivadhna about 2 months ago
- Related to Bug #58107: mon-stretch: old stretch_marked_down_mons leads to ceph unresponsive added
#3 Updated by Radoslaw Zarzynski 23 days ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to pacific,quincy
#4 Updated by Backport Bot 23 days ago
- Copied to Backport #58380: pacific: mon:stretch-cluster: mishandled removed_ranks -> inconsistent peer_tracker leading to unable to form quorum added
#5 Updated by Backport Bot 23 days ago
- Copied to Backport #58381: quincy: mon:stretch-cluster: mishandled removed_ranks -> inconsistent peer_tracker leading to unable to form quorum added
#6 Updated by Backport Bot 23 days ago
- Tags set to backport_processed