Bug #58107
closed
mon-stretch: old stretch_marked_down_mons leads to ceph unresponsive
Description
How to reproduce the issue
Set up:
mon.a (zone 1) rank=0
mon.b (zone 1) rank=1
mon.c (zone 2) rank=2
mon.d (zone 2) rank=3
mon.e (arbiter) rank=4
Stretch-mode cluster with 2 zones, 4 mons (2 per zone, plus the arbiter mon.e), and 4 OSDs (2 per zone).
Shut down zone 2 and wait until the cluster enters degraded stretch mode.
Start zone 2.
Immediately shut down zone 1.
Result:
Ceph becomes unresponsive.
Explanation:
e0 quorum = {a, b, c, d, e} stretch_marked_down_mons = {} disallowed_leader {e}
e1 quorum = {a, b, e} stretch_marked_down_mons = {c, d} disallowed_leader {e}
mon.c starts back up, probes mon.b, and gets map e1 (stretch_marked_down_mons = {c, d})
mon.d starts back up, probes mon.b, and gets map e1 (stretch_marked_down_mons = {c, d})
Both then go into Monitor::set_elector_disallowed_leaders(), which sets elector.disallowed_leaders = {c, d, e}
Within the same monmap epoch, we shut down zone 1
e1 quorum = {c, d, e} stretch_marked_down_mons = {c, d} disallowed_leader {e}
During the election, every remaining monitor (c, d, e) is a disallowed leader, so no one will ever win. The only way out of this state is to start zone 1 back up.
The only way to clear monmap::stretch_marked_down_mons is through Monitor::trigger_healthy_stretch_mode(), and only the leader can execute that function. Since the monitors are stuck in an election when this happens, there is no chance of ever reaching trigger_healthy_stretch_mode().
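The deadlock can be illustrated with a small set-based model. This is only a sketch of the reasoning above, not Ceph code: it assumes the elector's disallowed set is the union of the monmap's disallowed leaders and stretch_marked_down_mons (as the {c, d, e} result in the explanation suggests), and the function names are hypothetical.

```python
def disallowed_leaders(monmap_disallowed, stretch_marked_down_mons):
    # Assumption: mirrors the effect described for
    # Monitor::set_elector_disallowed_leaders() as a plain set union.
    return set(monmap_disallowed) | set(stretch_marked_down_mons)


def eligible_leaders(alive_mons, monmap_disallowed, stretch_marked_down_mons):
    # Monitors that are alive and not barred from leading an election.
    barred = disallowed_leaders(monmap_disallowed, stretch_marked_down_mons)
    return set(alive_mons) - barred


# State taken from map e1 in the report: the marked-down set is stale.
monmap_disallowed = {"e"}
stale_marked_down = {"c", "d"}

# Zone 2 is back up and zone 1 is still running: a or b can lead.
zone1_up = eligible_leaders({"a", "b", "c", "d", "e"},
                            monmap_disallowed, stale_marked_down)

# Zone 1 shut down before the stale set was cleared: nobody can lead.
zone1_down = eligible_leaders({"c", "d", "e"},
                              monmap_disallowed, stale_marked_down)

print(sorted(zone1_up))    # ['a', 'b']
print(sorted(zone1_down))  # []
```

With zone 1 down, the eligible set is empty, which matches the observed behaviour: the election cannot complete, so no leader exists to clear the stale stretch_marked_down_mons.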