Bug #55695

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

*Problem:*

mon.a
mon.b
mon.c
mon.d
mon.e

ceph -a stop mon.d
ceph mon remove d

.
.

mon.d is down but was not actually removed, and the monmap still did not get updated.

*Explanation:*

We shut down mon.d. This blocks mon.a in Paxos::begin (assuming mon.a's Paxos is updating), because it sends the begin message (MMonPaxos::OP_BEGIN) to all monitors including mon.d: mon.get_quorum() is not yet updated and still contains mon.d. Paxos will not proceed to the commit() phase, since we never get a reply message (MMonPaxos::OP_ACCEPT) from mon.d. The lease of one of the monitors will eventually expire, and that monitor will call for an election. If the monmap proposal for ("mon remove", "name": "d") comes in before the election happens, it gets queued to pending_finisher and is eventually discarded once the election has started. As a result, the remove command does not take effect.
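
The stall described above can be sketched with a toy model. This is only an illustration of the condition as stated in this report (the commit phase needs a reply from every member of the quorum); @begin_round@, @stale_quorum@ and @alive@ are hypothetical names, not Ceph code.

```python
# Toy model only: these names are illustrative, not Ceph's real classes.

def begin_round(quorum, alive):
    """Send a begin to every quorum member; return (accepted, can_commit)."""
    # Only monitors that are actually up can answer with an accept.
    accepted = {mon for mon in quorum if mon in alive}
    # Per the description above, the commit phase needs every quorum member.
    return accepted, accepted == set(quorum)

stale_quorum = {"a", "b", "c", "d", "e"}  # quorum not yet updated, still has d
alive = stale_quorum - {"d"}              # mon.d was shut down

accepted, can_commit = begin_round(stale_quorum, alive)
print(sorted(accepted), can_commit)
```

With the stale quorum, mon.d never answers, so the round cannot move on to commit until something (here, an election) refreshes the quorum.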

However, if the election finishes before the monmap proposal arrives, we will be fine: mon.get_quorum() will have been updated, so we will not send MMonPaxos::OP_BEGIN to mon.d, will not be blocked, and will continue to the commit phase. This problem is therefore non-deterministic.
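
The ordering dependence can be illustrated with a small simulation. It is a deliberately simplified sketch, assuming that a proposal arriving while the quorum is stale gets queued and that an election both drops queued proposals and refreshes the quorum; the event names and the @simulate@ helper are made up for this example and do not correspond to Ceph functions.

```python
# Toy model only: event names and data structures are illustrative assumptions.

MONS = {"a", "b", "c", "d", "e"}

def simulate(events):
    """Replay events in order and return the resulting monmap membership."""
    monmap = set(MONS)
    alive = MONS - {"d"}      # mon.d has already been stopped
    quorum = set(MONS)        # stale until an election refreshes it
    pending = []              # stands in for pending_finisher

    for event in events:
        if event == "remove_d":
            if quorum <= alive:
                # Every quorum member can accept, so the proposal commits.
                monmap.discard("d")
            else:
                # A round is stuck waiting on mon.d; the proposal is queued.
                pending.append(event)
        elif event == "election":
            # The election drops queued proposals and refreshes the quorum.
            pending.clear()
            quorum = quorum & alive
    return monmap

lost = simulate(["remove_d", "election"])     # proposal before election
applied = simulate(["election", "remove_d"])  # election before proposal
print("d" in lost, "d" in applied)
```

In the first ordering mon.d stays in the monmap; in the second it is removed, which matches the two outcomes described in this report.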
