Bug #55695

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

*Problem:* 

 mon.a 
 mon.b 
 mon.c  
 mon.d 
 mon.e 


 ceph -a stop mon.d  
 ceph mon remove d 

 . 
 . 

 mon.d is down, but it was not actually removed and the monmap was not updated. 

 *Explanation:* 

 We shut down mon.d, which blocks Paxos::begin on mon.a (assuming Paxos on mon.a is in the middle of an update). As the leader, mon.a sends a begin message (MMonPaxos::OP_BEGIN) to all peer monitors, including mon.d, because mon.get_quorum() has not been updated yet and still contains mon.d. Paxos will not proceed to the commit() phase because it never receives the reply message (MMonPaxos::OP_ACCEPT) from mon.d. The lease of one of the monitors will eventually expire, and that monitor will call for an election. If the monmap proposal for ("mon remove", "name": "d") comes in before the election happens, it gets queued to pending_finisher and is eventually discarded once the election has started, resulting in the remove command not taking effect. 

 However, if the election finishes before the monmap proposal arrives, we are fine, because mon.get_quorum() gets updated and we no longer send MMonPaxos::OP_BEGIN to mon.d. Paxos is therefore not blocked and continues to the commit phase. This is why the problem is non-deterministic.
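
 The two orderings can be illustrated with a small toy model. This is only a sketch of the reasoning in this comment, not Ceph code: ToyLeader, begin, propose_remove, election and run are hypothetical names, the in-progress update is assumed to be a single stalled round, and real Paxos state, leases and timers are left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLeader:
    """Hypothetical stand-in for the leader (mon.a); not Ceph's Paxos."""
    quorum: set            # leader's current view of the quorum
    alive: set             # monitors that can actually reply
    monmap: set            # monitors listed in the monmap
    stalled: bool = False  # True while a round waits for a missing accept
    pending: list = field(default_factory=list)  # queued proposals

    def begin(self):
        # A round needs an accept from every quorum member, so it stalls
        # as soon as the quorum view contains a monitor that is down.
        self.stalled = not self.quorum.issubset(self.alive)

    def propose_remove(self, name):
        if not self.stalled:
            self.begin()
        if self.stalled:
            # Assumption: a proposal arriving during a stalled round is queued.
            self.pending.append(name)
            return "queued"
        self.monmap.discard(name)
        return "committed"

    def election(self):
        # An election rebuilds the quorum from live monitors and
        # discards whatever was still queued.
        self.quorum = self.quorum & self.alive
        self.pending.clear()
        self.stalled = False


def run(order):
    """Replay 'remove' and 'election' events in the given order."""
    mons = {"a", "b", "c", "d", "e"}
    leader = ToyLeader(quorum=set(mons), alive=mons - {"d"}, monmap=set(mons))
    leader.begin()  # the update already in progress when mon.d went down
    result = None
    for step in order:
        if step == "remove":
            result = leader.propose_remove("d")
        elif step == "election":
            leader.election()
    return result, sorted(leader.monmap)


if __name__ == "__main__":
    print(run(["remove", "election"]))  # proposal queued, then discarded
    print(run(["election", "remove"]))  # quorum refreshed first, so it commits
```

 In this model the only difference between the two runs is the order of the events, which matches the race described above: the remove sticks only when the election has already dropped mon.d from the quorum view.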
