Bug #5471
Closed
mon: do not join a quorum if quorum's version is lower than ours
0%
Description
With p being the monitor's Paxos version, consider:
- A - p:100 (at time quorum was formed)
- B - p:100 (at time quorum was formed)
- C - p:200 (!quorum)
- C starts; probes
If C's paxos [fc,lc] overlaps with A/B's paxos [fc,lc], then there will be no sync and C joins the quorum.
During recovery, say that we have A (p:130) and C (p:200). C will then share its state from [131,200] with A, but A never shares its state from [100,130] with C. The monitors are now inconsistent, and updates from [100,200] have been lost!
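To make the scenario above concrete, here is a minimal Python sketch of the overlap check and the one-directional recovery. It is a simplified model, not the monitor's actual code: the `MonState` class, the helper names, and the first-committed value of 1 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MonState:
    """Hypothetical model of a monitor's paxos range [fc, lc]."""
    name: str
    first_committed: int
    last_committed: int


def needs_full_sync(joiner: MonState, peer: MonState) -> bool:
    """True only when the two [fc, lc] ranges do not overlap at all."""
    return (joiner.last_committed < peer.first_committed
            or peer.last_committed < joiner.first_committed)


def versions_shared(src: MonState, dst: MonState) -> range:
    """Versions src sends to dst during recovery: only those above dst's lc."""
    return range(dst.last_committed + 1, src.last_committed + 1)


# Assumed fc of 1 for both monitors; the report only gives the lc values.
a = MonState("A", first_committed=1, last_committed=130)
c = MonState("C", first_committed=1, last_committed=200)

# The ranges overlap, so no sync is triggered and C is allowed to join.
print(needs_full_sync(c, a))

# C sends 131..200 to A, but A sends nothing to C, so whatever A committed
# in the versions both monitors already hold is never reconciled.
print(list(versions_shared(c, a))[:3], len(versions_shared(a, c)))
```

The point of the sketch is that recovery only compares version numbers; it has no way to tell that A and C hold different contents for the same version numbers.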
Reproducible by:
- quorum: A, B, C
- ceph tell mon.b sync force (mimics a mkfs to some extent)
- stop A
- for i in `seq 1 100`; do ceph log $i ; done
- stop all mons
- restart B
- restart A
- B syncs from A
- quorum: A, B
- restart C
It is B's failure that ends up being responsible for contaminating the cluster state. By losing B's state, bringing B back up with a clean slate after user intervention, and allowing it to form a quorum with an out-of-date monitor (A), the user is allowing the cluster to pick up from a considerably out-of-date state. This could easily be avoided by bringing C up first and letting B sync from C instead.
It is thus fair to assume that the monitors themselves are not responsible for the issues resulting from the lost versions. This case is quite specific: it involves a monitor with a clean slate forming a quorum with an out-of-date monitor. That shouldn't be something that just happens, which leads us to conclude that the user should be aware of what they are doing.
Therefore, all we can/should do is guarantee that C doesn't join the quorum if it notices that the cluster has a formed quorum whose version is lower than the one C currently holds. This still doesn't avoid the issues that may arise from letting C join this same quorum at a later point in time, when the quorum's version is higher than whatever version C holds. To handle that, we would need to associate additional metadata with the paxos versions to assess at which point in time a given version was proposed (the election epoch, for instance; this raises an issue with a cluster that has a lower election epoch eventually raising it to the same value as C's, but that is less likely to happen).
Updated by Joao Eduardo Luis almost 11 years ago
I have a simple patch for this that compares the quorum's version to our own paxos version and forces us to suicide if the quorum's version is lower.
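The comparison described here can be sketched as follows. This is only an illustration of the idea in Python, not the patch itself; the function name and parameters are hypothetical.

```python
def should_join_quorum(own_version: int, quorum_version: int) -> bool:
    """Hypothetical guard: refuse to join a quorum that is behind us.

    Returns False when the formed quorum's paxos version is lower than
    our own, which is the case where the monitor would give up
    ("suicide") instead of joining.
    """
    return quorum_version >= own_version


# C holds p:200 while the quorum (A, B) is at p:130: C must not join.
print(should_join_quorum(own_version=200, quorum_version=130))

# Known limitation from the description: once the quorum advances past
# C's version, the check passes even though the histories diverged.
print(should_join_quorum(own_version=200, quorum_version=250))
```

As noted in the description, a plain version comparison cannot detect the second case; that would need extra metadata such as the election epoch.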