Bug #9301
paxos: off by one w/ versions in forming quorum
Status: closed
% Done: 0%
Source: Q/A
Backport: giant,firefly
Severity: 3 - minor
Description
we are stuck in an election loop and seeing:
2014-08-30 22:12:15.636434 7fdda3080700 10 mon.f@0(leader).paxos(paxos recovering c 738..752) handle_last paxos(last lc 737 fc 716 pn 1099600 opn 0) v3
2014-08-30 22:12:15.636439 7fdda3080700 10 mon.f@0(leader).paxos(paxos recovering c 738..752) store_state nothing to commit
2014-08-30 22:12:15.636470 7fdda3080700 5 mon.f@0(leader).paxos(paxos recovering c 738..752) handle_last peon 5 last_committed (737) is too low for our first_committed (738) -- bootstrap!
either we should be tolerating this case (there is no gap between 737 and 738!) or the bootstrap check is wrong.
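To make the off-by-one concrete, here is a minimal sketch of the two ways the comparison could be written. The function names are hypothetical and this is not the actual Paxos::handle_last code; the strict form is only assumed to match what the log message reports.

```cpp
#include <cstdint>

using version_t = std::uint64_t;

// Hypothetical strict check, assumed to correspond to the logged
// "last_committed (737) is too low for our first_committed (738)".
bool needs_bootstrap_strict(version_t peer_last_committed,
                            version_t our_first_committed) {
  return peer_last_committed < our_first_committed;
}

// Hypothetical tolerant check: the peer is only too far behind if at
// least one version is missing between its last_committed and our
// first_committed.
bool needs_bootstrap_contiguous(version_t peer_last_committed,
                                version_t our_first_committed) {
  return peer_last_committed + 1 < our_first_committed;
}
```

With the values from the log (737 and 738), the strict form asks for a bootstrap while the contiguous form does not, which is the disagreement described above.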
handle_probe_reply seems to have the right check:
if (paxos->get_version() < m->paxos_first_version && m->paxos_first_version > 1) { // no need to sync if we're 0 and they start at 1.
...but mon.c (mon.5) isn't running this code because it is just doing elections and not probing.
i don't think we want to re-bootstrap if we can avoid it, so i'm not sure what we should be doing here. Maybe:
1) the leader, when it sees this condition, sends a message telling the peer to rebootstrap
2) the peon could do this same check in handle_probe_probe and, if it sees it is behind, re-bootstrap.
i'm thinking #2?
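A rough sketch of what option 2 might look like, assuming the peon reuses the handle_probe_reply condition quoted above when a probe arrives. PeonSketch, is_behind, on_probe and the bootstrapped flag are all hypothetical stand-ins, not Ceph's Monitor class or its real bootstrap path.

```cpp
#include <cstdint>

using version_t = std::uint64_t;

// Hypothetical stand-in for the peon's state; not Ceph's Monitor class.
struct PeonSketch {
  version_t paxos_version = 0;  // stands in for paxos->get_version()
  bool bootstrapped = false;    // placeholder for a real re-bootstrap

  // Same condition as the handle_probe_reply check quoted in the
  // description: no need to sync if we're 0 and they start at 1.
  bool is_behind(version_t peer_first_version) const {
    return paxos_version < peer_first_version && peer_first_version > 1;
  }

  // Option 2: run the check on the peon when a probe comes in.
  void on_probe(version_t peer_first_version) {
    if (is_behind(peer_first_version)) {
      bootstrapped = true;
    }
  }
};
```

Note that with this condition a peon at version 737 probed by a peer whose first version is 738 would still re-bootstrap, so whether that case should count as behind remains the open question from the description.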
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-08-30_20:39:25-rados-wip-sage-testing-testing-basic-multi/462283