Bug #9301

closed

paxos: off by one w/ versions in forming quorum

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: giant,firefly
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

we are stuck in an election loop and seeing

2014-08-30 22:12:15.636434 7fdda3080700 10 mon.f@0(leader).paxos(paxos recovering c 738..752) handle_last paxos(last lc 737 fc 716 pn 1099600 opn 0) v3
2014-08-30 22:12:15.636439 7fdda3080700 10 mon.f@0(leader).paxos(paxos recovering c 738..752) store_state nothing to commit
2014-08-30 22:12:15.636470 7fdda3080700  5 mon.f@0(leader).paxos(paxos recovering c 738..752) handle_last peon 5 last_committed (737) is too low for our first_committed (738) -- bootstrap!

either we should be tolerating this case (there is no gap between 737 and 738!) or the bootstrap check is wrong.

handle_probe_reply seems to have the right check:

    if (paxos->get_version() < m->paxos_first_version &&
        m->paxos_first_version > 1) {  // no need to sync if we're 0 and they start at 1.

...but mon.c (mon.5) isn't running this code because it is just doing elections and not probing.
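
To make the off-by-one concrete, here is a minimal sketch of the comparisons involved. It is not the monitor's actual code: `version_t` is a local alias, and `too_low_strict`, `has_gap`, `needs_sync`, and `check_logged_case` are hypothetical helper names. `too_low_strict` is a comparison consistent with the "too low" log message above, `has_gap` is one possible gap-tolerant alternative, and `needs_sync` restates the handle_probe_reply condition quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Local alias for this sketch only (assumption: versions are 64-bit counters).
using version_t = std::uint64_t;

// Strict comparison, consistent with the log message: the peon is flagged
// whenever its last_committed is below the leader's first_committed.
constexpr bool too_low_strict(version_t peon_last_committed,
                              version_t leader_first_committed) {
  return peon_last_committed < leader_first_committed;
}

// Gap-tolerant alternative: flag the peon only if at least one version lies
// strictly between its last_committed and the leader's first_committed.
constexpr bool has_gap(version_t peon_last_committed,
                       version_t leader_first_committed) {
  return peon_last_committed < leader_first_committed &&
         leader_first_committed - peon_last_committed > 1;
}

// Restatement of the handle_probe_reply condition quoted above.
constexpr bool needs_sync(version_t our_version, version_t peer_first_version) {
  return our_version < peer_first_version && peer_first_version > 1;
}

// Self-check with the values from the log: lc 737 against first_committed 738.
inline void check_logged_case() {
  assert(too_low_strict(737, 738));  // strict comparison fires
  assert(!has_gap(737, 738));        // but 737 and 738 are contiguous
}
```

With the logged values (peon last_committed 737, leader first_committed 738), the strict comparison fires even though `has_gap` finds no missing version between the two.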

i don't think we want to re-bootstrap if we can avoid it, so i'm not sure what we should be doing here. Maybe:

1) the leader, when it sees this condition, sends a message telling the peer to re-bootstrap
2) the peon could do this same check in handle_probe_probe and, if it sees it is behind, re-bootstrap.

i'm thinking option 2?

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-08-30_20:39:25-rados-wip-sage-testing-testing-basic-multi/462283

Actions #1

Updated by Sage Weil over 9 years ago

  • Status changed from 12 to 7
  • Assignee set to Sage Weil
Actions #2

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Pending Backport
  • Backport set to giant,firefly
Actions #3

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved

merged to firefly
