Bug #5698

closed

mon: paxos mishandles uncommitted values during collect/last phase

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source: Q/A
Backport: cuttlefish
Severity: 3 - minor

Description

ubuntu@teuthology:/a/teuthology-2013-07-19_20:00:19-rados-cuttlefish-testing-basic/73992

mon.b crashed with

    -1> 2013-07-19 20:07:55.289677 7f980cf0b700 15 mon.b@0(leader).osd e77 update_from_paxos paxos e 96, my e 77
     0> 2013-07-19 20:07:55.305732 7f980cf0b700 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f980cf0b700 time 2013-07-19 20:07:55.289704
mon/OSDMonitor.cc: 142: FAILED assert(err == 0)

 ceph version 0.61.5-1-ga0cb40b (a0cb40b45c4f2f921a63c2d7bb5a28572381d793)
 1: (OSDMonitor::update_from_paxos(bool*)+0x1861) [0x5119f1]
 2: (PaxosService::refresh(bool*)+0x19b) [0x4f922b]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x49a587]
 4: (Paxos::finish_proposal()+0x44) [0x4eeca4]
 5: (Paxos::handle_accept(MMonPaxos*)+0x785) [0x4efb65]
 6: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4f27eb]
 7: (Monitor::_ms_dispatch(Message*)+0x1029) [0x4c6529]
 8: (Monitor::ms_dispatch(Message*)+0x32) [0x4e1982]
 9: (DispatchQueue::entry()+0x3f1) [0x6b98f1]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x647a4d]
 11: (()+0x7e9a) [0x7f9811c6ee9a]
 12: (clone()+0x6d) [0x7f981021eccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

and osdmap / 78 is indeed missing from the store (copy in mon-b):

...
osdmap / 73
osdmap / 74
osdmap / 75
osdmap / 76
osdmap / 77
osdmap / 79
osdmap / 8
osdmap / 80
...
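A quick way to confirm a gap like this from a key listing is sketched below. It is only an illustration: `missing_epochs` is a hypothetical helper, not part of any Ceph tool, and the input is limited to the keys visible in the excerpt above (the elided parts of the listing are not guessed at). Note that the store lists keys in lexicographic order, which is why `osdmap / 8` sits between 79 and 80, so the epochs are parsed as integers before checking.

```python
def missing_epochs(keys, first, last):
    """Return the osdmap epochs in [first, last] that have no key in `keys`.

    `keys` are strings of the form "osdmap / <epoch>" as printed in the
    store listing. This helper is hypothetical and only for illustration.
    """
    present = {
        int(key.split("/")[1])  # int() tolerates the surrounding spaces
        for key in keys
        if key.startswith("osdmap /")
    }
    return [epoch for epoch in range(first, last + 1) if epoch not in present]


# Only the keys visible in the excerpt above; elided entries are left out.
visible_keys = [
    "osdmap / 73",
    "osdmap / 74",
    "osdmap / 75",
    "osdmap / 76",
    "osdmap / 77",
    "osdmap / 79",
    "osdmap / 8",
    "osdmap / 80",
]

print(missing_epochs(visible_keys, 73, 80))  # [78]
```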

Actions #1

Updated by Sage Weil almost 11 years ago

the uncommitted value learning is broken. on the leader:

2013-07-19 20:07:45.275946 7f496a56b700 10 mon.a@1(leader).paxos(paxos recovering c 1..261) learned uncommitted 262 (13712 bytes) from myself

but then
2013-07-19 20:07:45.368689 7f4969d6a700 10 mon.a@1(leader).paxos(paxos recovering c 1..261) we learned an uncommitted value for 200 pn 901 423 bytes

the pn is wrong, and it is an old(er) value.
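
For reference, the invariant the leader is expected to hold while collecting can be sketched roughly as follows. This is a simplified model of the general Paxos recovery rule, not the code in wip-paxos: the class and method names (`CollectState`, `offer`) are hypothetical, and the pn of 1000 used for version 262 is made up for the example, since the log above does not show it. The assumption is that an uncommitted value is only worth keeping if it is for version last_committed + 1, and that among competing candidates the one with the highest pn wins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Uncommitted:
    version: int
    pn: int
    value: bytes


class CollectState:
    """Hypothetical model of what a leader tracks during collect/last."""

    def __init__(self, last_committed: int):
        self.last_committed = last_committed
        self.uncommitted: Optional[Uncommitted] = None

    def offer(self, version: int, pn: int, value: bytes) -> bool:
        """Consider an uncommitted value reported by ourselves or a peer.

        Returns True if the value replaced the current candidate.
        """
        # Only the version right after last_committed can still be pending;
        # anything older has already been superseded by a committed value.
        if version != self.last_committed + 1:
            return False
        # Among candidates for that version, keep the highest proposal number.
        if self.uncommitted is not None and pn <= self.uncommitted.pn:
            return False
        self.uncommitted = Uncommitted(version, pn, value)
        return True


state = CollectState(last_committed=261)
print(state.offer(262, 1000, b"new"))  # True (pn 1000 is a made-up number)
print(state.offer(200, 901, b"old"))   # False: stale version, lower pn
print(state.uncommitted.version)       # 262
```

Under this model the second message in the log (version 200, pn 901) would be ignored rather than replacing the value already learned for 262.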
Actions #2

Updated by Sage Weil almost 11 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Greg Farnum
  • Priority changed from Urgent to Immediate

see wip-paxos

Actions #3

Updated by Sage Weil almost 11 years ago

manually verified this behaves with multiple uncommitted values with different pns using the failure injection points. we should build a teuthology test to do that, but i'm out of time for this morning

Actions #4

Updated by Sage Weil almost 11 years ago

  • Priority changed from Immediate to Urgent
Actions #5

Updated by Sage Weil almost 11 years ago

  • Subject changed from mon: missing inc osdmap on cuttlefish to mon: paxos mishandles uncommitted values during collect/last phase
  • Backport set to cuttlefish
Actions #6

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Sage Weil almost 11 years ago

  • Assignee deleted (Greg Farnum)
Actions #8

Updated by Sage Weil almost 11 years ago

  • Status changed from Pending Backport to Resolved