Bug #5698
mon: paxos mishandles uncommitted values during collect/last phase
Status: Resolved
Priority: Urgent
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: cuttlefish
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ubuntu@teuthology:/a/teuthology-2013-07-19_20:00:19-rados-cuttlefish-testing-basic/73992
mon.b crashed with
    -1> 2013-07-19 20:07:55.289677 7f980cf0b700 15 mon.b@0(leader).osd e77 update_from_paxos paxos e 96, my e 77
     0> 2013-07-19 20:07:55.305732 7f980cf0b700 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f980cf0b700 time 2013-07-19 20:07:55.289704
mon/OSDMonitor.cc: 142: FAILED assert(err == 0)
 ceph version 0.61.5-1-ga0cb40b (a0cb40b45c4f2f921a63c2d7bb5a28572381d793)
 1: (OSDMonitor::update_from_paxos(bool*)+0x1861) [0x5119f1]
 2: (PaxosService::refresh(bool*)+0x19b) [0x4f922b]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x49a587]
 4: (Paxos::finish_proposal()+0x44) [0x4eeca4]
 5: (Paxos::handle_accept(MMonPaxos*)+0x785) [0x4efb65]
 6: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4f27eb]
 7: (Monitor::_ms_dispatch(Message*)+0x1029) [0x4c6529]
 8: (Monitor::ms_dispatch(Message*)+0x32) [0x4e1982]
 9: (DispatchQueue::entry()+0x3f1) [0x6b98f1]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x647a4d]
 11: (()+0x7e9a) [0x7f9811c6ee9a]
 12: (clone()+0x6d) [0x7f981021eccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
and osdmap / 78 is indeed missing from the store (copy in mon-b):
...
osdmap / 73
osdmap / 74
osdmap / 75
osdmap / 76
osdmap / 77
osdmap / 79
osdmap / 8
osdmap / 80
...
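The store listing is sorted lexicographically (which is why `osdmap / 8` sits between 79 and 80), so a gap is easy to miss by eye. A minimal Python sketch for spotting one, assuming the keys are available as text in the `osdmap / N` form shown above. `missing_epochs` is a hypothetical helper for illustration, not a Ceph tool.

```python
import re
from typing import List

def missing_epochs(listing: str, first: int, last: int) -> List[int]:
    """Return epochs in [first, last] that have no 'osdmap / N' key in the listing."""
    present = {int(n) for n in re.findall(r"osdmap / (\d+)", listing)}
    return [e for e in range(first, last + 1) if e not in present]

# Fragment copied from the store listing in this report.
listing = (
    "osdmap / 73 osdmap / 74 osdmap / 75 osdmap / 76 "
    "osdmap / 77 osdmap / 79 osdmap / 8 osdmap / 80"
)
print(missing_epochs(listing, 73, 80))  # [78]
```

Parsing the epochs as integers avoids depending on the lexicographic key order of the store.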
Updated by Sage Weil almost 11 years ago
the uncommitted value learning is broken. on the leader:
2013-07-19 20:07:45.275946 7f496a56b700 10 mon.a@1(leader).paxos(paxos recovering c 1..261) learned uncommitted 262 (13712 bytes) from myself
but then
2013-07-19 20:07:45.368689 7f4969d6a700 10 mon.a@1(leader).paxos(paxos recovering c 1..261) we learned an uncommitted value for 200 pn 901 423 bytes
the pn is wrong, and it is an old(er) value.
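For context, the two log lines above conflict with the usual Paxos recovery rule: with commits 1..261 in place, the only uncommitted value worth learning is the one for version 262, and among competing offers the one accepted under the highest pn should win. The sketch below illustrates that rule in Python under those assumptions. It is not the code from wip-paxos, and `Uncommitted` and `pick_uncommitted` are hypothetical names. The versions 261, 262, 200 and pn 901 come from the logs; the other pn values and payloads are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Uncommitted:
    version: int   # paxos version the value was proposed for
    pn: int        # proposal number it was accepted under
    value: bytes

def pick_uncommitted(last_committed: int,
                     current: Optional[Uncommitted],
                     offered: Uncommitted) -> Optional[Uncommitted]:
    """Decide which uncommitted value to keep after seeing one more offer."""
    # Only the slot right after the last commit can still be pending.
    if offered.version != last_committed + 1:
        return current
    # Among valid offers, keep the one accepted under the highest pn.
    if current is not None and offered.pn <= current.pn:
        return current
    return offered

own = Uncommitted(version=262, pn=1000, value=b"new")    # pn 1000 is hypothetical
stale = Uncommitted(version=200, pn=901, value=b"old")   # version and pn from the log
kept = pick_uncommitted(261, own, stale)
print(kept.version)  # 262
```

Under this rule the offer for version 200 is ignored because it is not `last_committed + 1`, regardless of its pn.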
Updated by Sage Weil almost 11 years ago
- Status changed from New to Fix Under Review
- Assignee set to Greg Farnum
- Priority changed from Urgent to Immediate
see wip-paxos
Updated by Sage Weil almost 11 years ago
manually verified this behaves with multiple uncommitted values with different pns using the failure injection points. we should build a teuthology test to do that, but i'm out of time for this morning
Updated by Sage Weil almost 11 years ago
- Priority changed from Immediate to Urgent
Updated by Sage Weil almost 11 years ago
- Subject changed from mon: missing inc osdmap on cuttlefish to mon: paxos mishandles uncommitted values during collect/last phase
- Backport set to cuttlefish
Updated by Sage Weil almost 11 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Sage Weil almost 11 years ago
- Status changed from Pending Backport to Resolved