Bug #4162
mon: Single-Paxos: on sync, corrupted paxos store
Status: closed
% Done: 0%
Source: Development
Severity: 3 - minor
Description
We've been thrashing the monitors pretty hard, and in this case the assert was triggered as follows:
- mon.3 sent a 'sync_start' to mon.17
- mon.17 forwarded 'sync_start' to mon.1 (leader)
- mon.1 replied to mon.3 with 'sync_start_reply'
- mon.3 sent a 'sync_start_chunks' to mon.17
- mon.17 sent chunks to mon.3
The problem here is that mon.17 was itself still synchronizing, and thus didn't have a valid store state to serve chunks from.
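The failure mode can be illustrated with a small model. This is only a sketch: the `Mon` class, its `send_chunks` method, and the dictionary-based store are hypothetical stand-ins, not Ceph's actual classes, and the check in `reapply_all_versions` is loosely modeled on the `assert(bl.length())` seen in the backtrace.

```python
# Hypothetical model of the sync exchange; none of these names are Ceph APIs.
from dataclasses import dataclass, field


@dataclass
class Mon:
    name: str
    synchronizing: bool = False
    # version -> serialized value; a missing version models an incomplete store
    store: dict = field(default_factory=dict)

    def send_chunks(self, first: int, last: int) -> dict:
        """Serve the requested versions without checking our own sync state.

        Versions we do not hold yet come back as empty blobs.
        """
        return {v: self.store.get(v, b"") for v in range(first, last + 1)}


def reapply_all_versions(chunks: dict) -> None:
    """Requester side: every received version must carry a non-empty blob."""
    for version in sorted(chunks):
        assert len(chunks[version]) > 0, f"empty blob for version {version}"


# A provider with a complete store versus one that is itself mid-sync.
healthy = Mon("mon.1", store={v: b"x" for v in range(1, 6)})
mid_sync = Mon("mon.17", synchronizing=True, store={1: b"x", 2: b"x"})

reapply_all_versions(healthy.send_chunks(1, 5))  # passes silently
try:
    reapply_all_versions(mid_sync.send_chunks(1, 5))
except AssertionError as err:
    print("sync from mid-sync provider failed:", err)
```

In this model the requester only discovers the problem when it reapplies the versions, which matches the point at which the assert fires in the log.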
The solution can be one of two:
- the leader specifies to whom the requester should connect in order to sync
  - Upside: the leader can specify quorum members from which the monitors can sync, and may even try to balance the load across the quorum
  - Downside: the leader might get overloaded if everybody picks him
- the selected sync provider, if he himself is also mid-sync, forwards the request to his own sync provider
  - Upside: likelier balance of workload, distributed across the various sync providers
  - Downside: some monitors may get overloaded, while others don't
  - Downside: seems like a crude approach
(The first approach looks better, so we're going with it.)
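A minimal sketch of the first approach, where the leader names the sync provider in its reply. The `Leader` class, the `provider` field in the reply, and the round-robin policy are all assumptions made for illustration; the report does not describe how the real fix selects or balances providers.

```python
# Hypothetical sketch of leader-directed provider selection; not Ceph code.
class Leader:
    def __init__(self, name, quorum):
        self.name = name
        # Only quorum members are eligible providers. A monitor that is still
        # synchronizing (like mon.17 in this report) is not in this set.
        self.quorum = sorted(quorum)
        self._cursor = 0  # round-robin position, an assumed balancing policy

    def handle_sync_start(self, requester):
        """Answer a sync_start by naming the provider the requester must use."""
        candidates = [m for m in self.quorum if m != requester]
        if not candidates:
            raise RuntimeError("no quorum member available to act as provider")
        provider = candidates[self._cursor % len(candidates)]
        self._cursor += 1
        return {"type": "sync_start_reply", "provider": provider}


leader = Leader("mon.1", {"mon.1", "mon.2", "mon.4"})
for _ in range(3):
    print(leader.handle_sync_start("mon.3")["provider"])
# prints mon.1, mon.2, mon.4 (one per line)
```

Because the requester connects only to the monitor named in the reply, a forwarding monitor that is itself mid-sync never ends up serving chunks.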
2013-02-15 15:29:57.167126 7ffcba6dc700 10 mon.f@3(synchronizing sync( requester state stop )) e1 handle_sync mon_sync( finish_reply ) v1
2013-02-15 15:29:57.167136 7ffcba6dc700 10 mon.f@3(synchronizing sync( requester state stop )) e1 handle_sync_finish_reply mon_sync( finish_reply ) v1
2013-02-15 15:29:57.167206 7ffcba6dc700 10 mon.f@3(synchronizing).paxos(paxos recovering c 0..0) reapply_all_versions first 0 last 1724
2013-02-15 15:29:57.173908 7ffcba6dc700 -1 mon/Paxos.cc: In function 'void Paxos::apply_version(MonitorDBStore::Transaction&, version_t)' thread 7ffcba6dc700 time 2013-02-15 15:29:57.167260
mon/Paxos.cc: 58: FAILED assert(bl.length())
 ceph version 0.56-786-gbf8d1ed (bf8d1ed419738a9519ee413a6a81e9ca8f99da46)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x915879]
 2: (Paxos::apply_version(MonitorDBStore::Transaction&, unsigned long)+0xb4) [0x75c3d2]
 3: (Paxos::reapply_all_versions()+0x432) [0x75c85c]
 4: (Monitor::handle_sync_finish_reply(MMonSync*)+0x401) [0x6ff2b1]
 5: (Monitor::handle_sync(MMonSync*)+0x236) [0x6ffa96]
 6: (Monitor::_ms_dispatch(Message*)+0xf6d) [0x70c663]
 7: (Monitor::ms_dispatch(Message*)+0x38) [0x72433a]
 8: (Messenger::ms_deliver_dispatch(Message*)+0x9b) [0x97ae6d]
 9: (DispatchQueue::entry()+0x549) [0x97a619]
 10: (DispatchQueue::DispatchThread::entry()+0x1c) [0x900fee]
 11: (Thread::_entry_func(void*)+0x23) [0x908f2d]
 12: (()+0x7e9a) [0x7ffcbfa36e9a]
 13: (clone()+0x6d) [0x7ffcbe1ef4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.