Bug #4162

closed

mon: Single-Paxos: on sync, corrupted paxos store

Added by Joao Eduardo Luis about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Normal
Category: Monitor
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've been thrashing the monitors pretty hard, and in this case the assert was triggered as follows:

- mon.3 sent a 'sync_start' to mon.17
- mon.17 forwarded 'sync_start' to mon.1 (leader)
- mon.1 replied to mon.3 with 'sync_start_reply'
- mon.3 sent a 'sync_start_chunks' to mon.17
- mon.17 sent chunks to mon.3

The problem here is that mon.17 was itself still synchronizing, and thus did not have a valid store state to serve chunks from.
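The failure mode above can be illustrated with a toy model. This is a minimal Python sketch, not Ceph's C++ code: the `Mon` class, `send_chunks`, `sync_from`, and the dict-based store are all hypothetical names made up for illustration, and the version numbers are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Mon:
    """Hypothetical stand-in for a monitor; not the real Ceph Monitor class."""
    name: str
    synchronizing: bool = False                 # True while this monitor is mid-sync
    store: dict = field(default_factory=dict)   # toy store: version -> blob

    def send_chunks(self):
        # Mirrors the behaviour in the report: the provider serves whatever
        # it has, without checking whether it is itself still synchronizing.
        return dict(self.store)


def sync_from(requester, provider, first, last):
    """Copy the provider's store and return the versions that are missing or empty."""
    requester.store = provider.send_chunks()
    return [v for v in range(first, last + 1) if not requester.store.get(v)]


# mon.17 is mid-sync, so (in this toy model) it holds only part of the range.
mon17 = Mon("mon.17", synchronizing=True, store={1: b"v1", 2: b"v2"})
mon3 = Mon("mon.3", synchronizing=True)

missing = sync_from(mon3, mon17, first=1, last=4)
print(missing)
```

In this sketch the requester ends up with gaps in its store because the provider it was handed was incomplete, which is the situation the list above describes.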

There are two possible solutions:
  • the leader specifies which monitor the requester should connect to in order to sync
    - Upside: the leader can point requesters at quorum members to sync from, and may even try to balance the load across the quorum
    - Downside: the leader might get overloaded if everybody picks it
  • the selected sync provider, if it is itself mid-sync, forwards the request to its own sync provider
    - Upside: the workload is more likely to be balanced, distributed across the various sync providers
    - Downside: some monitors may get overloaded while others don't
    - Downside: seems like a crude approach (the first approach looks better, so we're going with it)

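
The first approach could look roughly like the following. This is a hedged Python sketch under stated assumptions, not the fix that went into Ceph: the `Leader` class, the `provider` field in the reply, the round-robin policy, and the monitor names other than mon.1 and mon.3 are all hypothetical.

```python
import itertools


class Leader:
    """Hypothetical leader that tells each requester whom to sync from."""

    def __init__(self, quorum):
        # Only quorum members are eligible: a monitor that is still
        # synchronizing is not in the quorum, so it is never handed out.
        self.quorum = list(quorum)
        self._round_robin = itertools.cycle(self.quorum)

    def handle_sync_start(self, requester):
        # Rotate through the quorum to spread the load, skipping the requester.
        for _ in range(len(self.quorum)):
            candidate = next(self._round_robin)
            if candidate != requester:
                # 'provider' is an assumed field, not a documented message member.
                return {"op": "sync_start_reply", "provider": candidate}
        raise RuntimeError("no eligible sync provider in quorum")


leader = Leader(["mon.1", "mon.2", "mon.5"])
providers = [leader.handle_sync_start("mon.3")["provider"] for _ in range(4)]
print(providers)
```

Because the leader only draws from the quorum, a monitor such as mon.17 that is itself mid-sync would never be chosen as a provider in this model.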
2013-02-15 15:29:57.167126 7ffcba6dc700 10 mon.f@3(synchronizing sync( requester state stop )) e1 handle_sync mon_sync( finish_reply ) v1
2013-02-15 15:29:57.167136 7ffcba6dc700 10 mon.f@3(synchronizing sync( requester state stop )) e1 handle_sync_finish_reply mon_sync( finish_reply ) v1
2013-02-15 15:29:57.167206 7ffcba6dc700 10 mon.f@3(synchronizing).paxos(paxos recovering c 0..0) reapply_all_versions first 0 last 1724
2013-02-15 15:29:57.173908 7ffcba6dc700 -1 mon/Paxos.cc: In function 'void Paxos::apply_version(MonitorDBStore::Transaction&, version_t)' thread 7ffcba6dc700 time 2013-02-15 15:29:57.167260
mon/Paxos.cc: 58: FAILED assert(bl.length())

 ceph version 0.56-786-gbf8d1ed (bf8d1ed419738a9519ee413a6a81e9ca8f99da46)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x915879]
 2: (Paxos::apply_version(MonitorDBStore::Transaction&, unsigned long)+0xb4) [0x75c3d2]
 3: (Paxos::reapply_all_versions()+0x432) [0x75c85c]
 4: (Monitor::handle_sync_finish_reply(MMonSync*)+0x401) [0x6ff2b1]
 5: (Monitor::handle_sync(MMonSync*)+0x236) [0x6ffa96]
 6: (Monitor::_ms_dispatch(Message*)+0xf6d) [0x70c663]
 7: (Monitor::ms_dispatch(Message*)+0x38) [0x72433a]
 8: (Messenger::ms_deliver_dispatch(Message*)+0x9b) [0x97ae6d]
 9: (DispatchQueue::entry()+0x549) [0x97a619]
 10: (DispatchQueue::DispatchThread::entry()+0x1c) [0x900fee]
 11: (Thread::_entry_func(void*)+0x23) [0x908f2d]
 12: (()+0x7e9a) [0x7ffcbfa36e9a]
 13: (clone()+0x6d) [0x7ffcbe1ef4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
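
For readers unfamiliar with the assert, the following toy Python model shows the kind of check the `FAILED assert(bl.length())` line suggests: each version in the range is read back and must be non-empty. It is an assumption-laden sketch, not the code in mon/Paxos.cc; the dict-based store and the function body are hypothetical, and only the function name is borrowed from the backtrace.

```python
def reapply_all_versions(store, first, last):
    """Toy model: walk the version range and refuse to apply an empty blob."""
    applied = []
    for version in range(first, last + 1):
        blob = store.get(version, b"")
        if len(blob) == 0:
            # Stand-in for the assert seen in the log above.
            raise AssertionError(f"version {version} has no data")
        applied.append(version)
    return applied


incomplete_store = {1: b"v1", 2: b"v2"}   # versions 3 and 4 never arrived
try:
    reapply_all_versions(incomplete_store, 1, 4)
    failed_at = None
except AssertionError as err:
    failed_at = str(err)
print(failed_at)
```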

Related issues 2 (0 open, 2 closed)

Related to Ceph - Feature #2611: mon: Single-Paxos | Resolved | Joao Eduardo Luis | 06/20/2012 | 07/09/2012

Has duplicate Ceph - Bug #4103: mon: Single-Paxos: on MonitorDBStore, segfault during sync | Duplicate | Joao Eduardo Luis | 02/12/2013
