Bug #4162

Updated by Joao Eduardo Luis about 11 years ago

We've been thrashing the monitors pretty hard, and in this case the assert was triggered as follows: 

 - mon.3 sent a 'sync_start' to mon.17 
 - mon.17 forwarded 'sync_start' to mon.1 (leader) 
 - mon.1 replied to mon.3 with 'sync_start_reply' 
 - mon.3 sent a 'sync_start_chunks' to mon.17 
 - mon.17 sent chunks to mon.3 

 The problem here is that mon.17 was itself still synchronizing, and therefore did not have a valid store state from which to serve chunks. 

 There are two possible solutions: 
 * the leader specifies to whom the requester should connect in order to sync 
  - Upside: the leader can specify quorum members from which the monitors can sync, and may even try to balance the load across the quorum 
  - Downside: the leader might get overloaded if everybody picks it 
 * the selected sync provider, if it is itself mid-sync, forwards the request to its own sync provider 
  - Upside: likelier balance of workload, distributed across the various sync providers 
  - Downside: some monitors may get overloaded while others are not 
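
 To make the two options concrete, here is a minimal, self-contained sketch. It is not Ceph code: @MonState@, @Cluster@, @resolve_provider@ and @leader_pick_provider@ are hypothetical names invented for illustration, and the "fewest active requesters" policy is only an assumption about how the leader might balance load. The first function models the second option (a mid-sync provider hands the request on to its own provider), the second models the first option (the leader chooses a quorum member).

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <set>

// Hypothetical model of a monitor's sync state (not a Ceph type).
struct MonState {
    bool synchronizing = false;        // true while this monitor is mid-sync
    std::optional<int> sync_provider;  // rank it is syncing from, if any
};

using Cluster = std::map<int, MonState>;  // keyed by monitor rank

// Option 2: starting at the selected provider, follow the chain of
// sync providers until reaching a monitor that is not synchronizing.
// Returns std::nullopt if the chain is broken, loops back on itself,
// or names a monitor that is not in the map.
std::optional<int> resolve_provider(const Cluster& mons, int candidate) {
    std::set<int> seen;
    int cur = candidate;
    while (true) {
        if (!seen.insert(cur).second) return std::nullopt;  // cycle detected
        auto it = mons.find(cur);
        if (it == mons.end()) return std::nullopt;          // unknown monitor
        if (!it->second.synchronizing) return cur;          // valid store state
        if (!it->second.sync_provider) return std::nullopt; // chain ends mid-sync
        cur = *it->second.sync_provider;
    }
}

// Option 1: the leader picks a quorum member for the requester.
// Assumed policy: choose the member currently serving the fewest requesters.
// quorum_load maps quorum member rank -> number of active sync requesters.
std::optional<int> leader_pick_provider(const std::map<int, int>& quorum_load) {
    std::optional<int> best;
    int best_load = 0;
    for (const auto& [rank, load] : quorum_load) {
        assert(load >= 0);  // a requester count cannot be negative
        if (!best || load < best_load) {
            best = rank;
            best_load = load;
        }
    }
    return best;
}
```

 In the scenario above, and assuming mon.17 was syncing from mon.1 (the ticket does not say), @resolve_provider@ called with 17 would yield 1 instead of letting mon.17 serve chunks from an incomplete store.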

 <pre> 
 2013-02-15 15:29:57.167126 7ffcba6dc700 10 mon.f@3(synchronizing sync( requester state stop )) e1 handle_sync mon_sync( finish_reply ) v1 
 2013-02-15 15:29:57.167136 7ffcba6dc700 10 mon.f@3(synchronizing sync( requester state stop )) e1 handle_sync_finish_reply mon_sync( finish_reply ) v1 
 2013-02-15 15:29:57.167206 7ffcba6dc700 10 mon.f@3(synchronizing).paxos(paxos recovering c 0..0) reapply_all_versions first 0 last 1724 
 2013-02-15 15:29:57.173908 7ffcba6dc700 -1 mon/Paxos.cc: In function 'void Paxos::apply_version(MonitorDBStore::Transaction&, version_t)' thread 7ffcba6dc700 time 2013-02-15 15:29:57.167260 
 mon/Paxos.cc: 58: FAILED assert(bl.length()) 

  ceph version 0.56-786-gbf8d1ed (bf8d1ed419738a9519ee413a6a81e9ca8f99da46) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x915879] 
  2: (Paxos::apply_version(MonitorDBStore::Transaction&, unsigned long)+0xb4) [0x75c3d2] 
  3: (Paxos::reapply_all_versions()+0x432) [0x75c85c] 
  4: (Monitor::handle_sync_finish_reply(MMonSync*)+0x401) [0x6ff2b1] 
  5: (Monitor::handle_sync(MMonSync*)+0x236) [0x6ffa96] 
  6: (Monitor::_ms_dispatch(Message*)+0xf6d) [0x70c663] 
  7: (Monitor::ms_dispatch(Message*)+0x38) [0x72433a] 
  8: (Messenger::ms_deliver_dispatch(Message*)+0x9b) [0x97ae6d] 
  9: (DispatchQueue::entry()+0x549) [0x97a619] 
  10: (DispatchQueue::DispatchThread::entry()+0x1c) [0x900fee] 
  11: (Thread::_entry_func(void*)+0x23) [0x908f2d] 
  12: (()+0x7e9a) [0x7ffcbfa36e9a] 
  13: (clone()+0x6d) [0x7ffcbe1ef4bd] 
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
 </pre>
