Bug #4026
mon: Single-Paxos: abort on LogMonitor::update_from_paxos
Description
While running teuthology with 20+ monitors, the monitor workloadgen with 10 osds, and mon thrasher, we triggered the following behavior on a peon:
2013-02-05 09:53:14.196821 7f9d0e14a700 10 mon.r@15(peon).pg v655 send_pg_creates to 0 pgs
2013-02-05 09:53:14.196831 7f9d0e14a700 10 mon.r@15(peon).pg v655 update_logger
2013-02-05 09:53:14.196902 7f9d0e14a700 10 mon.r@15(peon).pg v655 update_logger
2013-02-05 09:53:14.197039 7f9d0e14a700 10 mon.r@15(peon).mds e25 e25: 1/1/1 up {0=a=up:active}
2013-02-05 09:53:14.197072 7f9d0e14a700 10 mon.r@15(peon).mds e25 update_logger
2013-02-05 09:53:14.197167 7f9d0e14a700 10 mon.r@15(peon).osd e37 update_logger
2013-02-05 09:53:14.197180 7f9d0e14a700 10 mon.r@15(peon).osd e37 kick_all_failures on 0 osds
2013-02-05 09:53:14.209074 7f9d0e14a700 -1 *** Caught signal (Aborted) **
 in thread 7f9d0e14a700
 ceph version 0.56-488-gda7502a (da7502a0f7326183a02bc45f1f36c9d6b19a6450)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x84075f]
 2: /tmp/cephtest/binary/usr/local/bin/ceph-mon() [0x83fec6]
 3: (()+0xfcb0) [0x7f9d12cabcb0]
 4: (gsignal()+0x35) [0x7f9d113a0445]
 5: (abort()+0x17b) [0x7f9d113a3bab]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9d11cee69d]
 7: (()+0xb5846) [0x7f9d11cec846]
 8: (()+0xb5873) [0x7f9d11cec873]
 9: (()+0xb596e) [0x7f9d11cec96e]
 10: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0xc8) [0x90e6c6]
 11: (void decode_raw<unsigned char>(unsigned char&, ceph::buffer::list::iterator&)+0x25) [0x7253e1]
 12: (decode(unsigned char&, ceph::buffer::list::iterator&)+0x23) [0x714f58]
 13: (LogMonitor::update_from_paxos()+0x44c) [0x7f23e6]
 14: (PaxosService::_active()+0x2b1) [0x768dbf]
 15: (PaxosService::C_Active::finish(int)+0x25) [0x76a4b9]
 16: (Context::complete(int)+0x2b) [0x7166fb]
 17: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x259) [0x71698f]
 18: (Paxos::handle_lease(MMonPaxos*)+0x7d3) [0x7604d9]
 19: (Paxos::dispatch(PaxosServiceMessage*)+0x337) [0x762f4b]
 20: (Monitor::_ms_dispatch(Message*)+0x138e) [0x708fec]
 21: (Monitor::ms_dispatch(Message*)+0x38) [0x71f598]
 22: (Messenger::ms_deliver_dispatch(Message*)+0x9b) [0x971a05]
 23: (DispatchQueue::entry()+0x549) [0x9711b1]
 24: (DispatchQueue::DispatchThread::entry()+0x1c) [0x8f86a4]
 25: (Thread::_entry_func(void*)+0x23) [0x9005a1]
 26: (()+0x7e9a) [0x7f9d12ca3e9a]
 27: (clone()+0x6d) [0x7f9d1145c4bd]
Files
Updated by Joao Eduardo Luis about 11 years ago
- File 4026.tar.bz2 added
Updated by Joao Eduardo Luis about 11 years ago
I haven't been able to reproduce this, nor have I found an obvious cause.
After inspecting the store and comparing the versions in it with those in the leader's store, nothing appeared to be wrong.
The abort appears to have happened while decoding the first byte of the bufferlist (corresponding to the log version? not sure), but we didn't have enough debug output in this function to pinpoint exactly where. gdb wasn't much help either: it complained about being unable to resolve the overloaded instance (maybe due to a lack of debug symbols on the gitbuilder build?).
Anyway, I've pushed a patch to make this function a bit more verbose and will be re-running this test on the off chance of reproducing the bug.
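For illustration only: frames 5 to 12 of the backtrace are consistent with an exception thrown from the iterator's copy() during a one-byte decode, which nothing caught, so the terminate handler called abort(). The sketch below mimics that failure mode with toy types. ToyIterator, decode and try_decode are hypothetical stand-ins written for this note, not Ceph code and not the patch mentioned above; in particular, the exception type and message are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for ceph::buffer::list::iterator (NOT Ceph code).
struct ToyIterator {
  const std::vector<unsigned char>& data;
  std::size_t pos;

  explicit ToyIterator(const std::vector<unsigned char>& d) : data(d), pos(0) {}

  // Copy len bytes out of the buffer; throw if fewer than len bytes remain.
  void copy(std::size_t len, char* dest) {
    assert(pos <= data.size());  // invariant: never past the end
    if (data.size() - pos < len)
      throw std::out_of_range("end of buffer");  // assumed exception type
    std::memcpy(dest, data.data() + pos, len);
    pos += len;
  }
};

// Decode a single byte, loosely mirroring the
// decode(unsigned char&, iterator&) frame in the backtrace.
void decode(unsigned char& v, ToyIterator& it) {
  it.copy(sizeof(v), reinterpret_cast<char*>(&v));
}

// Guarded variant: catch the failure and report it to the caller instead of
// letting the exception escape, reach std::terminate() and end in abort().
bool try_decode(unsigned char& v, ToyIterator& it, std::string& err) {
  try {
    decode(v, it);
    return true;
  } catch (const std::out_of_range& e) {
    err = e.what();
    return false;
  }
}
```

Calling decode() on a ToyIterator over an empty vector with no surrounding try block would terminate the process in the same way as the trace above; try_decode() returns false and fills err instead, which is the kind of spot where extra debug output could be logged.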
Updated by Joao Eduardo Luis about 11 years ago
- Status changed from New to In Progress
Updated by Joao Eduardo Luis about 11 years ago
- Status changed from In Progress to Resolved