Bug #293
closed
cmon crash during paxos update
Description
Today I experienced some crashes of my monitors and MDSes because my disks filled up with logs.
I had to restart my monitors and MDSes a few times, but then one of my monitors started crashing right after startup:
10.07.20_17:41:53.678773 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10495
10.07.20_17:41:53.678785 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10496
10.07.20_17:41:53.678797 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10497
10.07.20_17:41:53.678810 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10498
10.07.20_17:41:53.678822 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10499
10.07.20_17:41:53.678834 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10500
10.07.20_17:41:53.678883 7fa303a91710 store(/srv/ceph/mon0) set_int logm/last_committed = 10500
10.07.20_17:41:53.684828 7fa303a91710 store(/srv/ceph/mon0) reading at off 0 of 10845
10.07.20_17:41:53.684866 7fa303a91710 store(/srv/ceph/mon0) get_bl logm/latest = 10845 bytes
10.07.20_17:41:53.684878 7fa303a91710 mon0(leader).log v10500 update_from_paxos startup: loading summary e3141
10.07.20_17:41:53.684975 7fa303a91710 store(/srv/ceph/mon0) get_bl logm/3142 DNE
mon/LogMonitor.cc: In function 'virtual bool LogMonitor::update_from_paxos()':
mon/LogMonitor.cc:120: FAILED assert(success)
1: (PaxosService::_active()+0x36) [0x481bd6]
2: (finish_contexts(std::list<Context*, std::allocator<Context*> >&, int)+0x1b9) [0x47f179]
3: (Paxos::handle_last(MMonPaxos*)+0x3e9) [0x47df89]
4: (Paxos::dispatch(PaxosServiceMessage*)+0x203) [0x47e3c3]
5: (Monitor::_ms_dispatch(Message*)+0xb94) [0x46c064]
6: (Monitor::ms_dispatch(Message*)+0x57) [0x477e97]
7: (SimpleMessenger::dispatch_entry()+0x749) [0x450f69]
8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x44874c]
9: (Thread::_entry_func(void*)+0xa) [0x45b8ca]
10: (()+0x69ca) [0x7fa305d759ca]
11: (clone()+0x6d) [0x7fa304f956cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I've uploaded the core, binary and logs of mon0 and mon1 to logger.ceph.widodh.nl and placed them in /srv/ceph/issues/cmon_crash_update_from_paxos
It's mon0 that is crashing, but it seems to be due to some data that mon1 is sending, which is why I've also attached mon1's log.
mon0 can't be started anymore; every time I try, it crashes with the same error.
Updated by Sage Weil almost 14 years ago
- Status changed from New to Can't reproduce
Hmm, I fixed this by setting logm/last_committed to the actual last committed state (3141, I think). Without the logs from before, I'm not sure how it got to be wrong :(. I just fixed the log append, though... ios::ate does not work as advertised. :)
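The ios::ate remark can be illustrated with a small standalone sketch. This is not the actual Ceph change, which is not shown in this ticket. It assumes the log file was opened through a std::ofstream with ios::ate alone. In that case the effective mode is out|ate, which truncates the file on open and then seeks to the end of the now empty file, so earlier log lines are lost. Opening with ios::app is what actually appends. The helper names and file names below are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Read the whole file back so we can see what survived the second open.
static std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: write "first\n", reopen the file with reopen_mode,
// write "second\n", then return the final contents and delete the file.
static std::string write_twice(const std::string& path,
                               std::ios::openmode reopen_mode) {
    {
        std::ofstream out(path, std::ios::trunc);  // start from an empty file
        out << "first\n";
    }
    {
        // std::ofstream always ORs in ios::out, so ios::ate alone becomes
        // out|ate (truncate, then seek to end), while ios::app becomes
        // out|app (every write goes to the end, nothing is truncated).
        std::ofstream out(path, reopen_mode);
        assert(out.is_open());
        out << "second\n";
    }
    std::string contents = slurp(path);
    std::remove(path.c_str());
    return contents;
}
```

Calling write_twice("demo.log", std::ios::ate) returns only the second line, while write_twice("demo.log", std::ios::app) returns both lines in order.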