Bug #293

closed

cmon crash during paxos update

Added by Wido den Hollander almost 14 years ago. Updated over 13 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%


Description

Today I experienced some crashes of my monitors and MDSes because my disks filled up with logs.

I had to restart my monitors and MDSes a few times, but then one of my monitors started to crash right after startup:

10.07.20_17:41:53.678773 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10495
10.07.20_17:41:53.678785 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10496
10.07.20_17:41:53.678797 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10497
10.07.20_17:41:53.678810 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10498
10.07.20_17:41:53.678822 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10499
10.07.20_17:41:53.678834 7fa303a91710 store(/srv/ceph/mon0) exists_bl logm/10500
10.07.20_17:41:53.678883 7fa303a91710 store(/srv/ceph/mon0) set_int logm/last_committed = 10500
10.07.20_17:41:53.684828 7fa303a91710 store(/srv/ceph/mon0) reading at off 0 of 10845
10.07.20_17:41:53.684866 7fa303a91710 store(/srv/ceph/mon0) get_bl logm/latest = 10845 bytes
10.07.20_17:41:53.684878 7fa303a91710 mon0(leader).log v10500 update_from_paxos startup: loading summary e3141
10.07.20_17:41:53.684975 7fa303a91710 store(/srv/ceph/mon0) get_bl logm/3142 DNE
mon/LogMonitor.cc: In function 'virtual bool LogMonitor::update_from_paxos()':
mon/LogMonitor.cc:120: FAILED assert(success)
 1: (PaxosService::_active()+0x36) [0x481bd6]
 2: (finish_contexts(std::list<Context*, std::allocator<Context*> >&, int)+0x1b9) [0x47f179]
 3: (Paxos::handle_last(MMonPaxos*)+0x3e9) [0x47df89]
 4: (Paxos::dispatch(PaxosServiceMessage*)+0x203) [0x47e3c3]
 5: (Monitor::_ms_dispatch(Message*)+0xb94) [0x46c064]
 6: (Monitor::ms_dispatch(Message*)+0x57) [0x477e97]
 7: (SimpleMessenger::dispatch_entry()+0x749) [0x450f69]
 8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x44874c]
 9: (Thread::_entry_func(void*)+0xa) [0x45b8ca]
 10: (()+0x69ca) [0x7fa305d759ca]
 11: (clone()+0x6d) [0x7fa304f956cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I've uploaded the core, binary and logs of mon0 and mon1 to logger.ceph.widodh.nl and placed them in /srv/ceph/issues/cmon_crash_update_from_paxos

It's mon0 that is crashing, but it seems to be due to some data that mon1 is sending, which is why I've also attached mon1's log.

mon0 can't be started anymore; every time I try, it crashes with the same error.
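
For readers following the log: it shows `logm/last_committed` being set to 10500, the summary being loaded at e3141, and then `get_bl logm/3142 DNE` right before the assert. That pattern suggests the monitor expects every version between the summary and `last_committed` to be on disk. The C++ sketch below illustrates that kind of consistency check. It is not Ceph's code: `VersionSet` and `find_first_missing` are hypothetical names, and the set of versions stands in for the real monitor store.

```cpp
#include <cstdint>
#include <set>

// Hypothetical stand-in for the monitor store: the log versions present on disk.
using VersionSet = std::set<std::uint64_t>;

// Walk the versions after the loaded summary up to last_committed.
// Returns true and stores the first absent version in *missing if there is a gap;
// returns false if every version in (summary_version, last_committed] is present.
bool find_first_missing(const VersionSet& on_disk,
                        std::uint64_t summary_version,
                        std::uint64_t last_committed,
                        std::uint64_t* missing) {
    for (std::uint64_t v = summary_version + 1; v <= last_committed; ++v) {
        if (on_disk.count(v) == 0) {
            *missing = v;
            return true;
        }
    }
    return false;
}
```

With a summary at 3141, `last_committed` at 10500, and no version 3142 in the set, the function reports 3142 as the first gap, which lines up with the `logm/3142 DNE` line in the log above.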

Actions #1

Updated by Sage Weil almost 14 years ago

  • Status changed from New to Can't reproduce

Hmm, I fixed this by setting logm/last_committed to the actual last committed state (3141, I think). I'm not sure how it got to be wrong without the logs from before :(. I just fixed the log append, though... ios::ate does not work as advertised. :)
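
The comment does not say exactly what went wrong with `ios::ate`. One well-known pitfall, assumed here to be the one meant, is that opening a `std::ofstream` with `std::ios::out | std::ios::ate` still truncates the file, whereas `std::ios::app` preserves the existing contents and appends. The sketch below shows the difference; `slurp` and `write_twice` are illustrative helpers, not part of Ceph.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file into a string (helper for the demonstration).
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Write `first` to a fresh file, then reopen it with `reopen_mode`
// and write `second`. Returns the final file contents.
std::string write_twice(const std::string& path,
                        std::ios::openmode reopen_mode,
                        const std::string& first,
                        const std::string& second) {
    {
        std::ofstream out(path, std::ios::out | std::ios::trunc);
        out << first;
    }  // stream closed here, so `first` is flushed to disk
    {
        std::ofstream out(path, reopen_mode);
        out << second;
    }
    return slurp(path);
}
```

Calling `write_twice(path, std::ios::out | std::ios::ate, "old\n", "new\n")` leaves only `new\n` in the file, because `out` without `in` or `app` truncates on open and `ate` merely seeks to the end of the now-empty file. With `std::ios::out | std::ios::app` the result is `old\nnew\n`.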
