Project

General

Profile

Actions

Bug #1414

closed

cmon crash Paxos::handle_collect (this=0x22a9000, collect=0x22e56c0) at ../../src/mon/Paxos.cc:145

Added by Sam Lang over 12 years ago. Updated over 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Monitor
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I see this repeatedly in my current setup:

Core was generated by `/usr/ceph/bin/cmon i bravo -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0 0x00007fd9e6af0b3b in raise ()---Type <return> to continue, or q <return> to quit--

(gdb) bt
#0 0x00007fd9e6af0b3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00000000007b4c16 in reraise_fatal (signum=6) at ../../src/global/signal_handler.cc:59
#2 0x00000000007b4e34 in handle_fatal_signal (signum=6) at ../../src/global/signal_handler.cc:106
#3 <signal handler called>
#4 0x00007fd9e547ad05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007fd9e547eab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007fd9e5d316dd in _gnu_cxx::_verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fd9e5d2f926 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007fd9e5d2f953 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007fd9e5d2fa5e in _cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000007b48bb in ceph::
_ceph_assert_fail (assertion=0x7cf8b7 "bl.length() > 0", file=0x7cf7f8 "../../src/mon/Paxos.cc", line=145, func=0x7d0880 "void Paxos::handle_collect(MMonPaxos*)") at ../../src/common/assert.cc:70
#11 0x000000000064ed05 in Paxos::handle_collect (this=0x22a9000, collect=0x22e56c0) at ../../src/mon/Paxos.cc:145
#12 0x00000000006555b4 in Paxos::dispatch (this=0x22a9000, m=0x22e56c0) at ../../src/mon/Paxos.cc:820
#13 0x000000000062fd9d in Monitor::_ms_dispatch (this=0x22cfa00, m=0x22e56c0) at ../../src/mon/Monitor.cc:800
#14 0x000000000063726c in Monitor::ms_dispatch (this=0x22cfa00, m=0x22e56c0) at ../../src/mon/Monitor.h:204
#15 0x0000000000780622 in Messenger::ms_deliver_dispatch (this=0x22cf000, m=0x22e56c0) at ../../src/msg/Messenger.h:102
#16 0x000000000076a264 in SimpleMessenger::dispatch_entry (this=0x22cf000) at ../../src/msg/SimpleMessenger.cc:356
#17 0x000000000061c8ae in SimpleMessenger::DispatchThread::entry (this=0x22cf488) at ../../src/msg/SimpleMessenger.h:546
#18 0x00000000006f07d5 in Thread::_entry_func (arg=0x22cf488) at ../../src/common/Thread.cc:45
#19 0x00007fd9e6ae7d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007fd9e552d04d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#21 0x0000000000000000 in ?? ()

The last part of the log is:

2011-08-19 00:13:54.231567 7f32e639a760 log [INF] : mon.bravo calling new monitor election
2011-08-19 00:13:54.243782 7f32e6396700 -- 192.168.101.113:6789/0 >> 192.168.101.112:6789/0 pipe(0x2d47500 sd=9 pgs=0 cs=0 l=0).fault first fault
2011-08-19 00:13:54.243874 7f32e16b0700 -- 192.168.101.113:6789/0 >> 192.168.101.115:6789/0 pipe(0x2d47280 sd=10 pgs=0 cs=0 l=0).fault first fault
2011-08-19 00:13:54.282302 7f32e15af700 -- 192.168.101.113:6789/0 >> 192.168.101.112:6789/0 pipe(0x2d47780 sd=11 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
2011-08-19 00:13:54.397402 7f32e12ac700 -- 192.168.101.113:6789/0 >> 192.168.101.115:6789/0 pipe(0x2d47500 sd=12 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
../../src/mon/Paxos.cc: In function 'void Paxos::handle_collect(MMonPaxos*)', in thread '0x7f32e26b2700'
../../src/mon/Paxos.cc: 145: FAILED assert(bl.length() > 0)
ceph version (commit:)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x89) [0x7b3261]
2: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64e945]
3: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6551f4]
4: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62f9dd]
5: (Monitor::ms_dispatch(Message*)+0x3a) [0x636eac]
6: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x77f13a]
7: (SimpleMessenger::dispatch_entry()+0x810) [0x768d74]
8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c4ee]
9: (Thread::_entry_func(void*)+0x23) [0x6f03cd]
10: (()+0x6d8c) [0x7f32e5f70d8c]
11: (clone()+0x6d) [0x7f32e49b604d]
ceph version (commit:)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x89) [0x7b3261]
2: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64e945]
3: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6551f4]
4: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62f9dd]
5: (Monitor::ms_dispatch(Message*)+0x3a) [0x636eac]
6: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x77f13a]
7: (SimpleMessenger::dispatch_entry()+0x810) [0x768d74]
8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c4ee]
9: (Thread::_entry_func(void*)+0x23) [0x6f03cd]
10: (()+0x6d8c) [0x7f32e5f70d8c]
11: (clone()+0x6d) [0x7f32e49b604d]
  • Caught signal (Aborted) *
    in thread 0x7f32e26b2700
    ceph version (commit:)
    1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x799f05]
    2: /usr/ceph/bin/cmon() [0x7b38d3]
    3: (()+0xfc60) [0x7f32e5f79c60]
    4: (gsignal()+0x35) [0x7f32e4903d05]
    5: (abort()+0x186) [0x7f32e4907ab6]
    6: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f32e51ba6dd]
    7: (()+0xb9926) [0x7f32e51b8926]
    8: (()+0xb9953) [0x7f32e51b8953]
    9: (()+0xb9a5e) [0x7f32e51b8a5e]
    10: (ceph::__ceph_assert_fail(char const
    , char const*, int, char const*)+0x1f3) [0x7b33cb]
    11: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64e945]
    12: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6551f4]
    13: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62f9dd]
    14: (Monitor::ms_dispatch(Message*)+0x3a) [0x636eac]
    15: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x77f13a]
    16: (SimpleMessenger::dispatch_entry()+0x810) [0x768d74]
    17: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c4ee]
    18: (Thread::_entry_func(void*)+0x23) [0x6f03cd]
    19: (()+0x6d8c) [0x7f32e5f70d8c]
    20: (clone()+0x6d) [0x7f32e49b604d]
    2011-08-19 09:48:04.467717 7fd9e6f11760 ceph version .commit: . process: cmon. pid: 9320
    2011-08-19 09:48:04.468163 7fd9e6f11760 store(/data/mon.bravo) mount
    2011-08-19 09:48:04.470538 7fd9e6f11760 mon.bravo@1(starting) e1 init fsid 12b23328-79c8-09a0-6be5-d154d246232b
    2011-08-19 09:48:04.478382 7fd9e6f11760 log [INF] : mon.bravo calling new monitor election
    2011-08-19 09:48:04.489760 7fd9e2227700 -- 192.168.101.113:6789/0 >> 192.168.101.115:6789/0 pipe(0x22d0280 sd=10 pgs=0 cs=0 l=0).fault first fault
    2011-08-19 09:48:04.569612 7fd9e1f24700 -- 192.168.101.113:6789/0 >> 192.168.101.115:6789/0 pipe(0x22d0c80 sd=12 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
    2011-08-19 09:48:04.640694 7fd9e1e23700 -- 192.168.101.113:6789/0 >> 192.168.101.112:6789/0 pipe(0x22d0a00 sd=13 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 6
    ../../src/mon/Paxos.cc: In function 'void Paxos::handle_collect(MMonPaxos*)', in thread '0x7fd9e3229700'
    ../../src/mon/Paxos.cc: 145: FAILED assert(bl.length() > 0)
    ceph version (commit:)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x89) [0x7b4751]
    2: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64ed05]
    3: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6555b4]
    4: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62fd9d]
    5: (Monitor::ms_dispatch(Message*)+0x3a) [0x63726c]
    6: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x780622]
    7: (SimpleMessenger::dispatch_entry()+0x810) [0x76a264]
    8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c8ae]
    9: (Thread::_entry_func(void*)+0x23) [0x6f07d5]
    10: (()+0x6d8c) [0x7fd9e6ae7d8c]
    11: (clone()+0x6d) [0x7fd9e552d04d]
    ceph version (commit:)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x89) [0x7b4751]
    2: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64ed05]
    3: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6555b4]
    4: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62fd9d]
    5: (Monitor::ms_dispatch(Message*)+0x3a) [0x63726c]
    6: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x780622]
    7: (SimpleMessenger::dispatch_entry()+0x810) [0x76a264]
    8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c8ae]
    9: (Thread::_entry_func(void*)+0x23) [0x6f07d5]
    10: (()+0x6d8c) [0x7fd9e6ae7d8c]
    11: (clone()+0x6d) [0x7fd9e552d04d]
  • Caught signal (Aborted) *
    in thread 0x7fd9e3229700
    ceph version (commit:)
    1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x79b3ed]
    2: /usr/ceph/bin/cmon() [0x7b4dc3]
    3: (()+0xfc60) [0x7fd9e6af0c60]
    4: (gsignal()+0x35) [0x7fd9e547ad05]
    5: (abort()+0x186) [0x7fd9e547eab6]
    6: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7fd9e5d316dd]
    7: (()+0xb9926) [0x7fd9e5d2f926]
    8: (()+0xb9953) [0x7fd9e5d2f953]
    9: (()+0xb9a5e) [0x7fd9e5d2fa5e]
    10: (ceph::__ceph_assert_fail(char const
    , char const*, int, char const*)+0x1f3) [0x7b48bb]
    11: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64ed05]
    12: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6555b4]
    13: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62fd9d]
    14: (Monitor::ms_dispatch(Message*)+0x3a) [0x63726c]
    15: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x780622]
    16: (SimpleMessenger::dispatch_entry()+0x810) [0x76a264]
    17: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c8ae]
    18: (Thread::_entry_func(void*)+0x23) [0x6f07d5]
    19: (()+0x6d8c) [0x7fd9e6ae7d8c]
    20: (clone()+0x6d) [0x7fd9e552d04d]
Actions #1

Updated by Sam Lang over 12 years ago

With debug mon = 20

2011-08-19 10:15:24.033788 7f4b5fb05760 ceph version .commit: . process: cmon. pid: 14962
2011-08-19 10:15:24.033841 7f4b5fb05760 store(/data/mon.bravo) mount
2011-08-19 10:15:24.033872 7f4b5fb05760 store(/data/mon.bravo) reading at off 0 of 21
2011-08-19 10:15:24.033880 7f4b5fb05760 store(/data/mon.bravo) get_bl magic = 21 bytes
2011-08-19 10:15:24.033889 7f4b5fb05760 store(/data/mon.bravo) reading at off 0 of 75
2011-08-19 10:15:24.033893 7f4b5fb05760 store(/data/mon.bravo) get_bl feature_set = 75 bytes
2011-08-19 10:15:24.033907 7f4b5fb05760 store(/data/mon.bravo) reading at off 0 of 489
2011-08-19 10:15:24.033913 7f4b5fb05760 store(/data/mon.bravo) get_bl monmap/latest = 489 bytes
2011-08-19 10:15:24.035316 7f4b5fb05760 mon.bravo@1(starting) e1 init fsid 12b23328-79c8-09a0-6be5-d154d246232b
2011-08-19 10:15:24.035376 7f4b5fb05760 store(/data/mon.bravo) get_int pgmap/last_pn = 2201
2011-08-19 10:15:24.035398 7f4b5fb05760 store(/data/mon.bravo) get_int pgmap/accepted_pn = 3700
2011-08-19 10:15:24.035412 7f4b5fb05760 store(/data/mon.bravo) get_int pgmap/last_committed = 82684
2011-08-19 10:15:24.035425 7f4b5fb05760 store(/data/mon.bravo) get_int pgmap/first_committed = 82684
2011-08-19 10:15:24.035438 7f4b5fb05760 store(/data/mon.bravo) get_int mdsmap/last_pn = 2201
2011-08-19 10:15:24.035454 7f4b5fb05760 store(/data/mon.bravo) get_int mdsmap/accepted_pn = 3700
2011-08-19 10:15:24.035466 7f4b5fb05760 store(/data/mon.bravo) get_int mdsmap/last_committed = 362
2011-08-19 10:15:24.035482 7f4b5fb05760 store(/data/mon.bravo) get_int osdmap/last_pn = 2201
2011-08-19 10:15:24.035498 7f4b5fb05760 store(/data/mon.bravo) get_int osdmap/accepted_pn = 2700
2011-08-19 10:15:24.035511 7f4b5fb05760 store(/data/mon.bravo) get_int osdmap/last_committed = 7740
2011-08-19 10:15:24.035526 7f4b5fb05760 store(/data/mon.bravo) get_int logm/last_pn = 2201
2011-08-19 10:15:24.035538 7f4b5fb05760 store(/data/mon.bravo) get_int logm/accepted_pn = 2700
2011-08-19 10:15:24.035550 7f4b5fb05760 store(/data/mon.bravo) get_int logm/last_committed = 22598
2011-08-19 10:15:24.035562 7f4b5fb05760 store(/data/mon.bravo) get_int logm/first_committed = 22098
2011-08-19 10:15:24.035575 7f4b5fb05760 store(/data/mon.bravo) get_int monmap/last_pn = 2201
2011-08-19 10:15:24.035587 7f4b5fb05760 store(/data/mon.bravo) get_int monmap/accepted_pn = 2700
2011-08-19 10:15:24.035599 7f4b5fb05760 store(/data/mon.bravo) get_int monmap/last_committed = 1
2011-08-19 10:15:24.035615 7f4b5fb05760 store(/data/mon.bravo) get_int auth/last_pn = 2201
2011-08-19 10:15:24.035627 7f4b5fb05760 store(/data/mon.bravo) get_int auth/accepted_pn = 2700
2011-08-19 10:15:24.035639 7f4b5fb05760 store(/data/mon.bravo) get_int auth/last_committed = 243
2011-08-19 10:15:24.035668 7f4b5fb05760 mon.bravo@1(starting).auth v243 AuthMonitor::init() paxosv=243
2011-08-19 10:15:24.035718 7f4b5fb05760 store(/data/mon.bravo) reading at off 0 of 4787
2011-08-19 10:15:24.035733 7f4b5fb05760 store(/data/mon.bravo) get_bl auth/latest = 4787 bytes
2011-08-19 10:15:24.035745 7f4b5fb05760 mon.bravo@1(starting).auth v243 AuthMonitor::init() startup: loading summary e243
2011-08-19 10:15:24.036390 7f4b5fb05760 mon.bravo@1(starting).auth v243 AuthMonitor::update_from_paxos()
2011-08-19 10:15:24.036552 7f4b5fb05760 log [INF] : mon.bravo calling new monitor election
2011-08-19 10:15:24.036681 7f4b5fb05760 mon.bravo@1(starting) e1 call_election
2011-08-19 10:15:24.036692 7f4b5fb05760 mon.bravo@1(starting).elector(0) start -- can i be leader?
2011-08-19 10:15:24.036698 7f4b5fb05760 mon.bravo@1(starting).elector(0) bump_epoch 0 to 1
2011-08-19 10:15:24.036716 7f4b5fb05760 store(/data/mon.bravo) set_int election_epoch = 1
2011-08-19 10:15:24.048000 7f4b5fb05760 mon.bravo@1(starting) e1 starting_election 1
2011-08-19 10:15:24.048276 7f4b5be1d700 mon.bravo@1(starting) e1 have connection
2011-08-19 10:15:24.048339 7f4b5be1d700 mon.bravo@1(starting) e1 do not have session, making new one
2011-08-19 10:15:24.048392 7f4b5be1d700 mon.bravo@1(starting) e1 ms_dispatch new session MonSession: mon1 192.168.101.113:6789/0 is open for mon1 192.168.101.113:6789/0
2011-08-19 10:15:24.048400 7f4b5be1d700 mon.bravo@1(starting) e1 setting monitor caps on this connection
2011-08-19 10:15:24.048410 7f4b5be1d700 mon.bravo@1(starting) e1 caps allow
2011-08-19 10:15:24.048725 7f4b5fb01700 mon.bravo@1(starting) e1 ms_get_authorizer for mon
2011-08-19 10:15:24.048782 7f4b5ae1b700 mon.bravo@1(starting) e1 ms_get_authorizer for mon
2011-08-19 10:15:24.048934 7f4b5fb01700 mon.bravo@1(starting) e1 ms_get_authorizer for mon
2011-08-19 10:15:24.048977 7f4b5ae1b700 mon.bravo@1(starting) e1 ms_get_authorizer for mon
2011-08-19 10:15:24.056802 7f4b5be1d700 mon.bravo@1(starting) e1 have connection
2011-08-19 10:15:24.056823 7f4b5be1d700 mon.bravo@1(starting) e1 do not have session, making new one
2011-08-19 10:15:24.056853 7f4b5be1d700 mon.bravo@1(starting) e1 ms_dispatch new session MonSession: mon2 192.168.101.115:6789/0 is open for mon2 192.168.101.115:6789/0
2011-08-19 10:15:24.056861 7f4b5be1d700 mon.bravo@1(starting) e1 setting monitor caps on this connection
2011-08-19 10:15:24.056868 7f4b5be1d700 mon.bravo@1(starting) e1 caps allow *
2011-08-19 10:15:24.056905 7f4b5be1d700 mon.bravo@1(starting).elector(1) handle_propose from mon2
2011-08-19 10:15:24.056914 7f4b5be1d700 mon.bravo@1(starting).elector(1) bump_epoch 1 to 5
2011-08-19 10:15:24.056921 7f4b5be1d700 store(/data/mon.bravo) set_int election_epoch = 5
2011-08-19 10:15:24.071572 7f4b5be1d700 mon.bravo@1(starting) e1 call_election
2011-08-19 10:15:24.071589 7f4b5be1d700 mon.bravo@1(starting).elector(5) start -- can i be leader?
2011-08-19 10:15:24.071598 7f4b5be1d700 mon.bravo@1(starting) e1 starting_election 5
2011-08-19 10:15:24.071734 7f4b5be1d700 mon.bravo@1(starting) e1 have connection
2011-08-19 10:15:24.071751 7f4b5be1d700 mon.bravo@1(starting) e1 ms_dispatch existing session MonSession: mon2 192.168.101.115:6789/0 is openallow * for mon2 192.168.101.115:6789/0
2011-08-19 10:15:24.071757 7f4b5be1d700 mon.bravo@1(starting) e1 caps allow *
2011-08-19 10:15:24.071782 7f4b5be1d700 mon.bravo@1(starting).elector(5) handle_ack from mon2
2011-08-19 10:15:24.071792 7f4b5be1d700 mon.bravo@1(starting).elector(5) so far i have 1,2
2011-08-19 10:15:24.071807 7f4b5be1d700 mon.bravo@1(starting) e1 have connection
2011-08-19 10:15:24.071815 7f4b5be1d700 mon.bravo@1(starting) e1 do not have session, making new one
2011-08-19 10:15:24.071828 7f4b5be1d700 mon.bravo@1(starting) e1 ms_dispatch new session MonSession: mon0 192.168.101.112:6789/0 is open for mon0 192.168.101.112:6789/0
2011-08-19 10:15:24.071834 7f4b5be1d700 mon.bravo@1(starting) e1 setting monitor caps on this connection
2011-08-19 10:15:24.071841 7f4b5be1d700 mon.bravo@1(starting) e1 caps allow *
2011-08-19 10:15:24.071862 7f4b5be1d700 mon.bravo@1(starting).elector(5) handle_propose from mon0
2011-08-19 10:15:24.071870 7f4b5be1d700 mon.bravo@1(starting).elector(5) defer to 0
2011-08-19 10:15:24.083021 7f4b5be1d700 mon.bravo@1(starting) e1 have connection
2011-08-19 10:15:24.083048 7f4b5be1d700 mon.bravo@1(starting) e1 ms_dispatch existing session MonSession: mon0 192.168.101.112:6789/0 is openallow * for mon0 192.168.101.112:6789/0
2011-08-19 10:15:24.083058 7f4b5be1d700 mon.bravo@1(starting) e1 caps allow *
2011-08-19 10:15:24.083084 7f4b5be1d700 mon.bravo@1(starting).elector(5) handle_victory from mon0
2011-08-19 10:15:24.083093 7f4b5be1d700 mon.bravo@1(starting).elector(5) bump_epoch 5 to 6
2011-08-19 10:15:24.083100 7f4b5be1d700 store(/data/mon.bravo) set_int election_epoch = 6
2011-08-19 10:15:24.093916 7f4b5be1d700 mon.bravo@1(peon) e1 lose_election, epoch 6 leader is mon0 quorum is 0,1,2
2011-08-19 10:15:24.093942 7f4b5be1d700 mon.bravo@1(peon) e1 resend_routed_requests
2011-08-19 10:15:24.093964 7f4b5be1d700 mon.bravo@1(peon) e1 have connection
2011-08-19 10:15:24.093978 7f4b5be1d700 mon.bravo@1(peon) e1 ms_dispatch existing session MonSession: mon0 192.168.101.112:6789/0 is openallow * for mon0 192.168.101.112:6789/0
2011-08-19 10:15:24.093984 7f4b5be1d700 mon.bravo@1(peon) e1 caps allow *
2011-08-19 10:15:24.094004 7f4b5be1d700 store(/data/mon.bravo) set_int pgmap/accepted_pn = 3900
2011-08-19 10:15:24.106105 7f4b5be1d700 store(/data/mon.bravo) exists_bl pgmap/82685
2011-08-19 10:15:24.106153 7f4b5be1d700 mon.bravo@1(peon) e1 have connection
2011-08-19 10:15:24.106169 7f4b5be1d700 mon.bravo@1(peon) e1 ms_dispatch existing session MonSession: mon0 192.168.101.112:6789/0 is openallow * for mon0 192.168.101.112:6789/0
2011-08-19 10:15:24.106176 7f4b5be1d700 mon.bravo@1(peon) e1 caps allow *
2011-08-19 10:15:24.106189 7f4b5be1d700 store(/data/mon.bravo) set_int mdsmap/accepted_pn = 3900
2011-08-19 10:15:24.116928 7f4b5be1d700 store(/data/mon.bravo) exists_bl mdsmap/363
2011-08-19 10:15:24.116955 7f4b5be1d700 store(/data/mon.bravo) get_bl mdsmap/363 = 0 bytes
../../src/mon/Paxos.cc: In function 'void Paxos::handle_collect(MMonPaxos
)', in thread '0x7f4b5be1d700'
../../src/mon/Paxos.cc: 145: FAILED assert(bl.length() > 0)
ceph version (commit:)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x89) [0x7b4751]
2: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64ed05]
3: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6555b4]
4: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62fd9d]
5: (Monitor::ms_dispatch(Message*)+0x3a) [0x63726c]
6: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x780622]
7: (SimpleMessenger::dispatch_entry()+0x810) [0x76a264]
8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c8ae]
9: (Thread::_entry_func(void*)+0x23) [0x6f07d5]
10: (()+0x6d8c) [0x7f4b5f6dbd8c]
11: (clone()+0x6d) [0x7f4b5e12104d]
ceph version (commit:)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x89) [0x7b4751]
2: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64ed05]
3: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6555b4]
4: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62fd9d]
5: (Monitor::ms_dispatch(Message*)+0x3a) [0x63726c]
6: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x780622]
7: (SimpleMessenger::dispatch_entry()+0x810) [0x76a264]
8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c8ae]
9: (Thread::_entry_func(void*)+0x23) [0x6f07d5]
10: (()+0x6d8c) [0x7f4b5f6dbd8c]
11: (clone()+0x6d) [0x7f4b5e12104d]
  • Caught signal (Aborted) *
    in thread 0x7f4b5be1d700
    ceph version (commit:)
    1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x79b3ed]
    2: /usr/ceph/bin/cmon() [0x7b4dc3]
    3: (()+0xfc60) [0x7f4b5f6e4c60]
    4: (gsignal()+0x35) [0x7f4b5e06ed05]
    5: (abort()+0x186) [0x7f4b5e072ab6]
    6: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f4b5e9256dd]
    7: (()+0xb9926) [0x7f4b5e923926]
    8: (()+0xb9953) [0x7f4b5e923953]
    9: (()+0xb9a5e) [0x7f4b5e923a5e]
    10: (ceph::__ceph_assert_fail(char const
    , char const*, int, char const*)+0x1f3) [0x7b48bb]
    11: (Paxos::handle_collect(MMonPaxos*)+0x697) [0x64ed05]
    12: (Paxos::dispatch(PaxosServiceMessage*)+0x22c) [0x6555b4]
    13: (Monitor::_ms_dispatch(Message*)+0xd23) [0x62fd9d]
    14: (Monitor::ms_dispatch(Message*)+0x3a) [0x63726c]
    15: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0x780622]
    16: (SimpleMessenger::dispatch_entry()+0x810) [0x76a264]
    17: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x61c8ae]
    18: (Thread::_entry_func(void*)+0x23) [0x6f07d5]
    19: (()+0x6d8c) [0x7f4b5f6dbd8c]
    20: (clone()+0x6d) [0x7f4b5e12104d]
Actions #2

Updated by Sage Weil over 12 years ago

I think i see the problem. Almost everywhere we write to a .new file, fsync, and then rename over. Except on place, where we write a big set of files and then do sync(2). If we crash in the middle of that, we can get 0-length files with their final names.

The quick fix is to just remove that file (and anything else in that dir with a higher number, which are probably also 0 length) and restart cmon.

Actions #3

Updated by Sage Weil over 12 years ago

  • Category set to Monitor
  • Status changed from New to Resolved
  • Assignee set to Sage Weil
  • Target version set to v0.34
Actions

Also available in: Atom PDF