Bug #574
closedtimer: event cancellation apparently broken
0%
Description
Just saw this on latest unstable, f8b3271f45cc4a87e3f3f212d22e3d34ff13da44
The monitor schedules a propose timer event, and then in Paxos::election_starting() calls cancel_events(). A few moments later the event goes off and (rightly) fails it's sanity checks...
2010-11-11 15:30:58.401605 7f23445ce710 mon.a@0(leader).paxosservice(pgmap) setting propose timer with delay of 0.05 2010-11-11 15:30:58.408371 7f23445ce710 -- 10.0.1.252:6789/0 <== mon1 10.0.1.252:6790/0 2874 ==== election(propose 5) v1 ==== 481+0+0 (3459256886 0 0) 0x7f2338001cc0 2010-11-11 15:30:58.408406 7f23445ce710 mon.a@0(leader) e1 have connection 2010-11-11 15:30:58.408417 7f23445ce710 mon.a@0(leader) e1 ms_dispatch existing session MonSession: mon1 10.0.1.252:6790/0 is openallow * for mon1 10.0.1.252:6790/0 2010-11-11 15:30:58.408427 7f23445ce710 mon.a@0(leader) e1 caps allow * 2010-11-11 15:30:58.408472 7f23445ce710 mon.a@0(leader).elector(4) handle_propose from mon1 2010-11-11 15:30:58.408481 7f23445ce710 mon.a@0(leader).elector(4) bump_epoch 4 to 5 2010-11-11 15:30:58.408490 7f23445ce710 store(dev/mon.a) set_int election_epoch = 5 2010-11-11 15:30:58.559494 7f23445ce710 mon.a@0(leader) e1 call_election 2010-11-11 15:30:58.559515 7f23445ce710 mon.a@0(leader).elector(5) start -- can i be leader? 2010-11-11 15:30:58.559529 7f23445ce710 mon.a@0(leader) e1 starting_election 5 2010-11-11 15:30:58.559536 7f23445ce710 mon.a@0(starting).paxos(pgmap active lc 80) election_starting -- canceling timeouts 2010-11-11 15:30:58.559559 7f23445ce710 mon.a@0(starting).pg v80 _updated_stats for osd0 10.0.1.252:6802/28980 2010-11-11 15:30:58.559577 7f23445ce710 -- 10.0.1.252:6789/0 --> 10.0.1.252:6802/28980 -- pg_stats_ack(1 pgs) v1 -- ?+0 0x7f2338004920 2010-11-11 15:30:58.559621 7f23445ce710 mon.a@0(starting).paxos(mdsmap active lc 5) election_starting -- canceling timeouts 2010-11-11 15:30:58.559646 7f23445ce710 mon.a@0(starting).paxos(osdmap active lc 3) election_starting -- canceling timeouts 2010-11-11 15:30:58.559657 7f23445ce710 mon.a@0(starting).paxos(logm active lc 9) election_starting -- canceling timeouts 2010-11-11 15:30:58.559673 7f23445ce710 mon.a@0(starting).paxos(class active lc 2) election_starting -- canceling timeouts 2010-11-11 15:30:58.559685 7f23445ce710 mon.a@0(starting).paxos(monmap active lc 1) election_starting -- canceling timeouts 2010-11-11 15:30:58.559695 7f23445ce710 mon.a@0(starting).paxos(auth active lc 5) election_starting -- canceling timeouts 2010-11-11 15:30:58.559705 7f23445ce710 mon.a@0(starting).paxosservice(pgmap) election_starting 2010-11-11 15:30:58.559713 7f23445ce710 mon.a@0(starting).paxosservice(mdsmap) election_starting 2010-11-11 15:30:58.559721 7f23445ce710 mon.a@0(starting).paxosservice(osdmap) election_starting 2010-11-11 15:30:58.559727 7f23445ce710 mon.a@0(starting).paxosservice(logm) election_starting 2010-11-11 15:30:58.559732 7f23445ce710 mon.a@0(starting).paxosservice(class) election_starting 2010-11-11 15:30:58.559738 7f23445ce710 mon.a@0(starting).paxosservice(monmap) election_starting 2010-11-11 15:30:58.559743 7f23445ce710 mon.a@0(starting).paxosservice(auth) election_starting 2010-11-11 15:30:58.559771 7f23445ce710 -- 10.0.1.252:6789/0 --> mon1 10.0.1.252:6790/0 -- election(propose 5) v1 -- ?+0 0x7f2338003660 2010-11-11 15:30:58.559811 7f23445ce710 -- 10.0.1.252:6789/0 --> mon2 10.0.1.252:6791/0 -- election(propose 5) v1 -- ?+0 0x7f2338003860 2010-11-11 15:30:58.559886 7f2343dcd710 mon.a@0(starting).paxosservice(pgmap) propose_pending 2010-11-11 15:30:58.560511 7f23445ce710 -- 10.0.1.252:6789/0 <== mon1 10.0.1.252:6790/0 2875 ==== election(ack 5) v1 ==== 481+0+0 (581609027 0 0) 0x7f2338001210 mon/PaxosService.cc: In function 'void PaxosService::propose_pending()': mon/PaxosService.cc:128: FAILED assert(mon->is_leader() && paxos->is_active()) ceph version 0.24~rc (commit:f8b3271f45cc4a87e3f3f212d22e3d34ff13da44) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x39) [0x71ec7e] 2: (PaxosService::propose_pending()+0xcb) [0x621017] 3: (PaxosService::C_Propose::finish(int)+0x2f) [0x621635] 4: (TimerThread::entry()+0x397) [0x71cd71] 5: (Thread::_entry_func(void*)+0x23) [0x5dffbb] 6: (()+0x68ba) [0x7f2346af48ba] 7: (clone()+0x6d) [0x7f2345abb01d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. *** Caught signal (ABRT) *** ceph version 0.24~rc (commit:f8b3271f45cc4a87e3f3f212d22e3d34ff13da44) 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x71f085] 2: (sigabrt_handler(int)+0x49) [0x72fe05] 3: (()+0x321f0) [0x7f2345a1e1f0] 4: (gsignal()+0x35) [0x7f2345a1e175] 5: (abort()+0x180) [0x7f2345a20f80] 6: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f23462b1dc5] 7: (()+0xcb166) [0x7f23462b0166] 8: (()+0xcb193) [0x7f23462b0193] 9: (()+0xcb28e) [0x7f23462b028e] 10: (ceph::__ceph_assert_warn(char const*, char const*, int, char const*)+0) [0x71eeb2] 11: (PaxosService::propose_pending()+0xcb) [0x621017] 12: (PaxosService::C_Propose::finish(int)+0x2f) [0x621635] 13: (TimerThread::entry()+0x397) [0x71cd71] 14: (Thread::_entry_func(void*)+0x23) [0x5dffbb] 15: (()+0x68ba) [0x7f2346af48ba] 16: (clone()+0x6d) [0x7f2345abb01d]
Updated by Sage Weil over 13 years ago
- Status changed from New to Resolved
- Assignee changed from Colin McCabe to Sage Weil
- Estimated time set to 0:00 h
Looking into this, it appears that the problem was that the wrong lock was taken during cancel event. Or that the event cancellation was fundamentally in some other way. I couldn't make heads of tails of things, though, so I just reimplemented SafeTimer without Timer in about a quarter of the original code and it is so much simpler!
Updated by Colin McCabe over 13 years ago
cancel_event always relied on the caller to take the SafeTimer lock, and then goes on to take the Timer lock. So it's not a lack of locking, unless the caller didn't take the Monitor lock.
Anyway, it looks like the simplified timer code is working well now. Thanks for looking at this.