Project

General

Profile

Bug #574

timer: event cancellation apparently broken

Added by Sage Weil over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
11/11/2010
Due date:
% Done:

0%

Estimated time:
0.00 h
Spent time:
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Just saw this on latest unstable, f8b3271f45cc4a87e3f3f212d22e3d34ff13da44

The monitor schedules a propose timer event, and then in Paxos::election_starting() calls cancel_events(). A few moments later the event goes off and (rightly) fails it's sanity checks...

2010-11-11 15:30:58.401605 7f23445ce710 mon.a@0(leader).paxosservice(pgmap)  setting propose timer with delay of 0.05
2010-11-11 15:30:58.408371 7f23445ce710 -- 10.0.1.252:6789/0 <== mon1 10.0.1.252:6790/0 2874 ==== election(propose 5) v1 ==== 481+0+0 (3459256886 0 0) 0x7f2338001cc0
2010-11-11 15:30:58.408406 7f23445ce710 mon.a@0(leader) e1 have connection
2010-11-11 15:30:58.408417 7f23445ce710 mon.a@0(leader) e1 ms_dispatch existing session MonSession: mon1 10.0.1.252:6790/0 is openallow * for mon1 10.0.1.252:6790/0
2010-11-11 15:30:58.408427 7f23445ce710 mon.a@0(leader) e1  caps allow *
2010-11-11 15:30:58.408472 7f23445ce710 mon.a@0(leader).elector(4) handle_propose from mon1
2010-11-11 15:30:58.408481 7f23445ce710 mon.a@0(leader).elector(4) bump_epoch 4 to 5
2010-11-11 15:30:58.408490 7f23445ce710 store(dev/mon.a) set_int election_epoch = 5
2010-11-11 15:30:58.559494 7f23445ce710 mon.a@0(leader) e1 call_election
2010-11-11 15:30:58.559515 7f23445ce710 mon.a@0(leader).elector(5) start -- can i be leader?
2010-11-11 15:30:58.559529 7f23445ce710 mon.a@0(leader) e1 starting_election 5
2010-11-11 15:30:58.559536 7f23445ce710 mon.a@0(starting).paxos(pgmap active lc 80) election_starting -- canceling timeouts
2010-11-11 15:30:58.559559 7f23445ce710 mon.a@0(starting).pg v80 _updated_stats for osd0 10.0.1.252:6802/28980
2010-11-11 15:30:58.559577 7f23445ce710 -- 10.0.1.252:6789/0 --> 10.0.1.252:6802/28980 -- pg_stats_ack(1 pgs) v1 -- ?+0 0x7f2338004920
2010-11-11 15:30:58.559621 7f23445ce710 mon.a@0(starting).paxos(mdsmap active lc 5) election_starting -- canceling timeouts
2010-11-11 15:30:58.559646 7f23445ce710 mon.a@0(starting).paxos(osdmap active lc 3) election_starting -- canceling timeouts
2010-11-11 15:30:58.559657 7f23445ce710 mon.a@0(starting).paxos(logm active lc 9) election_starting -- canceling timeouts
2010-11-11 15:30:58.559673 7f23445ce710 mon.a@0(starting).paxos(class active lc 2) election_starting -- canceling timeouts
2010-11-11 15:30:58.559685 7f23445ce710 mon.a@0(starting).paxos(monmap active lc 1) election_starting -- canceling timeouts
2010-11-11 15:30:58.559695 7f23445ce710 mon.a@0(starting).paxos(auth active lc 5) election_starting -- canceling timeouts
2010-11-11 15:30:58.559705 7f23445ce710 mon.a@0(starting).paxosservice(pgmap) election_starting
2010-11-11 15:30:58.559713 7f23445ce710 mon.a@0(starting).paxosservice(mdsmap) election_starting
2010-11-11 15:30:58.559721 7f23445ce710 mon.a@0(starting).paxosservice(osdmap) election_starting
2010-11-11 15:30:58.559727 7f23445ce710 mon.a@0(starting).paxosservice(logm) election_starting
2010-11-11 15:30:58.559732 7f23445ce710 mon.a@0(starting).paxosservice(class) election_starting
2010-11-11 15:30:58.559738 7f23445ce710 mon.a@0(starting).paxosservice(monmap) election_starting
2010-11-11 15:30:58.559743 7f23445ce710 mon.a@0(starting).paxosservice(auth) election_starting
2010-11-11 15:30:58.559771 7f23445ce710 -- 10.0.1.252:6789/0 --> mon1 10.0.1.252:6790/0 -- election(propose 5) v1 -- ?+0 0x7f2338003660
2010-11-11 15:30:58.559811 7f23445ce710 -- 10.0.1.252:6789/0 --> mon2 10.0.1.252:6791/0 -- election(propose 5) v1 -- ?+0 0x7f2338003860
2010-11-11 15:30:58.559886 7f2343dcd710 mon.a@0(starting).paxosservice(pgmap) propose_pending
2010-11-11 15:30:58.560511 7f23445ce710 -- 10.0.1.252:6789/0 <== mon1 10.0.1.252:6790/0 2875 ==== election(ack 5) v1 ==== 481+0+0 (581609027 0 0) 0x7f2338001210
mon/PaxosService.cc: In function 'void PaxosService::propose_pending()':
mon/PaxosService.cc:128: FAILED assert(mon->is_leader() && paxos->is_active())
 ceph version 0.24~rc (commit:f8b3271f45cc4a87e3f3f212d22e3d34ff13da44)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x39) [0x71ec7e]
 2: (PaxosService::propose_pending()+0xcb) [0x621017]
 3: (PaxosService::C_Propose::finish(int)+0x2f) [0x621635]
 4: (TimerThread::entry()+0x397) [0x71cd71]
 5: (Thread::_entry_func(void*)+0x23) [0x5dffbb]
 6: (()+0x68ba) [0x7f2346af48ba]
 7: (clone()+0x6d) [0x7f2345abb01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:f8b3271f45cc4a87e3f3f212d22e3d34ff13da44)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x71f085]
 2: (sigabrt_handler(int)+0x49) [0x72fe05]
 3: (()+0x321f0) [0x7f2345a1e1f0]
 4: (gsignal()+0x35) [0x7f2345a1e175]
 5: (abort()+0x180) [0x7f2345a20f80]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f23462b1dc5]
 7: (()+0xcb166) [0x7f23462b0166]
 8: (()+0xcb193) [0x7f23462b0193]
 9: (()+0xcb28e) [0x7f23462b028e]
 10: (ceph::__ceph_assert_warn(char const*, char const*, int, char const*)+0) [0x71eeb2]
 11: (PaxosService::propose_pending()+0xcb) [0x621017]
 12: (PaxosService::C_Propose::finish(int)+0x2f) [0x621635]
 13: (TimerThread::entry()+0x397) [0x71cd71]
 14: (Thread::_entry_func(void*)+0x23) [0x5dffbb]
 15: (()+0x68ba) [0x7f2346af48ba]
 16: (clone()+0x6d) [0x7f2345abb01d]

History

#1 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved
  • Assignee changed from Colin McCabe to Sage Weil
  • Estimated time set to 0.00 h

Looking into this, it appears that the problem was that the wrong lock was taken during cancel event. Or that the event cancellation was fundamentally in some other way. I couldn't make heads of tails of things, though, so I just reimplemented SafeTimer without Timer in about a quarter of the original code and it is so much simpler!

ce6d63947fc9dc479376184e50683bdd9dd8cf81

#2 Updated by Colin McCabe over 8 years ago

cancel_event always relied on the caller to take the SafeTimer lock, and then goes on to take the Timer lock. So it's not a lack of locking, unless the caller didn't take the Monitor lock.

Anyway, it looks like the simplified timer code is working well now. Thanks for looking at this.

Also available in: Atom PDF