Bug #2103
osd: lockdep error on watch_lock
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ubuntu@teuthology:/a/nightly_coverage_2012-02-25-a/13773
------------------------------------
existing dependency OSD::map_lock (33) -> OSD::watch_lock (42) at:
ceph version 0.42.2-168-g266902a (commit:266902a993c8548cc3c32f41be6450ecd78c475b)
2012-02-25 03:20:51.004758 1: (ReplicatedPG::context_registry_on_change()+0x1a) [0x4fe53a]
2012-02-25 03:20:51.004804 2: (ReplicatedPG::on_change()+0xf8) [0x514278]
2012-02-25 03:20:51.004878 3: (PG::start_peering_interval(std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&)+0x5bd) [0x73eded]
2012-02-25 03:20:51.004893 4: (PG::RecoveryState::Reset::react(PG::RecoveryState::AdvMap const&)+0x2c7) [0x73f917]
2012-02-25 03:20:51.004926 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1db) [0x754dbb]
2012-02-25 03:20:51.004957 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x16b) [0x74e80b]
2012-02-25 03:20:51.004972 7: (PG::RecoveryState::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, PG::RecoveryCtx*)+0x36c) [0x70c07c]
2012-02-25 03:20:51.004985 8: (OSD::advance_map(ObjectStore::Transaction&)+0x23c0) [0x593160]
2012-02-25 03:20:51.004998 9: (OSD::handle_osd_map(MOSDMap*)+0x24a0) [0x5b2350]
2012-02-25 03:20:51.005009 10: (OSD::_dispatch(Message*)+0x30b) [0x5c146b]
2012-02-25 03:20:51.005021 11: (OSD::ms_dispatch(Message*)+0x1af) [0x5c1a3f]
2012-02-25 03:20:51.005033 12: (SimpleMessenger::dispatch_entry()+0x89a) [0x60e9aa]
2012-02-25 03:20:51.005045 13: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4f2a7c]
2012-02-25 03:20:51.005062 14: (()+0x7971) [0x7f283ea4d971]
2012-02-25 03:20:51.005073 15: (clone()+0x6d) [0x7f283d0d892d]
2012-02-25 03:20:51.005083
2012-02-25 03:20:51.005107 7f2838dd0700 new dependency OSD::watch_lock (42) -> OSD::map_lock (33) creates a cycle at
ceph version 0.42.2-168-g266902a (commit:266902a993c8548cc3c32f41be6450ecd78c475b)
2012-02-25 03:20:51.005120 1: (SafeTimer::timer_thread()+0x33b) [0x669abb]
2012-02-25 03:20:51.005131 2: (SafeTimerThread::entry()+0xd) [0x66c46d]
2012-02-25 03:20:51.005142 3: (()+0x7971) [0x7f283ea4d971]
2012-02-25 03:20:51.005153 4: (clone()+0x6d) [0x7f283d0d892d]
2012-02-25 03:20:51.005163
2012-02-25 03:20:51.005173 7f2838dd0700 btw, i am holding these locks:
2012-02-25 03:20:51.005184 7f2838dd0700 OSD::watch_lock (42)
2012-02-25 03:20:51.005194 7f2838dd0700 common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f2838dd0700 time 2012-02-25 03:20:51.005207
common/lockdep.cc: 201: FAILED assert(0)
ceph version 0.42.2-168-g266902a (commit:266902a993c8548cc3c32f41be6450ecd78c475b)
1: (lockdep_will_lock(char const*, int)+0xe1e) [0x5e2c0e]
2: (PG::lock(bool)+0x11a) [0x7060ea]
3: (OSD::handle_watch_timeout(void*, ReplicatedPG*, entity_name_t, utime_t)+0x2e) [0x579cee]
4: (SafeTimer::timer_thread()+0x33b) [0x669abb]
5: (SafeTimerThread::entry()+0xd) [0x66c46d]
6: (()+0x7971) [0x7f283ea4d971]
7: (clone()+0x6d) [0x7f283d0d892d]
ceph version 0.42.2-168-g266902a (commit:266902a993c8548cc3c32f41be6450ecd78c475b)
1: (lockdep_will_lock(char const*, int)+0xe1e) [0x5e2c0e]
2: (PG::lock(bool)+0x11a) [0x7060ea]
3: (OSD::handle_watch_timeout(void*, ReplicatedPG*, entity_name_t, utime_t)+0x2e) [0x579cee]
4: (SafeTimer::timer_thread()+0x33b) [0x669abb]
5: (SafeTimerThread::entry()+0xd) [0x66c46d]
6: (()+0x7971) [0x7f283ea4d971]
7: (clone()+0x6d) [0x7f283d0d892d]
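The report above is a lock-order inversion: one path recorded map_lock -> watch_lock, and the timer thread then tried to add the reverse edge. The following is a minimal sketch of how such a check can work in principle. It is not Ceph's lockdep code; the class LockOrderGraph and its will_lock method are hypothetical names chosen for illustration, and locks are identified by plain strings rather than numeric ids.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified lock-order checker (not Ceph's lockdep).
class LockOrderGraph {
  // follows_[a] holds every lock that was acquired while a was held.
  std::map<std::string, std::set<std::string>> follows_;

  // Depth-first search: is `to` reachable from `from` via recorded edges?
  bool reachable(const std::string& from, const std::string& to,
                 std::set<std::string>& seen) const {
    if (from == to) return true;
    if (!seen.insert(from).second) return false;  // already visited
    auto it = follows_.find(from);
    if (it == follows_.end()) return false;
    for (const auto& next : it->second)
      if (reachable(next, to, seen)) return true;
    return false;
  }

public:
  // Call before taking `wanted` while the locks in `held` are held.
  // Returns false if the new edge held -> wanted would close a cycle;
  // otherwise records the edges and returns true.
  bool will_lock(const std::vector<std::string>& held,
                 const std::string& wanted) {
    for (const auto& h : held) {
      std::set<std::string> seen;
      if (reachable(wanted, h, seen)) return false;  // wanted -> ... -> h exists
    }
    for (const auto& h : held) follows_[h].insert(wanted);
    return true;
  }
};
```

Recording map_lock -> watch_lock first and then asking for map_lock while holding watch_lock makes will_lock return false, which corresponds to the "creates a cycle" message in the log. A checker like this flags the inversion even when the two threads never actually deadlock during the run.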
Associated revisions
osd: fix watch_lock vs map_lock ordering
watch_lock is inside map_lock (and pg->lock), which means we need to
drop it to take pg->lock here. That means verifying in
handle_watch_timeout that we haven't raced with another thread canceling
the timeout event, which would be indicated by
- the entity not appearing in unconnected_watchers
- the entity having a different (presumably newer) expire time
Fixes: #2103
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
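The approach described in the commit message (drop the inner lock, take the outer one, retake the inner lock, then re-verify) can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: WatchState, its members, and the signature of handle_watch_timeout here are hypothetical stand-ins, expire times are plain integers, and the callback is assumed to be entered with watch_lock held by the timer thread.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-ins for the state named in the commit message.
struct WatchState {
  std::mutex pg_lock;     // outer lock (stands in for pg->lock)
  std::mutex watch_lock;  // inner lock; must not be held while taking pg_lock
  std::map<std::string, long> unconnected_watchers;  // entity -> expire time
};

// Timer callback sketch. Assumption: the caller holds `watch` (watch_lock)
// on entry and expects to hold it again on return.
// Returns true if the timeout was still valid and was handled.
bool handle_watch_timeout(WatchState& s, std::unique_lock<std::mutex>& watch,
                          const std::string& entity, long expire) {
  assert(watch.owns_lock());
  watch.unlock();                             // drop the inner lock
  std::lock_guard<std::mutex> pg(s.pg_lock);  // take the outer lock first
  watch.lock();                               // retake the inner lock

  // State may have changed while watch_lock was dropped, so re-verify.
  auto it = s.unconnected_watchers.find(entity);
  if (it == s.unconnected_watchers.end())
    return false;  // another thread canceled the timeout
  if (it->second != expire)
    return false;  // the timeout was rescheduled with a different expire time

  s.unconnected_watchers.erase(it);  // still valid: act on the timeout
  return true;
}
```

The two early returns correspond to the two race indicators listed in the commit message. Without them, the callback could act on a watcher that reconnected or was rescheduled during the window in which watch_lock was released.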
History
#1 Updated by Sage Weil over 11 years ago
- Priority changed from High to Normal
#2 Updated by Sage Weil over 11 years ago
- Status changed from New to 12
must reenable this in qa suite when it's fixed!
#3 Updated by Sage Weil over 11 years ago
- Target version changed from v0.43 to v0.44
#4 Updated by Sage Weil over 11 years ago
- Priority changed from Normal to High
#5 Updated by Sage Weil over 11 years ago
- Status changed from 12 to In Progress
- Assignee set to Sage Weil
#6 Updated by Sage Weil over 11 years ago
- Status changed from In Progress to Fix Under Review
#7 Updated by Sage Weil over 11 years ago
- Status changed from Fix Under Review to Resolved