Bug #2103

osd: lockdep error on watch_lock

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status: Resolved
Priority: High
Assignee:
Category: OSD
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/nightly_coverage_2012-02-25-a/13773

------------------------------------
existing dependency OSD::map_lock (33) -> OSD::watch_lock (42) at:
 ceph version 0.42.2-168-g266902a (commit:266902a993c8548cc3c32f41be6450ecd78c475b)
2012-02-25 03:20:51.004758 1: (ReplicatedPG::context_registry_on_change()+0x1a) [0x4fe53a]
2012-02-25 03:20:51.004804 2: (ReplicatedPG::on_change()+0xf8) [0x514278]
2012-02-25 03:20:51.004878 3: (PG::start_peering_interval(std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&)+0x5bd) [0x73eded]
2012-02-25 03:20:51.004893 4: (PG::RecoveryState::Reset::react(PG::RecoveryState::AdvMap const&)+0x2c7) [0x73f917]
2012-02-25 03:20:51.004926 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1db) [0x754dbb]
2012-02-25 03:20:51.004957 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x16b) [0x74e80b]
2012-02-25 03:20:51.004972 7: (PG::RecoveryState::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, PG::RecoveryCtx*)+0x36c) [0x70c07c]
2012-02-25 03:20:51.004985 8: (OSD::advance_map(ObjectStore::Transaction&)+0x23c0) [0x593160]
2012-02-25 03:20:51.004998 9: (OSD::handle_osd_map(MOSDMap*)+0x24a0) [0x5b2350]
2012-02-25 03:20:51.005009 10: (OSD::_dispatch(Message*)+0x30b) [0x5c146b]
2012-02-25 03:20:51.005021 11: (OSD::ms_dispatch(Message*)+0x1af) [0x5c1a3f]
2012-02-25 03:20:51.005033 12: (SimpleMessenger::dispatch_entry()+0x89a) [0x60e9aa]
2012-02-25 03:20:51.005045 13: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4f2a7c]
2012-02-25 03:20:51.005062 14: (()+0x7971) [0x7f283ea4d971]
2012-02-25 03:20:51.005073 15: (clone()+0x6d) [0x7f283d0d892d]
2012-02-25 03:20:51.005083 2012-02-25 03:20:51.005107 7f2838dd0700 new dependency OSD::watch_lock (42) -> OSD::map_lock (33) creates a cycle at
 ceph version 0.42.2-168-g266902a (commit:266902a993c8548cc3c32f41be6450ecd78c475b)
2012-02-25 03:20:51.005120 1: (SafeTimer::timer_thread()+0x33b) [0x669abb]
2012-02-25 03:20:51.005131 2: (SafeTimerThread::entry()+0xd) [0x66c46d]
2012-02-25 03:20:51.005142 3: (()+0x7971) [0x7f283ea4d971]
2012-02-25 03:20:51.005153 4: (clone()+0x6d) [0x7f283d0d892d]
2012-02-25 03:20:51.005163 2012-02-25 03:20:51.005173 7f2838dd0700 btw, i am holding these locks:
2012-02-25 03:20:51.005184 7f2838dd0700   OSD::watch_lock (42)
2012-02-25 03:20:51.005194 7f2838dd0700 

common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f2838dd0700 time 2012-02-25 03:20:51.005207
common/lockdep.cc: 201: FAILED assert(0)
 ceph version 0.42.2-168-g266902a (commit:266902a993c8548cc3c32f41be6450ecd78c475b)
 1: (lockdep_will_lock(char const*, int)+0xe1e) [0x5e2c0e]
 2: (PG::lock(bool)+0x11a) [0x7060ea]
 3: (OSD::handle_watch_timeout(void*, ReplicatedPG*, entity_name_t, utime_t)+0x2e) [0x579cee]
 4: (SafeTimer::timer_thread()+0x33b) [0x669abb]
 5: (SafeTimerThread::entry()+0xd) [0x66c46d]
 6: (()+0x7971) [0x7f283ea4d971]
 7: (clone()+0x6d) [0x7f283d0d892d]

Associated revisions

Revision e43546de (diff)
Added by Sage Weil about 12 years ago

osd: fix watch_lock vs map_lock ordering

watch_lock is inside map_lock (and pg->lock), which means we need to
drop it to take pg->lock here. That means verifying in
handle_watch_timeout that we haven't raced with another thread canceling
the timeout event, which would be indicated by

- the entity not appearing in unconnected_watchers
- the entity having a different (presumably newer) expire time

Fixes: #2103
Signed-off-by: Sage Weil <>
Reviewed-by: Samuel Just <>
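
The ordering fix described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Ceph change: the `Pg` struct, the global `watch_lock`, the `std::string` entity key, the `long` expire time, and the boolean return value are all hypothetical stand-ins, and the sketch assumes the timer thread calls the handler with `watch_lock` already held, as the backtrace above suggests.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a PG: its own lock plus the set of watchers
// that are waiting to time out (entity -> expire time).
struct Pg {
    std::mutex lock;                                   // plays the role of pg->lock
    std::map<std::string, long> unconnected_watchers;  // assumed layout
    int timeouts_handled = 0;
};

// Plays the role of OSD::watch_lock, which nests inside pg->lock.
std::mutex watch_lock;

// Called from the timer thread with watch_lock held via `wl`.
// Returns true if the timeout was acted on, false if the event was stale.
bool handle_watch_timeout(std::unique_lock<std::mutex>& wl, Pg& pg,
                          const std::string& entity, long expire) {
    assert(wl.owns_lock());

    // Drop the inner lock, take the outer lock, then retake the inner one,
    // so the acquisition order is always pg.lock -> watch_lock.
    wl.unlock();
    std::lock_guard<std::mutex> pg_guard(pg.lock);
    wl.lock();

    // While watch_lock was dropped, another thread may have cancelled or
    // rescheduled this timeout. Re-validate before acting on it.
    auto it = pg.unconnected_watchers.find(entity);
    if (it == pg.unconnected_watchers.end())
        return false;  // entity no longer waiting: the event was cancelled
    if (it->second != expire)
        return false;  // different (presumably newer) expire time

    pg.unconnected_watchers.erase(it);
    ++pg.timeouts_handled;
    return true;       // watch_lock is still held on return, as on entry
}
```

The two early returns correspond to the two race indicators listed in the commit message. Retaking `watch_lock` before returning keeps the caller's view unchanged: it held the lock before the call and still holds it afterwards.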

History

#1 Updated by Sage Weil about 12 years ago

  • Priority changed from High to Normal

#2 Updated by Sage Weil about 12 years ago

  • Status changed from New to 12

must reenable this in qa suite when it's fixed!

#3 Updated by Sage Weil about 12 years ago

  • Target version changed from v0.43 to v0.44

#4 Updated by Sage Weil about 12 years ago

  • Priority changed from Normal to High

#5 Updated by Sage Weil about 12 years ago

  • Status changed from 12 to In Progress
  • Assignee set to Sage Weil

#6 Updated by Sage Weil about 12 years ago

  • Status changed from In Progress to Fix Under Review

#7 Updated by Sage Weil about 12 years ago

  • Status changed from Fix Under Review to Resolved
