Bug #2183
osd: lockdep cycle with obc lock and watch_lock
% Done: 0%
Source: Development
Regression: No
Severity: 3 - minor
Description
------------------------------------
existing dependency ReplicatedPG::ObjectContext::lock (53) -> OSD::watch_lock (41) at:
ceph version 0.43-252-gee55a58 (commit:ee55a581d24663c00af19f3dceb747d95293d9a1)
2012-03-16 12:26:34.403355 1: (Mutex::Lock(bool)+0x41) [0xb04c97]
2012-03-16 12:26:34.403365 2: (OSD::disconnect_session_watches(OSD::Session*)+0x2cc) [0xb859ba]
2012-03-16 12:26:34.403372 3: (OSD::ms_handle_reset(Connection*)+0x11a) [0xb86052]
2012-03-16 12:26:34.403379 4: (Messenger::ms_deliver_handle_reset(Connection*)+0x4b) [0xcc4e11]
2012-03-16 12:26:34.403386 5: (SimpleMessenger::dispatch_entry()+0x45b) [0xcae68f]
2012-03-16 12:26:34.403392 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0xaaff62]
2012-03-16 12:26:34.403398 7: (Thread::_entry_func(void*)+0x23) [0xccd471]
2012-03-16 12:26:34.403411 8: (()+0x6b50) [0x7f89d88c5b50]
2012-03-16 12:26:34.403418 9: (clone()+0x6d) [0x7f89d6ef990d]

2012-03-16 12:26:34.403430 7f89d2bff700 new dependency OSD::watch_lock (41) -> ReplicatedPG::ObjectContext::lock (53) creates a cycle at
ceph version 0.43-252-gee55a58 (commit:ee55a581d24663c00af19f3dceb747d95293d9a1)
2012-03-16 12:26:34.403437 1: (Mutex::Lock(bool)+0x41) [0xb04c97]
2012-03-16 12:26:34.403443 2: (ReplicatedPG::ObjectContext::ondisk_write_lock()+0x23) [0xb0d603]
2012-03-16 12:26:34.403450 3: (ReplicatedPG::apply_repop(ReplicatedPG::RepGather*)+0x1f4) [0xadda8c]
2012-03-16 12:26:34.403456 4: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x3ed) [0xadef37]
2012-03-16 12:26:34.403471 5: (ReplicatedPG::handle_watch_timeout(void*, entity_name_t, utime_t)+0x669) [0xae1d65]
2012-03-16 12:26:34.403479 6: (OSD::handle_watch_timeout(void*, ReplicatedPG*, entity_name_t, utime_t)+0x8a) [0xb856d4]
2012-03-16 12:26:34.403486 7: (Watch::C_WatchTimeout::finish(int)+0x45) [0xbf702f]
2012-03-16 12:26:34.403492 8: (SafeTimer::timer_thread()+0x291) [0xcce0a3]
2012-03-16 12:26:34.403499 9: (SafeTimerThread::entry()+0x1c) [0xccf0c4]
2012-03-16 12:26:34.403505 10: (Thread::_entry_func(void*)+0x23) [0xccd471]
2012-03-16 12:26:34.403511 11: (()+0x6b50) [0x7f89d88c5b50]
2012-03-16 12:26:34.403517 12: (clone()+0x6d) [0x7f89d6ef990d]

2012-03-16 12:26:34.403526 7f89d2bff700 btw, i am holding these locks:
2012-03-16 12:26:34.403532 7f89d2bff700 OSD::watch_lock (41)
2012-03-16 12:26:34.403538 7f89d2bff700 PG::_lock (46)

common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f89d2bff700 time 2012-03-16 12:26:34.403553
common/lockdep.cc: 201: FAILED assert(0)
ceph version 0.43-252-gee55a58 (commit:ee55a581d24663c00af19f3dceb747d95293d9a1)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xae) [0xcd8586]
2: (lockdep_will_lock(char const*, int)+0x8f8) [0xd5195f]
3: (Mutex::_will_lock()+0x24) [0xb04bca]
4: (Mutex::Lock(bool)+0x41) [0xb04c97]
5: (ReplicatedPG::ObjectContext::ondisk_write_lock()+0x23) [0xb0d603]
6: (ReplicatedPG::apply_repop(ReplicatedPG::RepGather*)+0x1f4) [0xadda8c]
7: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x3ed) [0xadef37]
8: (ReplicatedPG::handle_watch_timeout(void*, entity_name_t, utime_t)+0x669) [0xae1d65]
9: (OSD::handle_watch_timeout(void*, ReplicatedPG*, entity_name_t, utime_t)+0x8a) [0xb856d4]
10: (Watch::C_WatchTimeout::finish(int)+0x45) [0xbf702f]
11: (SafeTimer::timer_thread()+0x291) [0xcce0a3]
12: (SafeTimerThread::entry()+0x1c) [0xccf0c4]
13: (Thread::_entry_func(void*)+0x23) [0xccd471]
14: (()+0x6b50) [0x7f89d88c5b50]
15: (clone()+0x6d) [0x7f89d6ef990d]
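The two stacks above describe a classic ABBA inversion: the connection-reset path takes obc->lock and then watch_lock, while the watch-timeout path takes watch_lock and then obc->lock. A minimal stand-alone sketch of the same shape, using std::mutex stand-ins rather than actual Ceph types:

// ABBA inversion sketch; obc_lock and watch_lock are std::mutex
// stand-ins for the two Ceph locks named by lockdep, not Ceph code.
#include <mutex>
#include <thread>

std::mutex obc_lock;    // stand-in for ReplicatedPG::ObjectContext::lock (53)
std::mutex watch_lock;  // stand-in for OSD::watch_lock (41)

// Reset path (ms_handle_reset -> disconnect_session_watches):
// watch_lock is taken while obc->lock is already held.
void disconnect_path() {
    std::lock_guard<std::mutex> a(obc_lock);
    std::lock_guard<std::mutex> b(watch_lock);   // records obc -> watch
}

// Timeout path (handle_watch_timeout -> eval_repop -> apply_repop ->
// ondisk_write_lock): obc->lock is taken while watch_lock is held.
void timeout_path() {
    std::lock_guard<std::mutex> a(watch_lock);
    std::lock_guard<std::mutex> b(obc_lock);     // records watch -> obc: cycle
}

int main() {
    // With unlucky interleaving the two threads deadlock; lockdep's
    // FAILED assert(0) aborts at the second acquisition order instead
    // of waiting for the deadlock to happen at runtime.
    std::thread t1(disconnect_path), t2(timeout_path);
    t1.join();
    t2.join();
    return 0;
}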
History
#1 Updated by Sage Weil over 11 years ago
crashed it with this multi_stress_watch run:
Iteration 2512 Iteration 2513 Iteration 2514 Iteration 2515 Iteration 2516 Iteration 2517 Iteration 2518 Iteration 2519 Iteration 2520 Iteration 2521 Iteration 2522 Iteration 2523 Iteration 2524 Iteration 2525
test/multi_stress_watch.cc: In function 'int main(int, char**)' thread 7f3a07066760 time 2012-03-16 12:26:04.354547
test/multi_stress_watch.cc: 81: FAILED assert(!ret)
ceph version 0.43-285-g0779e53 (commit:0779e53e9c17e06049bd7155bee813c5bb1086c9)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xae) [0x7f3a068c23ba]
2: (main()+0x3bf) [0x403aa3]
3: (__libc_start_main()+0xfd) [0x7f3a05697ead]
4: /home/sage/src/ceph/src/.libs/lt-multi_stress_watch() [0x403629]
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted (core dumped)
I was also running ./test_stress_watch in another terminal; not sure if it was still going when the osd crashed. Its completion timestamp was 12:27:
Iteration 9994 Iteration 9995 Iteration 9996 Iteration 9997 Iteration 9998 Iteration 9999
[ OK ] WatchStress.Stress1 (229287 ms)
[----------] 1 test from WatchStress (229287 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (229287 ms total)
[ PASSED ] 1 test.
fatty:src 12:27 PM $
#2 Updated by Samuel Just over 11 years ago
I believe it's inappropriate to hold obc->lock there anyway; the pg lock serves that purpose.
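In terms of the sketch under the description, that suggestion removes the second acquisition from the timeout path entirely: the object context is already serialized by the PG lock held on that path, so no watch_lock -> obc->lock edge is ever recorded. A hedged illustration of the idea only, not the actual change (which is the commit referenced below):

// Hedged sketch of the suggestion above, not the actual fix: if PG::_lock
// (already held on the timeout path) serializes object-context access,
// obc->lock need not be taken there, and the cycle's second edge vanishes.
#include <mutex>

std::mutex watch_lock;  // stand-in for OSD::watch_lock (41)
std::mutex pg_lock;     // stand-in for PG::_lock (46)

void timeout_path_fixed() {
    std::lock_guard<std::mutex> w(watch_lock);
    std::lock_guard<std::mutex> p(pg_lock);
    // ... apply the repop here under the PG lock; with no obc->lock
    // acquisition, lockdep records no watch -> obc dependency.
}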
#3 Updated by Samuel Just over 11 years ago
- Assignee set to Samuel Just
#4 Updated by Samuel Just over 11 years ago
- Status changed from New to Resolved
Pushed to master: f923b840edec79df5791a7fb7fdec8b0b40f25f1