Bug #2183: osd: lockdep cycle with obc lock and watch_lock

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status: Resolved
Priority: High
Category: OSD
% Done: 0%
Source: Development

Description

------------------------------------
existing dependency ReplicatedPG::ObjectContext::lock (53) -> OSD::watch_lock (41) at:
 ceph version 0.43-252-gee55a58 (commit:ee55a581d24663c00af19f3dceb747d95293d9a1)
2012-03-16 12:26:34.403355 1: (Mutex::Lock(bool)+0x41) [0xb04c97]
2012-03-16 12:26:34.403365 2: (OSD::disconnect_session_watches(OSD::Session*)+0x2cc) [0xb859ba]
2012-03-16 12:26:34.403372 3: (OSD::ms_handle_reset(Connection*)+0x11a) [0xb86052]
2012-03-16 12:26:34.403379 4: (Messenger::ms_deliver_handle_reset(Connection*)+0x4b) [0xcc4e11]
2012-03-16 12:26:34.403386 5: (SimpleMessenger::dispatch_entry()+0x45b) [0xcae68f]
2012-03-16 12:26:34.403392 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0xaaff62]
2012-03-16 12:26:34.403398 7: (Thread::_entry_func(void*)+0x23) [0xccd471]
2012-03-16 12:26:34.403411 8: (()+0x6b50) [0x7f89d88c5b50]
2012-03-16 12:26:34.403418 9: (clone()+0x6d) [0x7f89d6ef990d]
2012-03-16 12:26:34.403422 2012-03-16 12:26:34.403430 7f89d2bff700 new dependency OSD::watch_lock (41) -> ReplicatedPG::ObjectContext::lock (53) creates a cycle at
 ceph version 0.43-252-gee55a58 (commit:ee55a581d24663c00af19f3dceb747d95293d9a1)
2012-03-16 12:26:34.403437 1: (Mutex::Lock(bool)+0x41) [0xb04c97]
2012-03-16 12:26:34.403443 2: (ReplicatedPG::ObjectContext::ondisk_write_lock()+0x23) [0xb0d603]
2012-03-16 12:26:34.403450 3: (ReplicatedPG::apply_repop(ReplicatedPG::RepGather*)+0x1f4) [0xadda8c]
2012-03-16 12:26:34.403456 4: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x3ed) [0xadef37]
2012-03-16 12:26:34.403471 5: (ReplicatedPG::handle_watch_timeout(void*, entity_name_t, utime_t)+0x669) [0xae1d65]
2012-03-16 12:26:34.403479 6: (OSD::handle_watch_timeout(void*, ReplicatedPG*, entity_name_t, utime_t)+0x8a) [0xb856d4]
2012-03-16 12:26:34.403486 7: (Watch::C_WatchTimeout::finish(int)+0x45) [0xbf702f]
2012-03-16 12:26:34.403492 8: (SafeTimer::timer_thread()+0x291) [0xcce0a3]
2012-03-16 12:26:34.403499 9: (SafeTimerThread::entry()+0x1c) [0xccf0c4]
2012-03-16 12:26:34.403505 10: (Thread::_entry_func(void*)+0x23) [0xccd471]
2012-03-16 12:26:34.403511 11: (()+0x6b50) [0x7f89d88c5b50]
2012-03-16 12:26:34.403517 12: (clone()+0x6d) [0x7f89d6ef990d]
2012-03-16 12:26:34.403521 2012-03-16 12:26:34.403526 7f89d2bff700 btw, i am holding these locks:
2012-03-16 12:26:34.403532 7f89d2bff700   OSD::watch_lock (41)
2012-03-16 12:26:34.403538 7f89d2bff700   PG::_lock (46)
2012-03-16 12:26:34.403542 7f89d2bff700 

common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f89d2bff700 time 2012-03-16 12:26:34.403553
common/lockdep.cc: 201: FAILED assert(0)
 ceph version 0.43-252-gee55a58 (commit:ee55a581d24663c00af19f3dceb747d95293d9a1)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xae) [0xcd8586]
 2: (lockdep_will_lock(char const*, int)+0x8f8) [0xd5195f]
 3: (Mutex::_will_lock()+0x24) [0xb04bca]
 4: (Mutex::Lock(bool)+0x41) [0xb04c97]
 5: (ReplicatedPG::ObjectContext::ondisk_write_lock()+0x23) [0xb0d603]
 6: (ReplicatedPG::apply_repop(ReplicatedPG::RepGather*)+0x1f4) [0xadda8c]
 7: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x3ed) [0xadef37]
 8: (ReplicatedPG::handle_watch_timeout(void*, entity_name_t, utime_t)+0x669) [0xae1d65]
 9: (OSD::handle_watch_timeout(void*, ReplicatedPG*, entity_name_t, utime_t)+0x8a) [0xb856d4]
 10: (Watch::C_WatchTimeout::finish(int)+0x45) [0xbf702f]
 11: (SafeTimer::timer_thread()+0x291) [0xcce0a3]
 12: (SafeTimerThread::entry()+0x1c) [0xccf0c4]
 13: (Thread::_entry_func(void*)+0x23) [0xccd471]
 14: (()+0x6b50) [0x7f89d88c5b50]
 15: (clone()+0x6d) [0x7f89d6ef990d]
 ceph version 0.43-252-gee55a58 (commit:ee55a581d24663c00af19f3dceb747d95293d9a1)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xae) [0xcd8586]
 2: (lockdep_will_lock(char const*, int)+0x8f8) [0xd5195f]
 3: (Mutex::_will_lock()+0x24) [0xb04bca]
 4: (Mutex::Lock(bool)+0x41) [0xb04c97]
 5: (ReplicatedPG::ObjectContext::ondisk_write_lock()+0x23) [0xb0d603]
 6: (ReplicatedPG::apply_repop(ReplicatedPG::RepGather*)+0x1f4) [0xadda8c]
 7: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x3ed) [0xadef37]
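
In short: the reset path (ms_handle_reset -> disconnect_session_watches) acquires OSD::watch_lock while an object context lock is already held, and the timer path (handle_watch_timeout -> eval_repop -> apply_repop -> ondisk_write_lock) acquires the object context lock while watch_lock is held. A minimal standalone sketch of that inverted ordering, using plain std::mutex stand-ins rather than the real Ceph types (the function bodies are placeholders, not OSD code):

#include <mutex>
#include <thread>

std::mutex obc_lock;    // stand-in for ReplicatedPG::ObjectContext::lock (53)
std::mutex watch_lock;  // stand-in for OSD::watch_lock (41)

// Reset path: watch_lock is acquired while an object context lock is held,
// which records the obc_lock -> watch_lock dependency.
void reset_path() {
  std::lock_guard<std::mutex> o(obc_lock);
  std::lock_guard<std::mutex> w(watch_lock);
  // ... tear down the session's watches ...
}

// Timer path: the object context lock is acquired while watch_lock is held,
// adding watch_lock -> obc_lock and closing the cycle lockdep asserts on.
void timeout_path() {
  std::lock_guard<std::mutex> w(watch_lock);
  std::lock_guard<std::mutex> o(obc_lock);
  // ... apply the repop ...
}

int main() {
  // Run the two paths concurrently and each thread can end up holding one
  // lock while waiting for the other: a classic ABBA deadlock.
  std::thread a(reset_path), b(timeout_path);
  a.join();
  b.join();
}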

Actions #1

Updated by Sage Weil about 12 years ago

crashed it with this multi_stress_watch run:

Iteration 2512
Iteration 2513
Iteration 2514
Iteration 2515
Iteration 2516
Iteration 2517
Iteration 2518
Iteration 2519
Iteration 2520
Iteration 2521
Iteration 2522
Iteration 2523
Iteration 2524
Iteration 2525
test/multi_stress_watch.cc: In function 'int main(int, char**)' thread 7f3a07066760 time 2012-03-16 12:26:04.354547
test/multi_stress_watch.cc: 81: FAILED assert(!ret)
 ceph version 0.43-285-g0779e53 (commit:0779e53e9c17e06049bd7155bee813c5bb1086c9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xae) [0x7f3a068c23ba]
 2: (main()+0x3bf) [0x403aa3]
 3: (__libc_start_main()+0xfd) [0x7f3a05697ead]
 4: /home/sage/src/ceph/src/.libs/lt-multi_stress_watch() [0x403629]
 ceph version 0.43-285-g0779e53 (commit:0779e53e9c17e06049bd7155bee813c5bb1086c9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xae) [0x7f3a068c23ba]
 2: (main()+0x3bf) [0x403aa3]
 3: (__libc_start_main()+0xfd) [0x7f3a05697ead]
 4: /home/sage/src/ceph/src/.libs/lt-multi_stress_watch() [0x403629]
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted (core dumped)

Also running ./test_stress_watch in another terminal; not sure if it was still going when the osd crashed. The completion timestamp was 12:27:

Iteration 9994
Iteration 9995
Iteration 9996
Iteration 9997
Iteration 9998
Iteration 9999
[       OK ] WatchStress.Stress1 (229287 ms)
[----------] 1 test from WatchStress (229287 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (229287 ms total)
[  PASSED  ] 1 test.
fatty:src 12:27 PM $ 

Actions #2

Updated by Samuel Just about 12 years ago

I believe it's inappropriate to hold obc->lock there anyway; the pg lock serves that purpose.
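A rough sketch of that direction, continuing the std::mutex stand-ins from the description (an illustration of the idea only, not the fix that was eventually pushed): the timeout path already runs with PG::_lock held per the lockdep dump above, so not taking the obc lock there removes the watch_lock -> obc_lock edge and breaks the cycle.

#include <mutex>

std::mutex watch_lock;  // stand-in for OSD::watch_lock
std::mutex pg_lock;     // stand-in for PG::_lock
std::mutex obc_lock;    // stand-in for ReplicatedPG::ObjectContext::lock; not taken in this path

// Hypothetical only: the repop is applied under the PG lock alone, so no
// watch_lock -> obc_lock dependency is recorded and the reset path's
// obc_lock -> watch_lock ordering no longer forms a cycle.
void timeout_path_without_obc_lock() {
  std::lock_guard<std::mutex> w(watch_lock);
  std::lock_guard<std::mutex> p(pg_lock);
  // ... apply_repop-style work ...
}

int main() { timeout_path_without_obc_lock(); }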

Actions #3

Updated by Samuel Just about 12 years ago

  • Assignee set to Samuel Just
Actions #4

Updated by Samuel Just about 12 years ago

  • Status changed from New to Resolved

pushed to master f923b840edec79df5791a7fb7fdec8b0b40f25f1
