Bug #9898

closed

osd: fast dispatch deadlock in mark_down (giant)

Added by Sage Weil over 9 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is basically a dup of the issue we saw with fast dispatch in the Objecter, but in the OSD.

Thread 12 (Thread 0x7f6c90fc5700 (LWP 31158)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f6c976c2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f6c976c2480 in __GI___pthread_mutex_lock (mutex=0x4186b88) at ../nptl/pthread_mutex_lock.c:79
#3  0x0000000000b32cef in Mutex::Lock (this=this@entry=0x4186b78, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:91
#4  0x0000000000b54e81 in SimpleMessenger::mark_down (this=0x4186700, con=0x67cfde0) at msg/SimpleMessenger.cc:636
#5  0x0000000000669f39 in OSD::require_same_peer_instance (this=this@entry=0x4818000, op=..., map=..., is_fast_dispatch=is_fast_dispatch@entry=true) at osd/OSD.cc:6764
#6  0x00000000006e0f15 in OSD::handle_replica_op<MOSDPGPull, 106> (this=this@entry=0x4818000, op=..., osdmap=...) at osd/OSD.cc:8160
#7  0x000000000069ae1e in OSD::dispatch_op_fast (this=this@entry=0x4818000, op=..., osdmap=...) at osd/OSD.cc:5758
#8  0x000000000069afb8 in OSD::dispatch_session_waiting (this=this@entry=0x4818000, session=session@entry=0x6866800, osdmap=...) at osd/OSD.cc:5402
#9  0x000000000069b39e in OSD::ms_fast_dispatch (this=0x4818000, m=<optimized out>) at osd/OSD.cc:5512
#10 0x0000000000c21db6 in ms_fast_dispatch (m=0x513b600, this=0x4186700) at msg/Messenger.h:503
#11 DispatchQueue::fast_dispatch (this=0x41868b8, m=0x513b600) at msg/DispatchQueue.cc:71
#12 0x0000000000c46836 in Pipe::reader (this=0x5024c00) at msg/Pipe.cc:1591
#13 0x0000000000c4f4ad in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:50
#14 0x00007f6c976c0182 in start_thread (arg=0x7f6c90fc5700) at pthread_create.c:312
#15 0x00007f6c95c2c38d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

vs
Thread 59 (Thread 0x7f6c84cba700 (LWP 29734)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000c347e5 in Wait (mutex=..., this=0x5024e18) at ./common/Cond.h:55
#2  Pipe::stop_and_wait (this=this@entry=0x5024c00) at msg/Pipe.cc:1437
#3  0x0000000000b54f08 in SimpleMessenger::mark_down (this=0x4186700, con=<optimized out>) at msg/SimpleMessenger.cc:643
#4  0x0000000000669f39 in OSD::require_same_peer_instance (this=this@entry=0x4818000, op=..., map=..., is_fast_dispatch=is_fast_dispatch@entry=false) at osd/OSD.cc:6764
#5  0x000000000067829e in OSD::require_same_or_newer_map (this=this@entry=0x4818000, op=..., epoch=207, is_fast_dispatch=is_fast_dispatch@entry=false) at osd/OSD.cc:6808
#6  0x00000000006a0ef7 in OSD::handle_pg_log (this=0x4818000, op=...) at osd/OSD.cc:7347
#7  0x00000000006a3678 in OSD::dispatch_op (this=this@entry=0x4818000, op=...) at osd/OSD.cc:5696
#8  0x00000000006a8ae8 in OSD::_dispatch (this=this@entry=0x4818000, m=m@entry=0x4652300) at osd/OSD.cc:5843
#9  0x00000000006a91a7 in OSD::ms_dispatch (this=0x4818000, m=0x4652300) at osd/OSD.cc:5386
#10 0x0000000000c22d69 in ms_deliver_dispatch (m=0x4652300, this=0x4186700) at msg/Messenger.h:532
#11 DispatchQueue::entry (this=0x41868b8) at msg/DispatchQueue.cc:185
#12 0x0000000000b5f0bd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:104
#13 0x00007f6c976c0182 in start_thread (arg=0x7f6c84cba700) at pthread_create.c:312
#14 0x00007f6c95c2c38d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

also

Thread 56 (Thread 0x7f6c834b7700 (LWP 29737)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f6c976c2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f6c976c2480 in __GI___pthread_mutex_lock (mutex=0x4186b88) at ../nptl/pthread_mutex_lock.c:79
#3  0x0000000000b32cef in Mutex::Lock (this=this@entry=0x4186b78, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:91
#4  0x0000000000b5a64b in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:115
#5  SimpleMessenger::get_connection (this=0x4186700, dest=...) at msg/SimpleMessenger.cc:385
#6  0x00000000006628e2 in OSDService::get_con_osd_cluster (this=this@entry=0x4819710, peer=1, from_epoch=<optimized out>) at osd/OSD.cc:700
#7  0x00000000006884bd in OSD::handle_osd_ping (this=this@entry=0x4818000, m=m@entry=0x62fcee0) at osd/OSD.cc:3763
#8  0x0000000000689aab in OSD::heartbeat_dispatch (this=0x4818000, m=0x62fcee0) at osd/OSD.cc:5344
#9  0x0000000000c22d69 in ms_deliver_dispatch (m=0x62fcee0, this=0x4188300) at msg/Messenger.h:532
#10 DispatchQueue::entry (this=0x41884b8) at msg/DispatchQueue.cc:185
#11 0x0000000000b5f0bd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:104
#12 0x00007f6c976c0182 in start_thread (arg=0x7f6c834b7700) at pthread_create.c:312
#13 0x00007f6c95c2c38d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

full thread dump attached
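
For readers following the three backtraces, the wait cycle can be reproduced in a small model. The sketch below is Python rather than Ceph's C++, and every name in it (`simulate`, `messenger_lock`, `reader_idle`, `patience`) is hypothetical. One thread takes a lock and then waits for the reader to go idle, as thread 59 appears to do in SimpleMessenger::mark_down / Pipe::stop_and_wait. The reader needs that same lock before it can finish, as thread 12 does when fast dispatch calls mark_down. A third thread only wants the lock, as thread 56 does in get_connection. Timeouts stand in for the real, unbounded waits so that the model terminates. The `wait_under_lock=False` branch shows one generic way to break such a cycle; it is not necessarily the fix that was applied for this bug.

```python
import threading


def simulate(wait_under_lock, patience=0.2):
    """Return what each modelled thread observed.

    wait_under_lock=True mirrors the backtraces: the lock stays held while
    waiting for the reader. False releases the lock before waiting.
    """
    messenger_lock = threading.Lock()  # stands in for the messenger lock
    lock_taken = threading.Event()     # the mark_down caller has taken the lock
    reader_idle = threading.Event()    # the reader has left fast dispatch
    seen = {}

    def stop_and_wait():
        # Like thread 59: wait for the reader, but give up after 5x patience
        # so the model ends (the real wait has no timeout).
        seen["reader_stopped"] = reader_idle.wait(5 * patience)

    def dispatcher():
        messenger_lock.acquire()
        lock_taken.set()
        if wait_under_lock:
            stop_and_wait()            # waits while still holding the lock
            messenger_lock.release()
        else:
            messenger_lock.release()   # drop the lock first, then wait
            stop_and_wait()

    def needs_lock(name, then_set=None):
        # Like threads 12 and 56: blocked until the lock can be taken.
        lock_taken.wait()
        got = messenger_lock.acquire(timeout=patience)
        seen[name] = got
        if got:
            messenger_lock.release()
            if then_set is not None:
                then_set.set()         # the reader finishes only if it got the lock

    threads = [
        threading.Thread(target=dispatcher),
        threading.Thread(target=needs_lock, args=("reader_got_lock", reader_idle)),
        threading.Thread(target=needs_lock, args=("heartbeat_got_lock",)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print("wait under lock:", simulate(wait_under_lock=True))
    print("release first:  ", simulate(wait_under_lock=False))
```

With `wait_under_lock=True` all three values should come back False, matching the three stuck threads above. With `wait_under_lock=False` all three should come back True.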


Files

a (57.5 KB) a Sage Weil, 10/26/2014 12:18 PM

Related issues 2 (0 open, 2 closed)

Related to Ceph - Feature #9598: re-enable Objecter fast dispatch (Resolved, Sage Weil, 09/25/2014)

Has duplicate Ceph - Bug #9895: Master/giant branch: OSD deadlock during recovery (Duplicate, 10/26/2014)

Actions #1

Updated by Sage Weil over 9 years ago

  • File a a added

full backtrace

Actions #2

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-10-24_21:12:40-rados-wip-sam-testing-distro-basic-multi/570144

Actions #3

Updated by Andrey Korolyov over 9 years ago

This looks like the same issue I reported a few hours earlier: #9895. Please close either mine or this one as a duplicate.

Actions #4

Updated by Andrey Korolyov over 9 years ago

Sage, everyone - approximately when will the fix for this land in master? It is effectively blocking our giant/rocksdb testing right now.

Actions #5

Updated by Samuel Just over 9 years ago

  • Status changed from New to Resolved
Actions #6

Updated by Sage Weil over 9 years ago

  • Status changed from Resolved to Pending Backport
Actions #7

Updated by Loïc Dachary almost 9 years ago

  • Status changed from Pending Backport to Resolved
  • Regression set to No

It was intended for giant which is now retired.

Actions #8

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)