Project

General

Profile

Bug #5488

librbd: deadlock in image refresh

Added by Sage Weil over 5 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Target version:
-
Start date:
07/01/2013
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
hammer,firefly
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

(gdb) thr app all bt

Thread 26 (Thread 0x7f3a1491f700 (LWP 29993)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a1668d59b in ceph::log::Log::entry (this=0x830030) at log/Log.cc:323
#2  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a1491f700) at pthread_create.c:308
#3  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 25 (Thread 0x7f3a1411e700 (LWP 29994)):
#0  sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:102
#1  0x00007f3a1677c508 in CephContextServiceThread::entry (this=0x82eeb0) at common/ceph_context.cc:57
#2  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a1411e700) at pthread_create.c:308
#3  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 24 (Thread 0x7f3a1391d700 (LWP 29995)):
#0  0x00007f3a15e4c313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f3a1660ebdb in AdminSocket::entry (this=0x830580) at common/admin_socket.cc:230
#2  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a1391d700) at pthread_create.c:308
#3  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 23 (Thread 0x7f3a1311c700 (LWP 29996)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a166bbbf8 in Wait (mutex=..., this=0x8370f0) at ./common/Cond.h:55
#2  DispatchQueue::entry (this=0x837088) at msg/DispatchQueue.cc:137
#3  0x00007f3a1672581d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:104
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a1311c700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 22 (Thread 0x7f3a1291b700 (LWP 29997)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16720e09 in Wait (mutex=..., this=0x837480) at ./common/Cond.h:55
#2  SimpleMessenger::reaper_entry (this=0x836fa0) at msg/SimpleMessenger.cc:205
#3  0x00007f3a16725fad in SimpleMessenger::ReaperThread::entry (this=<optimized out>) at msg/SimpleMessenger.h:422
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a1291b700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 21 (Thread 0x7f3a1211a700 (LWP 29998)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1  0x00007f3a1662743b in WaitUntil (when=..., mutex=..., this=0x832998) at common/Cond.h:71
#2  SafeTimer::timer_thread (this=0x832988) at common/Timer.cc:114
#3  0x00007f3a16627ded in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a1211a700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 20 (Thread 0x7f3a11919700 (LWP 29999)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16628be0 in Wait (mutex=..., this=0x832ab0) at ./common/Cond.h:55
#2  Finisher::finisher_thread_entry (this=0x832a50) at common/Finisher.cc:80
#3  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a11919700) at pthread_create.c:308
#4  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()

Thread 19 (Thread 0x7f3a11017700 (LWP 30002)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1  0x00007f3a1662743b in WaitUntil (when=..., mutex=..., this=0x832dd8) at common/Cond.h:71
#2  SafeTimer::timer_thread (this=0x832dc8) at common/Timer.cc:114
#3  0x00007f3a16627ded in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a11017700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 18 (Thread 0x7f3a10816700 (LWP 30003)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16628be0 in Wait (mutex=..., this=0x832f18) at ./common/Cond.h:55
#2  Finisher::finisher_thread_entry (this=0x832eb8) at common/Finisher.cc:80
#3  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a10816700) at pthread_create.c:308
#4  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()

Thread 17 (Thread 0x7f3a039f9700 (LWP 32628)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16628be0 in Wait (mutex=..., this=0x84b6b0) at ./common/Cond.h:55
#2  Finisher::finisher_thread_entry (this=0x84b650) at common/Finisher.cc:80
#3  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a039f9700) at pthread_create.c:308
#4  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()

Thread 16 (Thread 0x7f3a031f8700 (LWP 32629)):
#0  pthread_rwlock_rdlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_rdlock.S:85
#1  0x00007f3a172b309b in get_read (this=0x87bb80) at ./common/RWLock.h:51
#2  librbd::LibrbdWriteback::write (this=0x84d6d0, oid=..., oloc=..., off=1503261, len=<optimized out>, snapc=..., bl=..., mtime=..., trunc_size=0, trunc_seq=0, oncommit=0x7f39f40008c0) at librbd/LibrbdWriteback.cc:143
#3  0x00007f3a172baa56 in ObjectCacher::bh_write (this=0x84b420, bh=0x863ce0) at osdc/ObjectCacher.cc:797
---Type <return> to continue, or q <return> to quit---
#4  0x00007f3a172bbc80 in ObjectCacher::flusher_entry (this=0x84b420) at osdc/ObjectCacher.cc:1434
#5  0x00007f3a172cc0bd in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:353
#6  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a031f8700) at pthread_create.c:308
#7  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#8  0x0000000000000000 in ?? ()

Thread 15 (Thread 0x7f3a024f2700 (LWP 32630)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16628be0 in Wait (mutex=..., this=0x8a8760) at ./common/Cond.h:55
#2  Finisher::finisher_thread_entry (this=0x8a8700) at common/Finisher.cc:80
#3  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a024f2700) at pthread_create.c:308
#4  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()

Thread 14 (Thread 0x7f3a01cf1700 (LWP 32631)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1  0x00007f3a172bb809 in WaitUntil (when=..., mutex=..., this=0x8a86a0) at ./common/Cond.h:71
#2  WaitInterval (interval=..., mutex=..., cct=<optimized out>, this=0x8a86a0) at ./common/Cond.h:79
#3  ObjectCacher::flusher_entry (this=0x8a84d0) at osdc/ObjectCacher.cc:1439
#4  0x00007f3a172cc0bd in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:353
#5  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a01cf1700) at pthread_create.c:308
#6  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 13 (Thread 0x7f3a004ee700 (LWP 939)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16710f8b in Wait (mutex=..., this=0x7f39fc006b88) at ./common/Cond.h:55
#2  Pipe::writer (this=0x7f39fc0069a0) at msg/Pipe.cc:1573
#3  0x00007f3a1671bccd in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a004ee700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 12 (Thread 0x7f3a11118700 (LWP 940)):
#0  0x00007f3a15e4c313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f3a16704e32 in Pipe::tcp_read_wait (this=<optimized out>) at msg/Pipe.cc:2107
#2  0x00007f3a16705150 in Pipe::tcp_read (this=0x7f39fc0069a0, buf=0x7f3a11117dbf "\377", len=1) at msg/Pipe.cc:2080
#3  0x00007f3a16718461 in Pipe::reader (this=0x7f39fc0069a0) at msg/Pipe.cc:1322
#4  0x00007f3a1671bced in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#5  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a11118700) at pthread_create.c:308
#6  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 11 (Thread 0x7f3a03efe700 (LWP 2933)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16710f8b in Wait (mutex=..., this=0x7f39fc043418) at ./common/Cond.h:55
#2  Pipe::writer (this=0x7f39fc043230) at msg/Pipe.cc:1573
#3  0x00007f3a1671bccd in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a03efe700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 10 (Thread 0x7f3a03bfb700 (LWP 2934)):
#0  0x00007f3a15e4c313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f3a16704e32 in Pipe::tcp_read_wait (this=<optimized out>) at msg/Pipe.cc:2107
#2  0x00007f3a16705150 in Pipe::tcp_read (this=0x7f39fc043230, buf=0x7f3a03bfadbf "\377", len=1) at msg/Pipe.cc:2080
#3  0x00007f3a16718461 in Pipe::reader (this=0x7f39fc043230) at msg/Pipe.cc:1322
#4  0x00007f3a1671bced in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#5  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a03bfb700) at pthread_create.c:308
#6  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f3a03dfd700 (LWP 3060)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16710f8b in Wait (mutex=..., this=0x7f39fc020b98) at ./common/Cond.h:55
#2  Pipe::writer (this=0x7f39fc0209b0) at msg/Pipe.cc:1573
#3  0x00007f3a1671bccd in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a03dfd700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f3a03afa700 (LWP 3061)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16710f8b in Wait (mutex=..., this=0x7f39fc01fad8) at ./common/Cond.h:55
#2  Pipe::writer (this=0x7f39fc01f8f0) at msg/Pipe.cc:1573
#3  0x00007f3a1671bccd in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a03afa700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f3a027f5700 (LWP 3062)):
#0  0x00007f3a15e4c313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f3a16704e32 in Pipe::tcp_read_wait (this=<optimized out>) at msg/Pipe.cc:2107
#2  0x00007f3a16705150 in Pipe::tcp_read (this=0x7f39fc0209b0, buf=0x7f3a027f4dbf "\377", len=1) at msg/Pipe.cc:2080
#3  0x00007f3a16718461 in Pipe::reader (this=0x7f39fc0209b0) at msg/Pipe.cc:1322
#4  0x00007f3a1671bced in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#5  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a027f5700) at pthread_create.c:308
---Type <return> to continue, or q <return> to quit---
#6  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f3a176fe700 (LWP 3063)):
#0  0x00007f3a15e4c313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f3a16704e32 in Pipe::tcp_read_wait (this=<optimized out>) at msg/Pipe.cc:2107
#2  0x00007f3a16705150 in Pipe::tcp_read (this=0x7f39fc01f8f0, buf=0x7f3a176fddbf "\377", len=1) at msg/Pipe.cc:2080
#3  0x00007f3a16718461 in Pipe::reader (this=0x7f39fc01f8f0) at msg/Pipe.cc:1322
#4  0x00007f3a1671bced in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#5  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a176fe700) at pthread_create.c:308
#6  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f3a028f6700 (LWP 3064)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16710f8b in Wait (mutex=..., this=0x7f39fc0089b8) at ./common/Cond.h:55
#2  Pipe::writer (this=0x7f39fc0087d0) at msg/Pipe.cc:1573
#3  0x00007f3a1671bccd in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a028f6700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f3a03fff700 (LWP 3065)):
#0  0x00007f3a15e4c313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f3a16704e32 in Pipe::tcp_read_wait (this=<optimized out>) at msg/Pipe.cc:2107
#2  0x00007f3a16705150 in Pipe::tcp_read (this=0x7f39fc0087d0, buf=0x7f3a03ffedbf "\377", len=1) at msg/Pipe.cc:2080
#3  0x00007f3a16718461 in Pipe::reader (this=0x7f39fc0087d0) at msg/Pipe.cc:1322
#4  0x00007f3a1671bced in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#5  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a03fff700) at pthread_create.c:308
#6  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f3a029f7700 (LWP 3066)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f3a16710f8b in Wait (mutex=..., this=0x7f39fc002298) at ./common/Cond.h:55
#2  Pipe::writer (this=0x7f39fc0020b0) at msg/Pipe.cc:1573
#3  0x00007f3a1671bccd in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59
#4  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a029f7700) at pthread_create.c:308
#5  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f3a03cfc700 (LWP 3067)):
#0  0x00007f3a15e4c313 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f3a16704e32 in Pipe::tcp_read_wait (this=<optimized out>) at msg/Pipe.cc:2107
#2  0x00007f3a16705150 in Pipe::tcp_read (this=0x7f39fc0020b0, buf=0x7f3a03cfbdbf "\377", len=1) at msg/Pipe.cc:2080
#3  0x00007f3a16718461 in Pipe::reader (this=0x7f39fc0020b0) at msg/Pipe.cc:1322
#4  0x00007f3a1671bced in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#5  0x00007f3a15b4ee9a in start_thread (arg=0x7f3a03cfc700) at pthread_create.c:308
#6  0x00007f3a15e57ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f3a17706780 (LWP 29992)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
#1  0x00007f3a15b51065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f3a15b50eba in __pthread_mutex_lock (mutex=0x87bb38) at pthread_mutex_lock.c:61
#3  0x00007f3a16607fb3 in Mutex::Lock (this=0x87bb28, no_lockdep=<optimized out>) at common/Mutex.cc:89
#4  0x00007f3a17290cdc in librbd::ImageCtx::clear_nonexistence_cache (this=0x87b970) at librbd/ImageCtx.cc:588
#5  0x00007f3a172a8440 in librbd::refresh_parent (ictx=0x87b970) at librbd/internal.cc:1614
#6  0x00007f3a1729b9c9 in librbd::ictx_refresh (ictx=0x87b970) at librbd/internal.cc:1760
#7  0x00007f3a1729c9f8 in librbd::ictx_check (ictx=0x87b970) at librbd/internal.cc:1590
#8  0x00007f3a172a4a74 in librbd::info (ictx=0x87b970, info=..., infosize=160) at librbd/internal.cc:1175
#9  0x000000000040412e in check_size () at test/librbd/fsx.c:464
#10 0x000000000040597d in test () at test/librbd/fsx.c:1097
#11 0x00000000004031cf in main (argc=<optimized out>, argv=<optimized out>) at test/librbd/fsx.c:1560

run was
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-07-01_01:00:30-rbd-master-testing-basic/51544$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 9e41c3a278d0c9a62eb96eae6c758504ce168e76
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: e235958fb13e654aa1ad5e1c5a6ccf76a95196ca
  install:
    ceph:
      sha1: e235958fb13e654aa1ad5e1c5a6ccf76a95196ca
  s3tests:
    branch: master
  workunit:
    sha1: e235958fb13e654aa1ad5e1c5a6ccf76a95196ca
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    conf:
      client:
        rbd cache: true
- rbd_fsx:
    clients:
    - client.0
    ops: 5000

PR4013-CherryPick.txt View (2.86 KB) Jean-Charles Lopez, 04/06/2015 01:15 AM

Associated revisions

Revision 703ba377 (diff)
Added by Jason Dillaman over 3 years ago

librbd: acquire cache_lock before refreshing parent

cache_lock needs to be acquired before snap_lock to avoid
the potential for deadlock.

Fixes: #5488
Signed-off-by: Jason Dillaman <>

Revision 364563aa (diff)
Added by Jason Dillaman over 3 years ago

librbd: acquire cache_lock before refreshing parent

cache_lock needs to be acquired before snap_lock to avoid
the potential for deadlock.

Fixes: #5488
Signed-off-by: Jason Dillaman <>

Revision 0cdc93fb (diff)
Added by Jason Dillaman over 3 years ago

librbd: acquire cache_lock before refreshing parent

cache_lock needs to be acquired before snap_lock to avoid
the potential for deadlock.

Fixes: #5488
Signed-off-by: Jason Dillaman <>
(cherry picked from commit 703ba377e3de4007920f2ed7d8a0780f68676fe2)

Conflicts:
src/librbd/internal.cc
resolved by moving int r; in the scope of the block

History

#1 Updated by Sage Weil over 5 years ago

  • Status changed from New to Verified

looks like bh_write is holding cache_lock (== ObjectCacher::lock) and trying ot take snap_read, while ictx_refresh() is holding snap_read (for write) and trying to take cache_lock.

#2 Updated by Sage Weil over 5 years ago

  • Project changed from Ceph to rbd
  • Category deleted (librbd)

#3 Updated by Sage Weil over 5 years ago

  • Assignee set to Sage Weil

#4 Updated by Sage Weil over 5 years ago

  • Status changed from Verified to In Progress

#5 Updated by Sage Weil over 5 years ago

  • Assignee changed from Sage Weil to Josh Durgin

#6 Updated by Sage Weil over 5 years ago

  • Priority changed from Urgent to High

#7 Updated by Josh Durgin over 5 years ago

Another potential deadlock popped up: md_lock vs librados' lock ordering. Still trying to find which callback takes the md_lock.

#8 Updated by Ian Colle over 4 years ago

  • Status changed from In Progress to Need More Info
  • Priority changed from High to Normal

#9 Updated by Josh Durgin over 3 years ago

  • Status changed from Need More Info to Verified
  • Assignee deleted (Josh Durgin)
  • Priority changed from Normal to High

Encountered this during a very long fsx run, deadlock after a flatten due to ictx_refresh()->refresh_parent()->clear_nonexistence_cache() holding snap_lock and trying to take cache_lock, vs writeback from the cache holding cache_lock and waiting on snap_lock.

Thread 23 (Thread 0x7fc21f7fe710 (LWP 25808)):
#0  pthread_rwlock_rdlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_rdlock.S:85
#1  0x00007fc22f8a92d7 in RWLock::get_read (this=0x2adf380) at ./common/RWLock.h:70
#2  0x00007fc22f925563 in librbd::LibrbdWriteback::write (this=0x2ad4b80, oid=..., oloc=..., off=3072926, len=7266, snapc=...,
    bl=..., mtime=..., trunc_size=0, trunc_seq=0, oncommit=0x7fc1b800dab0) at librbd/LibrbdWriteback.cc:160
#3  0x00007fc22fdcc493 in ObjectCacher::bh_write (this=0x2ad7e70, bh=0x2aed950) at osdc/ObjectCacher.cc:847
#4  0x00007fc22fdd3298 in ObjectCacher::flusher_entry (this=0x2ad7e70) at osdc/ObjectCacher.cc:1535
---Type <return> to continue, or q <return> to quit---
#5  0x00007fc22fdd9bae in ObjectCacher::FlusherThread::entry() () from .libs/librbd.so.1
#6  0x00007fc22f994f1a in Thread::entry_wrapper (this=0x2ad80a0) at common/Thread.cc:61
#7  0x00007fc22f994e8c in Thread::_entry_func (arg=0x2ad80a0) at common/Thread.cc:45
#8  0x00007fc22c5d08ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#9  0x00007fc22b98802d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()
...
Thread 1 (Thread 0x7fc232ee3760 (LWP 3225)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007fc22c5d30e9 in _L_lock_953 () from /lib/libpthread.so.0
#2  0x00007fc22c5d2f0b in __pthread_mutex_lock (mutex=0x2adf338) at pthread_mutex_lock.c:61
#3  0x00007fc22fae759e in Mutex::Lock (this=0x2adf328, no_lockdep=false) at common/Mutex.cc:91
#4  0x00007fc22f8d4db7 in librbd::ImageCtx::clear_nonexistence_cache (this=0x2adf0d0) at librbd/ImageCtx.cc:681
#5  0x00007fc22f90aabc in librbd::refresh_parent (ictx=0x2adf0d0) at librbd/internal.cc:1851
#6  0x00007fc22f90c7e6 in librbd::ictx_refresh (ictx=0x2adf0d0) at librbd/internal.cc:2021
#7  0x00007fc22f90a782 in librbd::ictx_check (ictx=0x2adf0d0) at librbd/internal.cc:1827
#8  0x00007fc22f9067db in librbd::get_flags (ictx=0x2adf0d0, flags=0x7fff0c527170) at librbd/internal.cc:1482
#9  0x00007fc22f8a7348 in rbd_get_flags (image=0x2adf0d0, flags=0x7fff0c527170) at librbd/librbd.cc:1191
#10 0x000000000075a9ed in librbd_verify_object_map (ctx=0xbe0840) at test/librbd/fsx.c:379
#11 0x000000000075af33 in __librbd_flatten (ctx=0xbe0840) at test/librbd/fsx.c:559
#12 0x000000000075af4d in librbd_flatten (ctx=0xbe0840) at test/librbd/fsx.c:565
#13 0x000000000075d933 in do_flatten () at test/librbd/fsx.c:1622
#14 0x000000000075dfa8 in test () at test/librbd/fsx.c:1774
#15 0x000000000075edbc in main (argc=2, argv=0x7fff0c527d38) at test/librbd/fsx.c:2300

fsx command (which ran into the deadlock after op 57807) was:

ceph_test_librbd_fsx -d -W -R -p 100 -r 1 -w 1 -t 1 -h 1 -l 250000000 -S 773 -N 100000 rbd foo7

#10 Updated by Josh Durgin over 3 years ago

  • Assignee set to Jason Dillaman

#11 Updated by Jason Dillaman over 3 years ago

  • Status changed from Verified to In Progress

#12 Updated by Jason Dillaman over 3 years ago

  • Status changed from In Progress to Need Review
  • Backport set to hammer,firefly

#13 Updated by Jason Dillaman over 3 years ago

  • Backport changed from hammer,firefly to hammer,firefly,dumpling

#14 Updated by Josh Durgin over 3 years ago

  • Status changed from Need Review to Pending Backport
  • Backport changed from hammer,firefly,dumpling to hammer,firefly

#15 Updated by Loic Dachary over 3 years ago

#18 Updated by Loic Dachary over 3 years ago

  • Severity changed from 3 - minor to 2 - major

#20 Updated by Loic Dachary over 3 years ago

  • Status changed from Pending Backport to Resolved
  • Regression set to No

Also available in: Atom PDF