
Bug #17252

[Librados] Deadlock on RadosClient::watch_flush

Added by Xiaoxi Chen over 7 years ago. Updated over 4 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
Dev Interfaces
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
librados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://github.com/ceph/ceph/blob/v10.2.2/src/librados/RadosClient.cc#L341 — in watch_flush, the thread waits on a condition while holding the "lock". The condition will only be signaled by the finisher thread, but in some cases the finisher thread needs to take the same "lock", resulting in a deadlock.

This bug can block the cinder-volume service and leave it waiting indefinitely. It also affects anyone using the library as suggested in http://docs.ceph.com/docs/jewel/rbd/librbdpy/.
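
For illustration, here is a minimal, self-contained sketch of the same lock ordering using standard C++ threads. The names (client_lock, mylock, flush_done) are hypothetical and this is not the actual librados code; it just models the pattern. Running it hangs: the main thread waits forever with client_lock held, and the finisher thread blocks trying to take it.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex client_lock;        // stands in for RadosClient::lock
std::mutex mylock;             // local mutex used for the flush wait
std::condition_variable cond;  // stands in for the watch_flush Cond
bool flush_done = false;

// Analogue of the finisher thread: the queued completion needs
// client_lock (e.g. via RadosClient::put()) before it can ever
// signal the waiter.
void finisher_thread() {
  std::lock_guard<std::mutex> g(client_lock);  // blocks: main holds it
  {
    std::lock_guard<std::mutex> g2(mylock);
    flush_done = true;
  }
  cond.notify_all();  // never reached
}

int main() {
  // Analogue of shutdown() -> watch_flush(): the client lock is held
  // across the whole wait.
  std::unique_lock<std::mutex> l(client_lock);
  std::thread fin(finisher_thread);

  std::unique_lock<std::mutex> ml(mylock);
  while (!flush_done)
    cond.wait(ml);  // releases mylock, but client_lock stays held: deadlock

  fin.join();
  std::cout << "flush complete (never printed)" << std::endl;
}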

See the backtrace below for an instance: Thread 1 is the thread that called watch_flush, while Thread 7 is the finisher thread.

Thread 1 (Thread 0x7f193783a740 (LWP 9216)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f1928165ff1 in Wait (mutex=..., this=0x7ffebf02aa40) at ./common/Cond.h:56
#2 librados::RadosClient::watch_flush (this=this@entry=0x30a86c0) at librados/RadosClient.cc:371
#3 0x00007f19281663b0 in librados::RadosClient::shutdown (this=0x30a86c0) at librados/RadosClient.cc:341
#4 0x00007f192813872c in rados_shutdown (cluster=0x30a86c0) at librados/librados.cc:2620
#5 0x00007f1931874ba2 in __pyx_pf_5rados_5Rados_6shutdown (__pyx_v_self=0x7f19199ae460, __pyx_v_self=0x7f19199ae460) at /tmp/buildd/ceph-10.2.2/src/build/rados.c:6978
#6 __pyx_pw_5rados_5Rados_7shutdown (__pyx_v_self=0x7f19199ae460, unused=<optimized out>) at /tmp/buildd/ceph-10.2.2/src/build/rados.c:6931
#7 0x000000000049a3b5 in PyEval_EvalFrameEx ()
#8 0x00000000004a090c in PyEval_EvalCodeEx ()
#9 0x000000000049ab45 in PyEval_EvalFrameEx ()
#10 0x00000000004a1c9a in ?? ()
#11 0x00000000004dfe94 in ?? ()
#12 0x00000000004dcd81 in PyObject_CallFunctionObjArgs ()
#13 0x000000000049b20c in PyEval_EvalFrameEx ()
#14 0x00000000004a1c9a in ?? ()
#15 0x0000000000505f96 in PyObject_Call ()
#16 0x000000000049b07a in PyEval_EvalFrameEx ()
#17 0x00000000004a090c in PyEval_EvalCodeEx ()
#18 0x000000000049ab45 in PyEval_EvalFrameEx ()

Thread 7 (Thread 0x7f18eafef700 (LWP 11382)):
#0 lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f1937414657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f1937414480 in __GI___pthread_mutex_lock (mutex=0x30a8cf8) at ../nptl/pthread_mutex_lock.c:79
#3 0x00007f1928216a68 in Mutex::Lock (this=0x30a8ce8, no_lockdep=<optimized out>) at common/Mutex.cc:110
#4 0x00007f1928163488 in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:115
#5 librados::RadosClient::put (this=0x30a86c0) at librados/RadosClient.cc:623
#6 0x00007f1928137c01 in librados::Rados::shutdown (this=0x2d6ff98) at librados/librados.cc:2088
#7 0x00007f191c4b93e8 in ~C_UnwatchAndFlush (this=0x2d6ff90, __in_chrg=<optimized out>) at librbd/ImageWatcher.cc:39
#8 librbd::(anonymous namespace)::C_UnwatchAndFlush::~C_UnwatchAndFlush (this=0x2d6ff90, __in_chrg=<optimized out>) at librbd/ImageWatcher.cc:39
#9 0x00007f192816e73d in librados::C_AioSafe::finish (this=0x7f1910009890, r=<optimized out>) at librados/AioCompletionImpl.h:199
#10 0x00007f1928157ec9 in Context::complete (this=0x7f1910009890, r=<optimized out>) at ./include/Context.h:64
#11 0x00007f1928239786 in Finisher::finisher_thread_entry (this=0x30a8e70) at common/Finisher.cc:68
#12 0x00007f1937412182 in start_thread (arg=0x7f18eafef700) at pthread_create.c:312
#13 0x00007f193713f47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

History

#1 Updated by Xiaoxi Chen over 7 years ago

The fix should be to pass the lock into watch_flush and use that lock, rather than the local variable "mylock", for the condition wait.
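
A hedged sketch of that idea, applied to the simplified model in the description (hypothetical signature, not an actual patch): waiting on the caller's lock means the condition wait releases it, so the finisher thread can acquire it, drain the pending completions, and signal.

#include <condition_variable>
#include <mutex>

std::condition_variable cond;
bool flush_done = false;

// watch_flush() now takes the caller's (already locked) client lock and
// waits on it directly instead of a local "mylock";
// condition_variable::wait releases the mutex for the duration of the
// wait, so the finisher thread can take it and eventually signal.
void watch_flush(std::unique_lock<std::mutex>& client_lock) {
  while (!flush_done)
    cond.wait(client_lock);  // client lock released here while blocked
}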

#2 Updated by Jason Dillaman over 7 years ago

Is cinder explicitly closing the rados connection before closing the image?

#3 Updated by Xiaoxi Chen over 7 years ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Haomai Wang to Xiaoxi Chen
  • Priority changed from Immediate to High

#4 Updated by Loïc Dachary over 7 years ago

  • Target version deleted (v10.2.3)

#5 Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category changed from librados to Dev Interfaces
  • Component(RADOS) librados added

#6 Updated by Greg Farnum over 5 years ago

  • Status changed from Fix Under Review to 12
  • Assignee deleted (Xiaoxi Chen)

#7 Updated by Greg Farnum over 4 years ago

  • Status changed from 12 to Can't reproduce

This hasn't come up again, and it was fixed in the only user.
