Bug #20776


Possible deadlock during CephContextServiceThread shutdown

Added by Jason Dillaman almost 7 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: kraken, jewel
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

If the thread is requested to stop before it starts running, it can deadlock waiting on a condition variable. The "while (1)" loop should be replaced with a "while (!_exit_thread)" and the lock should be moved outside the loop.
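For illustration, here is a minimal sketch of the race using standard-library primitives rather than Ceph's Mutex/Cond wrappers (the names and structure are illustrative, not the actual common/ceph_context.cc code):

#include <condition_variable>
#include <mutex>

std::mutex lock_;
std::condition_variable cond_;
bool exit_thread_ = false;

// Buggy shape: the exit flag is only checked *after* an untimed wait,
// so a stop request that arrives before the thread first blocks is
// never seen, and its wakeup is lost.
void entry_buggy() {
  while (true) {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l);        // blocks forever if exit_thread() already ran
    if (exit_thread_)
      break;
    // ... periodic heartbeat work ...
  }
}

// Proposed shape: take the lock once and test the flag before each wait.
// Because exit_thread() sets the flag under the same lock, the stop
// request can no longer slip between the check and the wait.
void entry_fixed() {
  std::unique_lock<std::mutex> l(lock_);
  while (!exit_thread_) {
    cond_.wait(l);
    // ... periodic heartbeat work ...
  }
}

void exit_thread() {
  std::lock_guard<std::mutex> l(lock_);
  exit_thread_ = true;
  cond_.notify_one();
}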

I'm not sure yet whether this is made worse by config; however, if I do something along the lines of:

seq 100 | xargs -P100 -n1 bash -c 'exec rbd.original showmapped'

I'll end up with at least one of the invocations deadlocked as shown below. Doing the same on our v10.2.7 clusters seems to work fine.

The stack traces according to GDB look something like this, at least for all the ones I've looked at:
warning: the debug information found in "/usr/bin/rbd" does not match "/usr/bin/rbd.original" (CRC mismatch).
# Yes - we've diverted rbd to rbd.original with a shell-wrapper around it

[New LWP 285438]
[New LWP 285439]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fbbea58798d in pthread_join (threadid=140444952844032, thread_return=thread_return@entry=0x0) at pthread_join.c:90
90      pthread_join.c: No such file or directory.
Thread 3 (Thread 0x7fbbe3865700 (LWP 285439)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055a852fcf896 in Cond::Wait (mutex=..., this=0x55a85cdeb258) at ./common/Cond.h:56
#2  CephContextServiceThread::entry (this=0x55a85cdeb1c0) at common/ceph_context.cc:101
#3  0x00007fbbea5866ba in start_thread (arg=0x7fbbe3865700) at pthread_create.c:333
#4  0x00007fbbe80743dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 2 (Thread 0x7fbbe4804700 (LWP 285438)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055a852fb297b in ceph::log::Log::entry (this=0x55a85cd98830) at log/Log.cc:457
#2  0x00007fbbea5866ba in start_thread (arg=0x7fbbe4804700) at pthread_create.c:333
#3  0x00007fbbe80743dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 1 (Thread 0x7fbbfda1e100 (LWP 285436)):
#0  0x00007fbbea58798d in pthread_join (threadid=140444952844032, thread_return=thread_return@entry=0x0) at pthread_join.c:90
#1  0x000055a852fb6270 in Thread::join (this=this@entry=0x55a85cdeb1c0, prval=prval@entry=0x0) at common/Thread.cc:171
#2  0x000055a852fca060 in CephContext::join_service_thread (this=this@entry=0x55a85cd95780) at common/ceph_context.cc:637
#3  0x000055a852fcc2c7 in CephContext::~CephContext (this=0x55a85cd95780, __in_chrg=<optimized out>) at common/ceph_context.cc:507
#4  0x000055a852fcc9bc in CephContext::put (this=0x55a85cd95780) at common/ceph_context.cc:578
#5  0x000055a852eac2b1 in boost::intrusive_ptr<CephContext>::~intrusive_ptr (this=0x7ffef7ef5060, __in_chrg=<optimized out>) at /usr/include/boost/smart_ptr/intrusive_ptr.hpp:97
#6  main (argc=<optimized out>, argv=<optimized out>) at tools/rbd/rbd.cc:17 
#1

Updated by Kjetil Joergensen almost 7 years ago

The solution proposed by Jason appears to solve my problem. (I haven't even begun to unravel what else it may or may not break in the process, nor have I tested whether the mere act of re-compiling v10.2.7 changes anything.)

diff --git a/src/common/ceph_context.cc b/src/common/ceph_context.cc
index 38a4e20123..73f091e487 100644
--- a/src/common/ceph_context.cc
+++ b/src/common/ceph_context.cc
@@ -91,9 +91,8 @@ public:

   void *entry()
   {
-    while (1) {
-      Mutex::Locker l(_lock);
-
+    Mutex::Locker l(_lock);
+    while (!_exit_thread) {
       if (_cct->_conf->heartbeat_interval) {
         utime_t interval(_cct->_conf->heartbeat_interval, 0);
         _cond.WaitInterval(_cct, _lock, interval);

I'm also not entirely clear on why this became a problem now, given that src/common/ceph_context.cc appears largely unchanged between v10.2.7 and v10.2.9; something else must be tickling this.

#2

Updated by Kjetil Joergensen almost 7 years ago

OK - to reproduce, I'll revise the command to: seq 100 | xargs -P100 -n1 bash -c 'exec rbd --heartbeat_interval=0 showmapped'. heartbeat_interval defaults to 5; somewhere in my config it's set to 0. A heartbeat_interval > 0 saves us here: https://github.com/ceph/ceph/blob/v10.2.9/src/common/ceph_context.cc#L97
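The relevant branch, paraphrased with annotations (the comments are mine, not in the source):

if (_cct->_conf->heartbeat_interval) {
  utime_t interval(_cct->_conf->heartbeat_interval, 0);
  // Timed wait: returns after at most one interval even if the
  // shutdown Signal() was delivered before we started waiting.
  _cond.WaitInterval(_cct, _lock, interval);
} else {
  // Untimed wait: a Signal() sent before this point is lost, and
  // since _exit_thread is only checked after the wait returns,
  // the thread blocks forever.
  _cond.Wait(_lock);
}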

Which means I have a workaround/fix that doesn't involve patching up rbd/libceph.
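For reference, one plausible ceph.conf form of that workaround (the section placement is my assumption; heartbeat_interval itself is the real option, defaulting to 5):

[global]
# Any value > 0 keeps the service thread's wait timed, so a missed
# shutdown signal only delays process exit by one interval.
heartbeat_interval = 5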

#3

Updated by Sage Weil almost 3 years ago

  • Status changed from New to Closed