Bug #22001

open

multisite: deadlock in RGWSyncTraceManager::finish_node

Added by Tianshan Qu over 6 years ago. Updated over 6 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

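complete_nodes is a boost::circular_buffer.
RGWSyncTraceManager::finish_node() takes the manager's lock and pushes the node onto complete_nodes. When the buffer is full, the push overwrites the oldest entry, which can drop the last reference to that node and destroy it.
~RGWSyncTraceNode then releases its reference to the parent node. If that was the parent's last reference, finish_node() is called again on the same thread while the lock is still held, which causes the deadlock.

Example stack trace of the deadlock: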

#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fbf9ef1f7ec in boost::condition_variable::wait (this=this@entry=0x7fbfa0d32f90, m=...) at /root/ceph/build/boost/include/boost/thread/pthread/condition_variable.hpp:76
#2 0x00007fbf9ef1b86f in lock (this=0x7fbfa0d32f08) at /root/ceph/build/boost/include/boost/thread/pthread/shared_mutex.hpp:294
#3 lock (this=0x7fbf6e42a800) at /root/ceph/src/common/shunique_lock.h:157
#4 shunique_lock (m=..., this=0x7fbf6e42a800) at /root/ceph/src/common/shunique_lock.h:65
#5 RGWSyncTraceManager::finish_node (this=0x7fbfa0d32f00, node=0x7fbfa103b570) at /root/ceph/src/rgw/rgw_sync_trace.cc:263
#6 0x00007fbf9eb76d89 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fbfa12beea0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
#7 0x00007fbf9ef1f2cc in ~__shared_count (this=0x7fbfa103b858, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:546
#8 ~__shared_ptr (this=0x7fbfa103b850, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:781
#9 ~shared_ptr (this=0x7fbfa103b850, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr.h:93
#10 ~RGWSyncTraceNode (this=<optimized out>, __in_chrg=<optimized out>) at /root/ceph/src/rgw/rgw_sync_trace.h:31
#11 std::_Sp_counted_ptr<RGWSyncTraceNode*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:290
#12 0x00007fbf9eb76d89 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fbfa0f3eca0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
#13 0x00007fbf9ef1ba54 in operator= (__r=..., this=0x7fbfa0ef0028) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:565
#14 operator= (this=0x7fbfa0ef0020) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:728
#15 operator= (this=0x7fbfa0ef0020) at /usr/include/c++/4.8.2/bits/shared_ptr.h:93
#16 replace (this=<optimized out>, item=..., pos=0x7fbfa0ef0020) at /root/ceph/build/boost/include/boost/circular_buffer/base.hpp:2396
#17 push_back_impl<std::shared_ptr<RGWSyncTraceNode> const&> (item=std::shared_ptr (count 3, weak 0) 0x7fbfa1968d00, this=0x7fbfa0d33080) at /root/ceph/build/boost/include/boost/circular_buffer/base.hpp:1421
#18 push_back (item=..., this=0x7fbfa0d33080) at /root/ceph/build/boost/include/boost/circular_buffer/base.hpp:1471
#19 RGWSyncTraceManager::finish_node (this=0x7fbfa0d32f00, node=<optimized out>) at /root/ceph/src/rgw/rgw_sync_trace.cc:273
#20 0x00007fbf9eb76d89 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fbfa2d8b1a0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
#21 0x00007fbf9eef0368 in ~__shared_count (this=0x7fbfa210f7b0, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:546
#22 ~__shared_ptr (this=0x7fbfa210f7a8, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:781
#23 ~shared_ptr (this=0x7fbfa210f7a8, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr.h:93
#24 ~RGWBucketSyncSingleEntryCR (this=0x7fbfa210f000, __in_chrg=<optimized out>) at /root/ceph/src/rgw/rgw_data_sync.cc:2283
#25 RGWBucketSyncSingleEntryCR<std::string, rgw_obj_key>::~RGWBucketSyncSingleEntryCR (this=0x7fbfa210f000, __in_chrg=<optimized out>) at /root/ceph/src/rgw/rgw_data_sync.cc:2283
#26 0x00007fbf9ebea5aa in RefCountedObject::put (this=this@entry=0x7fbfa210f000) at /root/ceph/src/common/RefCountedObj.h:58
#27 0x00007fbf9ec5d6cf in RGWCoroutinesStack::operate (this=0x7fbfa1639ea0, _env=_env@entry=0x7fbf6e42ad90) at /root/ceph/src/rgw/rgw_coroutine.cc:205
#28 0x00007fbf9ec5facb in RGWCoroutinesManager::run (this=this@entry=0x7fbfa0d209c8, stacks=std::list = {...}) at /root/ceph/src/rgw/rgw_coroutine.cc:485
#29 0x00007fbf9ec60810 in RGWCoroutinesManager::run (this=this@entry=0x7fbfa0d209c8, op=0x7fbfa12d0300) at /root/ceph/src/rgw/rgw_coroutine.cc:624
#30 0x00007fbf9eedd4d7 in RGWRemoteDataLog::run_sync (this=this@entry=0x7fbfa0d209c8, num_shards=<optimized out>) at /root/ceph/src/rgw/rgw_data_sync.cc:1740
#31 0x00007fbf9ed59db6 in run (this=0x7fbfa0d20970) at /root/ceph/src/rgw/rgw_data_sync.h:326
#32 RGWDataSyncProcessorThread::process (this=0x7fbfa0d20940) at /root/ceph/src/rgw/rgw_rados.cc:4481
#33 0x00007fbf9ecc7b53 in RGWRadosThread::Worker::entry (this=0x7fbfa0d2c2b0) at /root/ceph/src/rgw/rgw_rados.cc:4289
#34 0x00007fbf9d4c7dc5 in start_thread (arg=0x7fbf6e42d700) at pthread_create.c:308
#35 0x00007fbf9226721d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

#1

Updated by Tianshan Qu over 6 years ago

I don't see any need for a node to keep a reference to its parent, so simply removing that reference fixes the problem.
Fix: https://github.com/ceph/ceph/pull/18677
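
To make the re-entrancy easier to see, here is a minimal single-threaded sketch of the pattern described above. It is not the RGW code: `Manager`, `Node`, `add_node`, `run_scenario` and the `keep_parent` switch are hypothetical names, a `std::deque` with an explicit `pop_front()` stands in for the circular buffer, and a plain `bool` stands in for the lock so that the second `finish_node()` call is counted instead of actually hanging.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <memory>
#include <utility>

struct Node;
using NodeRef = std::shared_ptr<Node>;

struct Node {
    int id;
    NodeRef parent;  // released in ~Node; may drop the parent's last handle
};

// Hypothetical stand-in for the trace manager (not the real RGW class).
class Manager {
public:
    Manager(std::size_t capacity, bool keep_parent)
        : capacity_(capacity), keep_parent_(keep_parent) {}
    ~Manager() { shutting_down_ = true; }  // ignore deleter callbacks during teardown

    // Returns a handle whose deleter reports the node as finished.
    // The manager keeps ownership of the node itself.
    NodeRef add_node(int id, NodeRef parent) {
        if (!keep_parent_) parent.reset();  // models "do not preserve parent"
        std::unique_ptr<Node> owned(new Node{id, std::move(parent)});
        Node* raw = owned.get();
        live_.emplace(id, std::move(owned));
        return NodeRef(raw, [this](Node* n) { finish_node(n); });
    }

    int reentrant_attempts() const { return reentrant_attempts_; }

private:
    void finish_node(Node* n) {
        if (shutting_down_) return;
        if (locked_) {             // a real non-recursive lock would block forever here
            ++reentrant_attempts_;
            return;
        }
        locked_ = true;
        auto it = live_.find(n->id);
        assert(it != live_.end());
        if (complete_.size() == capacity_) {
            complete_.pop_front();  // destroys the oldest node while "locked"
        }
        complete_.push_back(std::move(it->second));
        live_.erase(it);
        locked_ = false;
    }

    bool shutting_down_ = false;
    bool locked_ = false;
    int reentrant_attempts_ = 0;
    std::map<int, std::unique_ptr<Node>> live_;
    std::deque<std::unique_ptr<Node>> complete_;
    std::size_t capacity_;
    bool keep_parent_;
};

// Finishes a child, then one more node so the child is evicted from a
// buffer of capacity 1. Returns how often finish_node() was re-entered.
int run_scenario(bool keep_parent) {
    Manager mgr(1, keep_parent);
    NodeRef parent = mgr.add_node(1, nullptr);
    NodeRef child = mgr.add_node(2, parent);
    NodeRef other = mgr.add_node(3, nullptr);
    parent.reset();  // with keep_parent, the child still holds the parent
    child.reset();
    other.reset();   // buffer is full: the oldest finished node is destroyed
    return mgr.reentrant_attempts();
}
```

In this model, `run_scenario(true)` re-enters `finish_node()` once, because evicting the child releases the parent's last handle while the manager is still "locked". `run_scenario(false)` never re-enters, which mirrors the idea of the fix: a node that holds no parent reference cannot trigger a second `finish_node()` from its destructor.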
