Project

General

Profile

Bug #22833

rgw: crash in RGWSyncTraceManager::finish_node()

Added by Yehuda Sadeh over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Target version:
-
Start date:
01/30/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Reported by mkogan:

This abort occurred on master branch after sync of 2692 objects.

cosbench is writing s3 , 4mb obj workload to c1, c2 is syncing

$ ../src/mrun c2 radosgw-admin sync status
2018-01-29 13:53:26.205 7fc7075243c0 -1 WARNING: all dangerous and experimental features are enabled.
2018-01-29 13:53:26.205 7fc7075243c0 -1 WARNING: all dangerous and experimental features are enabled.
2018-01-29 13:53:26.207 7fc7075243c0 -1 WARNING: all dangerous and experimental features are enabled.
          realm a720ab67-014d-47f7-b1df-2a39c14e4aca (gold)
      zonegroup 2b112a20-53b1-407e-b732-2e962c11d7a5 (us)
           zone c5488b04-0cd7-4682-8067-f5f26ce64dff (us-west)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 1dce771c-47f1-42c7-8deb-de1798034732 (us-east)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        oldest incremental change not applied: 2018-01-29 12:23:36.0.715214

from radosgw.8001.log of c2 :
(I will rebuild with debug info tomorrow and reproduce with better debug info)

  -10> 2018-01-29 12:23:26.628 7f42a08b3700 20 RGWDataChangesLog::add_entry() bucket.name=s3testqwer011 shard_id=-1 now=2018-01-29 12:23:26.0.629662s cur_expiration=2018-01-29 12:23:36.0
    -9> 2018-01-29 12:23:26.629 7f42998a5700 20 cr:s=0x558091170f00:op=0x5580948b8d00:19RGWFetchRemoteObjCR: operate()
    -8> 2018-01-29 12:23:26.629 7f42998a5700 20 cr:s=0x558091170f00:op=0x5580948b8d00:19RGWFetchRemoteObjCR: operate()
    -7> 2018-01-29 12:23:26.629 7f42998a5700 20 cr:s=0x558091170f00:op=0x5580948b8d00:19RGWFetchRemoteObjCR: operate()
    -6> 2018-01-29 12:23:26.629 7f42998a5700 20 cr:s=0x558091170f00:op=0x5580948b8d00:19RGWFetchRemoteObjCR: operate()
    -5> 2018-01-29 12:23:26.629 7f42998a5700 20 cr:s=0x558091170f00:op=0x55809541e400:26RGWBucketSyncSingleEntryCRINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE11rgw_obj_keyE: oper
    -4> 2018-01-29 12:23:26.629 7f42998a5700 10 RGW-SYNC:data:sync:shard[33]:entry[s3testqwer011:1dce771c-47f1-42c7-8deb-de1798034732.4146.1]:bucket[s3testqwer011:1dce771c-47f1-42c7-8deb-
    -3> 2018-01-29 12:23:26.629 7f42998a5700  5 data sync: Sync:1dce771c:data:Object:s3testqwer011:1dce771c-47f1-42c7-8deb-de1798034732.4146.1/myobjects30273[0]:done
    -2> 2018-01-29 12:23:26.629 7f42998a5700 20 cr:s=0x558091170f00:op=0x55809541e400:26RGWBucketSyncSingleEntryCRINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE11rgw_obj_keyE: oper
    -1> 2018-01-29 12:23:26.630 7f42b7af3700  1 -- 192.168.39.252:0/3791001179 <== osd.0 192.168.39.252:6806/4684 33033 ==== osd_op_reply(31213 .dir.1dce771c-47f1-42c7-8deb-de1798034732.4
     0> 2018-01-29 12:23:26.634 7f42998a5700 -1 *** Caught signal (Aborted) **
 in thread 7f42998a5700 thread_name:data-sync

 ceph version 13.0.0-5206-g5f01bcddff (5f01bcddffa327334652afb0d55ae26009e11d5c) mimic (dev)
 1: (()+0x1f0090) [0x55808d29f090]
 2: (()+0x12af0) [0x7f42c92d9af0]
 3: (gsignal()+0xcb) [0x7f42bdf4166b]
 4: (abort()+0x141) [0x7f42bdf43381]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f42be8eb025]
 6: (()+0x8fc16) [0x7f42be8e8c16]
 7: (()+0x8fc61) [0x7f42be8e8c61]
 8: (()+0x8fea4) [0x7f42be8e8ea4]
 9: (std::__throw_system_error(int)+0x8e) [0x7f42be914f7e]
 10: (()+0x52e286) [0x55808d5dd286]
 11: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x55808d24bdb6]
 12: (std::_Sp_counted_ptr<RGWSyncTraceNode*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x106) [0x55808d5e0226]
 13: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x55808d24bdb6]
 14: (RGWSyncTraceManager::finish_node(RGWSyncTraceNode*)+0x17b) [0x55808d5dd20b]
 15: (RGWBucketSyncSingleEntryCR<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, rgw_obj_key>::~RGWBucketSyncSingleEntryCR()+0x23a) [0x55808d59c8da]
 16: (RefCountedObject::put() const+0x27a) [0x55808d2c8faa]
 17: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x1e9) [0x55808d355999]
 18: (RGWCoroutinesManager::run(std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x1ef) [0x55808d35910f]
 19: (RGWCoroutinesManager::run(RGWCoroutine*)+0x9b) [0x55808d35a0ab]
 20: (RGWRemoteDataLog::run_sync(int)+0x209) [0x55808d58caa9]
 21: (RGWDataSyncProcessorThread::process()+0x46) [0x55808d40ebb6]
 22: (RGWRadosThread::Worker::entry()+0x230) [0x55808d3acaf0]
 23: (()+0x761b) [0x7f42c92ce61b]
 24: (clone()+0x3f) [0x7f42be02198f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

....

#12 0x00007f9810525e97 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x55623dde4620, tinfo=tinfo@entry=0x7f9810810ba0 <typeinfo for std::system_error>,
    dest=dest@entry=0x7f9810551c50 <std::system_error::~system_error()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:88
#13 0x00007f9810551f7e in std::__throw_system_error (__i=__i@entry=35) at ../../../../../libstdc++-v3/src/c++11/system_error.cc:81
#14 0x000055623b592196 in std::__shared_mutex_pthread::lock (this=0x55623d35ee08) at /usr/include/c++/7/shared_mutex:107
#15 std::shared_timed_mutex::lock (this=0x55623d35ee08) at /usr/include/c++/7/shared_mutex:378
#16 ceph::shunique_lock<std::shared_timed_mutex>::lock (this=<synthetic pointer>) at /mnt/vdb1/src-git/ceph--ceph-master/src/common/shunique_lock.h:157
#17 ceph::shunique_lock<std::shared_timed_mutex>::shunique_lock (m=..., this=<synthetic pointer>) at /mnt/vdb1/src-git/ceph--ceph-master/src/common/shunique_lock.h:65
#18 RGWSyncTraceManager::finish_node (this=0x55623d35ee00, node=0x55623d9d8d80) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_sync_trace.cc:260
#19 0x000055623b200db6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55623ea437d0) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#20 0x000055623b595106 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55623d9d8f18, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:684
#21 std::__shared_ptr<RGWSyncTraceNode, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55623d9d8f10, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:1123
#22 std::shared_ptr<RGWSyncTraceNode>::~shared_ptr (this=0x55623d9d8f10, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:93
#23 RGWSyncTraceNode::~RGWSyncTraceNode (this=<optimized out>, __in_chrg=<optimized out>) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_sync_trace.h:32
#24 std::_Sp_counted_ptr<RGWSyncTraceNode*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:376
#25 0x000055623b200db6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55623d8cafc0) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#26 0x000055623b59211b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::operator= (__r=..., this=0x55623d4680a8) at /usr/include/c++/7/bits/shared_ptr_base.h:703
#27 std::__shared_ptr<RGWSyncTraceNode, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x55623d4680a0) at /usr/include/c++/7/bits/shared_ptr_base.h:1034
#28 std::shared_ptr<RGWSyncTraceNode>::operator= (this=0x55623d4680a0) at /usr/include/c++/7/bits/shared_ptr.h:93
#29 boost::circular_buffer<std::shared_ptr<RGWSyncTraceNode>, std::allocator<std::shared_ptr<RGWSyncTraceNode> > >::replace (this=0x55623d35ee80, item=..., pos=0x55623d4680a0)
    at /mnt/vdb1/src-git/ceph--ceph-master/build/boost/include/boost/circular_buffer/base.hpp:2396
#30 boost::circular_buffer<std::shared_ptr<RGWSyncTraceNode>, std::allocator<std::shared_ptr<RGWSyncTraceNode> > >::push_back_impl<std::shared_ptr<RGWSyncTraceNode> const&> (
    item=std::shared_ptr (count 3, weak 0) 0x55623ed51380, this=0x55623d35ee80) at /mnt/vdb1/src-git/ceph--ceph-master/build/boost/include/boost/circular_buffer/base.hpp:1421
#31 boost::circular_buffer<std::shared_ptr<RGWSyncTraceNode>, std::allocator<std::shared_ptr<RGWSyncTraceNode> > >::push_back (item=..., this=0x55623d35ee80)
    at /mnt/vdb1/src-git/ceph--ceph-master/build/boost/include/boost/circular_buffer/base.hpp:1471
#32 RGWSyncTraceManager::finish_node (this=0x55623d35ee00, node=<optimized out>) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_sync_trace.cc:270
#33 0x000055623b551f3a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5562408f3d40) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#34 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55623e768c68, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:684
#35 std::__shared_ptr<RGWSyncTraceNode, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55623e768c60, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:1123
#36 std::shared_ptr<RGWSyncTraceNode>::~shared_ptr (this=0x55623e768c60, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:93
#37 RGWRunBucketSyncCoroutine::~RGWRunBucketSyncCoroutine (this=0x55623e768000, __in_chrg=<optimized out>) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_data_sync.cc:966
#38 RGWRunBucketSyncCoroutine::~RGWRunBucketSyncCoroutine (this=0x55623e768000, __in_chrg=<optimized out>) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_data_sync.cc:970
#39 0x000055623b27df2a in RefCountedObject::put (this=0x55623e768000) at /mnt/vdb1/src-git/ceph--ceph-master/src/common/RefCountedObj.h:58
#40 0x000055623b30a8e9 in RGWCoroutinesStack::operate (this=0x55623ddbc2d0, _env=_env@entry=0x7f97eb4ddac0) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_coroutine.cc:205
#41 0x000055623b30e05f in RGWCoroutinesManager::run (this=this@entry=0x55623d406b18, stacks=std::__cxx11::list = {...}) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_coroutine.cc:484
#42 0x000055623b30effb in RGWCoroutinesManager::run (this=this@entry=0x55623d406b18, op=0x55623d3a3800) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_coroutine.cc:623
#43 0x000055623b5419b9 in RGWRemoteDataLog::run_sync (this=this@entry=0x55623d406b18, num_shards=<optimized out>) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_data_sync.cc:1750
#44 0x000055623b3c3af6 in RGWDataSyncStatusManager::run (this=0x55623d406a48) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_data_sync.h:347
#45 RGWDataSyncProcessorThread::process (this=0x55623d406a00) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_rados.cc:3237
#46 0x000055623b361a30 in RGWRadosThread::Worker::entry (this=0x55623d4b52c0) at /mnt/vdb1/src-git/ceph--ceph-master/src/rgw/rgw_rados.cc:3043
#47 0x00007f981af0b61b in start_thread (arg=0x7f97eb4e2700) at pthread_create.c:465
#48 0x00007f980fc5e98f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


Related issues

Related to rgw - Backport #23018: rgw multisite sync tracing feature Need More Info

History

#1 Updated by Yehuda Sadeh over 1 year ago

  • Priority changed from Normal to Urgent
  • Backport set to luminous

#2 Updated by Yehuda Sadeh over 1 year ago

  • Status changed from New to Need Review

#3 Updated by Casey Bodley over 1 year ago

  • Status changed from Need Review to Testing

#4 Updated by Orit Wasserman over 1 year ago

  • Status changed from Testing to Resolved

#5 Updated by Casey Bodley over 1 year ago

#6 Updated by Nathan Cutler over 1 year ago

  • Backport deleted (luminous)

Luminous backport to be done in #23018

Also available in: Atom PDF