Bug #62298


multisite: FAILED ceph_assert(waiters.find(opaque) == waiters.end())

Added by Casey Bodley 9 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
% Done:
0%

Source:
Tags:
multisite
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen often during realm reloads when sync threads are shutting down. Just after RGWContinuousLeaseCR succeeds in acquiring the lock, its call to yield wait() triggers this failed assertion:

example job from main: https://pulpito.ceph.com/cbodley-2023-08-03_02:45:05-rgw:multisite-main-distro-default-smithi/7359212
radosgw log: http://qa-proxy.ceph.com/teuthology/cbodley-2023-08-03_02:45:05-rgw:multisite-main-distro-default-smithi/7359212/remote/smithi022/log/rgw.c1.client.1.log.gz
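
For context, the assertion fires in RGWCompletionManager::wait_interval() when the same opaque pointer gets registered as a waiter twice. A simplified sketch of that site (paraphrased from src/rgw/rgw_coroutine.cc; the locking and timer-context details here are abbreviated and may not match the real code exactly):

void RGWCompletionManager::wait_interval(void *opaque, const utime_t& interval,
                                         void *user_info)
{
  std::lock_guard l{lock};
  // Each waiting stack registers under its opaque pointer. If a stack
  // calls wait() again before its previous wait was consumed -- e.g.
  // shutdown racing with RGWContinuousLeaseCR's yield wait() -- the old
  // entry is still present and this assertion fails:
  ceph_assert(waiters.find(opaque) == waiters.end());
  waiters[opaque] = user_info;
  timer.add_event_after(interval, new WaitContext(this, opaque));
}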

   -76> 2023-08-03T04:16:48.565+0000 7fb7e9dc1700  4 rgw realm reloader: Notification on realm, reconfiguration scheduled
   -75> 2023-08-03T04:16:48.565+0000 7fb7e8dbf700  1 rgw realm reloader: Pausing frontends for realm update...
   -74> 2023-08-03T04:16:48.565+0000 7fb7e8dbf700  4 frontend pausing connections...
   -73> 2023-08-03T04:16:48.565+0000 7fb7e8dbf700  4 frontend paused
   -72> 2023-08-03T04:16:48.565+0000 7fb7e8dbf700  4 rgw period pusher: paused for realm update
   -71> 2023-08-03T04:16:48.565+0000 7fb7e8dbf700  1 rgw realm reloader: Frontends paused
...
   -16> 2023-08-03T04:16:48.567+0000 7fb8f9a96700 20 rgw rados thread: run(stacks) returned r=-125
   -15> 2023-08-03T04:16:48.567+0000 7fb8f9a96700  0 RGW-SYNC:meta: ERROR: failed to fetch all metadata keys
   -14> 2023-08-03T04:16:48.569+0000 7fb8f8a94700  5 completion_mgr.get_next() returned ret=-125
   -13> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 rgw rados thread: cr:s=0x564a30f0e140:op=0x564a2e380e00:20RGWContinuousLeaseCR: operate()
   -12> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 rgw rados thread: cr:s=0x564a30f0e140:op=0x564a2d42f500:20RGWSimpleRadosLockCR: operate()
   -11> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 rgw rados thread: cr:s=0x564a30f0e140:op=0x564a2d42f500:20RGWSimpleRadosLockCR: operate()
   -10> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 run: stack=0x564a30f0e140 is io blocked
    -9> 2023-08-03T04:16:48.569+0000 7fb8f8a94700  5 completion_mgr.get_next() returned ret=-125
    -8> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 rgw rados thread: cr:s=0x564a30f0e140:op=0x564a2d42f500:20RGWSimpleRadosLockCR: operate()
    -7> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 rgw rados thread: cr:s=0x564a30f0e140:op=0x564a2d42f500:20RGWSimpleRadosLockCR: operate()
    -6> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 rgw rados thread: cr:s=0x564a30f0e140:op=0x564a2d42f500:20RGWSimpleRadosLockCR: operate()
    -5> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 rgw rados thread: cr:s=0x564a30f0e140:op=0x564a2d42f500:20RGWSimpleRadosLockCR: operate()
    -4> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 rgw rados thread: cr:s=0x564a30f0e140:op=0x564a2e380e00:20RGWContinuousLeaseCR: operate()
    -3> 2023-08-03T04:16:48.569+0000 7fb8f8a94700 20 cr:s=0x564a30f0e140:op=0x564a2e380e00:20RGWContinuousLeaseCR: successfully locked test-zone3.rgw.log:datalog.sync-status.shard.c293b9ad-7bae-4580-9b75-071d4981e59d.1:sync_lock
    -2> 2023-08-03T04:16:48.571+0000 7fb8c1f71700 20 failed to read header: end of stream
    -1> 2023-08-03T04:16:48.573+0000 7fb8f8a94700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-5282-g7d7218ee/rpm/el8/BUILD/ceph-18.0.0-5282-g7d7218ee/src/rgw/rgw_coroutine.cc: In function 'void RGWCompletionManager::wait_interval(void*, const utime_t&, void*)' thread 7fb8f8a94700 time 2023-08-03T04:16:48.570171+0000
/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-5282-g7d7218ee/rpm/el8/BUILD/ceph-18.0.0-5282-g7d7218ee/src/rgw/rgw_coroutine.cc: 117: FAILED ceph_assert(waiters.find(opaque) == waiters.end())

 ceph version 18.0.0-5282-g7d7218ee (7d7218ee3c305d4baf018191d7363988f5b798b0) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fb92740ee6b]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2ac031) [0x7fb92740f031]
 3: (RGWCompletionManager::wait_interval(void*, utime_t const&, void*)+0x224) [0x564a27bca864]
 4: (RGWCoroutinesStack::wait(utime_t const&)+0x28) [0x564a27bca8a8]
 5: (RGWContinuousLeaseCR::operate(DoutPrefixProvider const*)+0x384) [0x564a28132f94]
 6: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x18f) [0x564a27bcccbf]
 7: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x509) [0x564a27bcd499]
 8: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0x91) [0x564a27bce4b1]
 9: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x33c) [0x564a2816df1c]
 10: (RGWDataSyncProcessorThread::process(DoutPrefixProvider const*)+0x58) [0x564a27e9c9e8]
 11: (RGWRadosThread::Worker::entry()+0xb3) [0x564a27e638a3]
 12: /lib64/libpthread.so.0(+0x81ca) [0x7fb9245981ca]
 13: clone()

     0> 2023-08-03T04:16:48.577+0000 7fb8f8a94700 -1 *** Caught signal (Aborted) **
 in thread 7fb8f8a94700 thread_name:data-sync

 ceph version 18.0.0-5282-g7d7218ee (7d7218ee3c305d4baf018191d7363988f5b798b0) reef (dev)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fb9245a2cf0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fb92740eec5]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x2ac031) [0x7fb92740f031]
 6: (RGWCompletionManager::wait_interval(void*, utime_t const&, void*)+0x224) [0x564a27bca864]
 7: (RGWCoroutinesStack::wait(utime_t const&)+0x28) [0x564a27bca8a8]
 8: (RGWContinuousLeaseCR::operate(DoutPrefixProvider const*)+0x384) [0x564a28132f94]
 9: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x18f) [0x564a27bcccbf]
 10: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x509) [0x564a27bcd499]
 11: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0x91) [0x564a27bce4b1]
 12: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x33c) [0x564a2816df1c]
 13: (RGWDataSyncProcessorThread::process(DoutPrefixProvider const*)+0x58) [0x564a27e9c9e8]
 14: (RGWRadosThread::Worker::entry()+0xb3) [0x564a27e638a3]
 15: /lib64/libpthread.so.0(+0x81ca) [0x7fb9245981ca]
 16: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

A grep from a different run shows several successful realm reloads in between the crashes:

$ zgrep -e ceph_assert -e 'rgw realm reloader: Pausing frontends for realm update...' rgw.c2.client.0.log.gz 
2023-08-02T22:20:00.468+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:36:39.128+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:37:39.603+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:39:36.429+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:41:07.616+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:42:07.983+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:43:08.191+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:44:15.570+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:44:15.997+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:44:16.364+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T22:44:16.834+0000 7f7f3e8a6640  1 rgw realm reloader: Pausing frontends for realm update...
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.0.0-5291-gf936671d/rpm/el9/BUILD/ceph-18.0.0-5291-gf936671d/src/rgw/rgw_coroutine.cc: 117: FAILED ceph_assert(waiters.find(opaque) == waiters.end())
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7fb184b6107d]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7fb184b610d7]
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.0.0-5291-gf936671d/rpm/el9/BUILD/ceph-18.0.0-5291-gf936671d/src/rgw/rgw_coroutine.cc: 117: FAILED ceph_assert(waiters.find(opaque) == waiters.end())
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7fb184b6107d]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7fb184b610d7]
2023-08-02T23:49:19.733+0000 7f9f7b4b1640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T23:50:44.435+0000 7f9f7b4b1640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T23:50:45.545+0000 7f9f7b4b1640  1 rgw realm reloader: Pausing frontends for realm update...
2023-08-02T23:53:00.189+0000 7f9f7b4b1640  1 rgw realm reloader: Pausing frontends for realm update...
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.0.0-5291-gf936671d/rpm/el9/BUILD/ceph-18.0.0-5291-gf936671d/src/rgw/rgw_coroutine.cc: 117: FAILED ceph_assert(waiters.find(opaque) == waiters.end())
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7fa0b156107d]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7fa0b15610d7]
   -80> 2023-08-02T23:53:00.189+0000 7f9f7b4b1640  1 rgw realm reloader: Pausing frontends for realm update...
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.0.0-5291-gf936671d/rpm/el9/BUILD/ceph-18.0.0-5291-gf936671d/src/rgw/rgw_coroutine.cc: 117: FAILED ceph_assert(waiters.find(opaque) == waiters.end())
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7fa0b156107d]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7fa0b15610d7]


Related issues (1 total: 0 open, 1 closed)

Related to rgw - Bug #61461: multisite: hang when update and commit period (Resolved)

Actions #1

Updated by Shilpa MJ 9 months ago

  • Assignee set to Shilpa MJ
Actions #2

Updated by Casey Bodley 9 months ago

I was able to reproduce this locally by starting two zones and running 'period update --commit' several times. I reduced 'rgw sync lease period' to 10 to make the race more likely.
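
A minimal local setup along those lines (a sketch assuming a two-zone multisite cluster already exists; the option and command spellings below are the standard ceph.conf/radosgw-admin ones):

# ceph.conf on both gateways: shorten the sync lease to widen the race window
[client.rgw]
    rgw sync lease period = 10

# then trigger repeated realm reloads:
$ radosgw-admin period update --commit    # repeat several times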

Actions #3

Updated by Shilpa MJ 9 months ago

I removed the data sync fairness PR and tested on main. I see a different stack trace after running 'period update --commit' a few times.

2023-08-15T14:27:17.552-0400 7f9d772306c0 -1 /home/smanjara/ceph/src/common/RefCountedObj.cc: In function 'void ceph::common::RefCountedObject::_get() const' thread 7f9d772306c0 time 2023-08-15T14:27:17.531728-0400
/home/smanjara/ceph/src/common/RefCountedObj.cc: 36: FAILED ceph_assert(v > 1)

ceph version 18.0.0-5505-g21e174b7b6d (21e174b7b6d2baff00ad6f3e6e8ba52eb14ff98e) reef (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f9ead3ae2f9]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f9ead3ae500]
3: (ceph::common::RefCountedObject::_get() const+0x6e) [0x7f9ead38f528]
4: (RGWAioCompletionNotifier::~RGWAioCompletionNotifier()+0x73) [0x55d106e94fc1]
5: (RGWAioCompletionNotifier::~RGWAioCompletionNotifier()+0x9) [0x55d106e94fe3]
6: (ceph::common::RefCountedObject::put() const+0x11e) [0x7f9ead38f330]
7: (std::_Rb_tree<boost::intrusive_ptr<RGWAioCompletionNotifier>, boost::intrusive_ptr<RGWAioCompletionNotifier>, std::_Identity<boost::intrusive_ptr<RGWAioCompletionNotifier> >, std::less<boost::intrusive_ptr<RGWAioCompletionNotifier> >, std::allocator<boost::intrusive_ptr<RGWAioCompletionNotifier> > >::_M_drop_node(std::_Rb_tree_node<boost::intrusive_ptr<RGWAioCompletionNotifier> >*)+0x1a) [0x55d106e94e2e]
8: (std::_Rb_tree<boost::intrusive_ptr<RGWAioCompletionNotifier>, boost::intrusive_ptr<RGWAioCompletionNotifier>, std::_Identity<boost::intrusive_ptr<RGWAioCompletionNotifier> >, std::less<boost::intrusive_ptr<RGWAioCompletionNotifier> >, std::allocator<boost::intrusive_ptr<RGWAioCompletionNotifier> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<RGWAioCompletionNotifier> >*)+0x2a) [0x55d106e94e70]
9: (std::_Rb_tree<boost::intrusive_ptr<RGWAioCompletionNotifier>, boost::intrusive_ptr<RGWAioCompletionNotifier>, std::_Identity<boost::intrusive_ptr<RGWAioCompletionNotifier> >, std::less<boost::intrusive_ptr<RGWAioCompletionNotifier> >, std::allocator<boost::intrusive_ptr<RGWAioCompletionNotifier> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<RGWAioCompletionNotifier> >*)+0x1b) [0x55d106e94e61]
10: (std::_Rb_tree<boost::intrusive_ptr<RGWAioCompletionNotifier>, boost::intrusive_ptr<RGWAioCompletionNotifier>, std::_Identity<boost::intrusive_ptr<RGWAioCompletionNotifier> >, std::less<boost::intrusive_ptr<RGWAioCompletionNotifier> >, std::allocator<boost::intrusive_ptr<RGWAioCompletionNotifier> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<RGWAioCompletionNotifier> >*)+0x1b) [0x55d106e94e61]
11: (RGWCompletionManager::~RGWCompletionManager()+0x92) [0x55d106e92fb4]
12: (RGWCompletionManager::~RGWCompletionManager()+0x9) [0x55d106e92fe5]
13: (ceph::common::RefCountedObject::put() const+0x11e) [0x7f9ead38f330]
14: (RGWCoroutinesManager::~RGWCoroutinesManager()+0x23) [0x55d106e8f8ed]
15: (RGWRemoteDataLog::~RGWRemoteDataLog()+0x60) [0x55d1070a02fa]
16: (RGWDataSyncStatusManager::~RGWDataSyncStatusManager()+0x4b) [0x55d1070ab0db]
17: (RGWDataSyncProcessorThread::~RGWDataSyncProcessorThread()+0x1b) [0x55d1070ab135]
18: (RGWDataSyncProcessorThread::~RGWDataSyncProcessorThread()+0x9) [0x55d1070ab15f]
19: (RGWRados::finalize()+0x158) [0x55d10709b5c8]
20: (rgw::sal::RadosStore::finalize()+0x12) [0x55d1070c3cf6]
21: (DriverManager::close_storage(rgw::sal::Driver*)+0x12) [0x55d106fb3446]
22: (RGWRealmReloader::reload()+0x271) [0x55d106d4eebb]
23: (RGWRealmReloader::C_Reload::finish(int)+0xd) [0x55d106d500b5]
24: (Context::complete(int)+0x9) [0x55d106c75a0f]
25: (CommonSafeTimer<ceph::mutex_debug_detail::mutex_debug_impl<false> >::timer_thread()+0x1b2) [0x7f9ead3981fe]
26: (CommonSafeTimerThread<ceph::mutex_debug_detail::mutex_debug_impl<false> >::entry()+0xd) [0x7f9ead39865f]
27: (Thread::entry_wrapper()+0x3f) [0x7f9ead391371]
28: (Thread::_entry_func(void*)+0x9) [0x7f9ead391389]
29: /lib64/libc.so.6(+0x8b12d) [0x7f9eab8ae12d]
30: /lib64/libc.so.6(+0x10cbc0) [0x7f9eab92fbc0]
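
For reference, the invariant failing here is the refcount resurrection check in ceph::common::RefCountedObject::_get(). A simplified sketch (paraphrased from src/common/RefCountedObj.cc; debug logging elided):

void RefCountedObject::_get() const
{
  int v = ++nref;
  // A new reference must land on an object whose count was >= 1 before
  // the increment. Seeing v == 1 here means something took a ref on an
  // object that had already dropped to zero -- per the backtrace above,
  // ~RGWAioCompletionNotifier taking a reference while the owning
  // RGWCompletionManager is itself being destroyed.
  ceph_assert(v > 1);
}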
Actions #4

Updated by Shilpa MJ 9 months ago

I suspect the cause is https://github.com/ceph/ceph/pull/51772.
Calling http_manager stop() before RGWCoroutinesManager::stop() could prevent the proper unregistration and shutdown of all completion notifiers through the RGWCompletionManager.
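
If that hypothesis holds, the ordering issue would look roughly like this. This is an illustrative sketch only, not the actual fix in the PR; http_manager and cr_manager stand in for the real members:

void stop_sync_sketch()
{
  // Hypothesized bad order: stopping HTTP I/O first can strand completion
  // notifiers that are never unregistered from the RGWCompletionManager:
  //   http_manager.stop();
  //   cr_manager.stop();

  // Hypothesized safe order: drain/cancel the coroutine stacks first so
  // their notifiers get unregistered, then tear down the HTTP manager:
  cr_manager.stop();
  http_manager.stop();
}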

Actions #6

Updated by Casey Bodley 9 months ago

  • Related to Bug #61461: multisite: hang when update and commit period added
Actions #7

Updated by Casey Bodley 9 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 53045
Actions #8

Updated by Casey Bodley 8 months ago

  • Status changed from Fix Under Review to Resolved