Project

General

Profile

Actions

Bug #43289

closed

Crash in unittest_rbd_mirror due to race condition in PoolReplayer

Added by Ulrich Weigand over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On IBM Z I'm now seeing crashes on current mainline in unittest_rbd_mirror (in the TestMockPoolReplayer.NamespacesError test case):

/home/uweigand/ceph/src/test/rbd_mirror/test_mock_PoolReplayer.cc: 141: FAILED ceph_assert(s_instances.count(name))
ceph version 15.0.0-8299-gbe58eb0a8d (be58eb0a8d242bb4542177a1278be8ea1ba04045) octopus (dev)

Backtrace at the crash site:
#0 GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x000003fff262af78 in _GI_abort () at abort.c:79
#2 0x000003fff44a5990 in ceph::
_ceph_assert_fail (assertion=0x2aa02598ca0 "s_instances.count(name)",
file=0x2aa02598cb8 "/home/uweigand/ceph/src/test/rbd_mirror/test_mock_PoolReplayer.cc", line=141,
func=0x2aa0259e442 <rbd::mirror::NamespaceReplayer<librbd::(anonymous namespace)::MockTestImageCtx>::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::v14_2_0::IoCtx&, librados::v14_2_0::IoCtx&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rbd::mirror::Threads<librbd::(anonymous namespace)::MockTestImageCtx>*, rbd::mirror::Throttler<librbd::(anonymous namespace)::MockTestImageCtx>*, rbd::mirror::Throttler<librbd::(anonymous namespace)::MockTestImageCtx>*, rbd::mirror::ServiceDaemon<librbd::(anonymous namespace)::MockTestImageCtx>*, journal::CacheManagerHandler*)::__PRETTY_FUNCTION
> "static rbd::mirror::NamespaceReplayer<librbd::{anonymous}::MockTestImageCtx>* rbd::mirror::NamespaceReplayer<librbd::{anonymous}::MockTestImageCtx>::create(const string&, librados::v14_2_0::IoCtx&, li"...)
at /home/uweigand/ceph/src/common/assert.cc:73
#3 0x000003fff44a5a86 in ceph::__ceph_assert_fail (ctx=...) at /home/uweigand/ceph/src/common/assert.cc:78
#4 0x000002aa01a05340 in rbd::mirror::NamespaceReplayer<librbd::(anonymous namespace)::MockTestImageCtx>::create (name="", local_ioctx=..., remote_ioctx=...,
local_mirror_uuid="uuid", remote_mirror_uuid="uuid", site_name="siteA", threads=0x3ffffffe148, image_sync_throttler=0x0, image_deletion_throttler=0x0,
service_daemon=0x3ffffffdf58, cache_manager_handler=0x0) at /home/uweigand/ceph/src/test/rbd_mirror/test_mock_PoolReplayer.cc:141
#5 0x000002aa01a22164 in rbd::mirror::PoolReplayer<librbd::(anonymous namespace)::MockTestImageCtx>::update_namespace_replayers (this=0x3ffffffe610)
at /home/uweigand/ceph/src/tools/rbd_mirror/PoolReplayer.cc:604
#6 0x000002aa01a1f334 in rbd::mirror::PoolReplayer<librbd::(anonymous namespace)::MockTestImageCtx>::<lambda()>::operator()(void) const (this=0x3ffffffe610)
at /home/uweigand/ceph/src/tools/rbd_mirror/PoolReplayer.cc:534
#7 0x000002aa01a2290a in rbd::mirror::PoolReplayer<librbd::(anonymous namespace)::MockTestImageCtx>::with_namespace_replayers<rbd::mirror::PoolReplayer<ImageCtxT>::run() [with ImageCtxT = librbd::(anonymous namespace)::MockTestImageCtx]::<lambda()> >(rbd::mirror::PoolReplayer<librbd::(anonymous namespace)::MockTestImageCtx>::<lambda()> &&) (this=0x3ffffffe610,
callback=...) at /home/uweigand/ceph/src/tools/rbd_mirror/PoolReplayer.h:131
#8 0x000002aa01a1f674 in rbd::mirror::PoolReplayer<librbd::(anonymous namespace)::MockTestImageCtx>::run (this=0x3ffffffe610)
at /home/uweigand/ceph/src/tools/rbd_mirror/PoolReplayer.cc:534
#9 0x000002aa01a1d8a4 in rbd::mirror::PoolReplayer<librbd::(anonymous namespace)::MockTestImageCtx>::PoolReplayerThread::entry (this=0x3ffffffe870)
at /home/uweigand/ceph/src/tools/rbd_mirror/PoolReplayer.h:225
#10 0x000003fff44153ae in Thread::entry_wrapper (this=0x3ffffffe870) at /home/uweigand/ceph/src/common/Thread.cc:84
#11 0x000003fff44152e2 in Thread::_entry_func (arg=0x3ffffffe870) at /home/uweigand/ceph/src/common/Thread.cc:71
#12 0x000003fffde89c9e in start_thread (arg=0x3ffd787f910) at pthread_create.c:486
#13 0x000003fff26ff696 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Note how in frame 4 the ::create routine is called with an invalid "name" argument.

This seems to be caused by a race condition in PoolReplayer<I>::update_namespace_replayers here:

for (auto &name : mirroring_namespaces) {     // A
auto namespace_replayer = NamespaceReplayer&lt;I&gt;::create(
name, m_local_io_ctx, m_remote_io_ctx, m_local_mirror_uuid, m_peer.uuid,
m_site_name, m_threads, m_image_sync_throttler.get(),
m_image_deletion_throttler.get(), m_service_daemon,
m_cache_manager_handler);
auto on_init = new LambdaContext(
[this, namespace_replayer, name, &mirroring_namespaces,
ctx=gather_ctx->new_sub()](int r) {
if (r < 0) {
derr << "failed to initialize namespace replayer for namespace "
<< name << ": " << cpp_strerror(r) << dendl;
delete namespace_replayer;
mirroring_namespaces.erase(name); // B
} else {
std::lock_guard locker{m_lock};
m_namespace_replayers[name] = namespace_replayer;
m_service_daemon->add_namespace(m_local_pool_id, name);
}
ctx->complete(r);
});
namespace_replayer->init(on_init);
}

Note how the loop at A accesses the mirroring_namespaces set in one thread, while at the same time a different thread, executing at B, modifies that same set, without using any locking. Running the test under valgrind/helgrind confirms a potential data race here.

(B is already in an error path, but I believe this is to be expected as the NamespacesError test case specifically validates the behavior in this error scenario.)


Related issues 1 (0 open1 closed)

Copied to rbd - Backport #43504: nautilus: Crash in unittest_rbd_mirror due to race condition in PoolReplayerRejectedMykola GolubActions
Actions #1

Updated by Mykola Golub over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Mykola Golub
  • Backport set to nautilus
Actions #2

Updated by Mykola Golub over 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 32243
Actions #3

Updated by Mykola Golub over 4 years ago

Ulrich, thank you for the report and the detailed analysis!

Is there chance you could test the fix provided in https://github.com/ceph/ceph/pull/32243 ?

Actions #4

Updated by Mykola Golub over 4 years ago

I set backport to nautlus, because although the current nautilus version is not affected -- rbd-mirror namespaces support is not there, there are plans to backport this feature [1]. If it happens we will need to backport this fix too.

[1] https://tracker.ceph.com/issues/41812

Actions #5

Updated by Ulrich Weigand over 4 years ago

Yes, this fixes the problem. Thanks for the quick fix!

Actions #6

Updated by Jason Dillaman over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #43504: nautilus: Crash in unittest_rbd_mirror due to race condition in PoolReplayer added
Actions #8

Updated by Mykola Golub over 4 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF