Bug #18963

rbd-mirror: forced failover does not function when peer is unreachable

Added by Jason Dillaman 7 months ago. Updated 24 days ago.

Status: Resolved
Priority: Normal
Target version: -
Start date: 02/16/2017
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

When a local image is force promoted to primary, the local rbd-mirror daemon should detect that the local images are now primary, shut down the image replayers, and release the exclusive lock. However, if the remote peer is unreachable, the daemon can deadlock and the image replayers never shut down.

#0  0x00007f96db88b6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f96dc6c7ad1 in Wait (mutex=..., this=0x7f9636ff9da0) at common/Cond.h:56
#2  librados::IoCtxImpl::operate_read (this=this@entry=0x7f96efdfb050, oid=..., o=o@entry=0x7f9636ff9fc0, pbl=pbl@entry=0x7f9636ffa180, flags=flags@entry=0) at librados/IoCtxImpl.cc:725
#3  0x00007f96dc6d25d3 in librados::IoCtxImpl::exec (this=0x7f96efdfb050, oid=..., cls=cls@entry=0x7f96e649f4c7 "rbd", method=method@entry=0x7f96e64e42e7 "mirror_mode_get", inbl=..., outbl=...) at librados/IoCtxImpl.cc:1135
#4  0x00007f96dc681a74 in librados::IoCtx::exec (this=this@entry=0x7f96efdfb710, oid="rbd_mirroring", cls=cls@entry=0x7f96e649f4c7 "rbd", method=method@entry=0x7f96e64e42e7 "mirror_mode_get", inbl=..., outbl=...) at librados/librados.cc:1273
#5  0x00007f96e638ec7d in librbd::cls_client::mirror_mode_get (ioctx=ioctx@entry=0x7f96efdfb710, mirror_mode=mirror_mode@entry=0x7f9636ffa21c) at cls/rbd/cls_rbd_client.cc:1042
#6  0x00007f96e623bf10 in librbd::mirror_mode_get (io_ctx=..., mirror_mode=mirror_mode@entry=0x7f9636ffa3dc) at librbd/internal.cc:3445
#7  0x00007f96e61d471a in rbd::mirror::PoolWatcher::refresh (this=this@entry=0x7f96efdfb710, image_ids=image_ids@entry=0x7f9636ffa680) at tools/rbd_mirror/PoolWatcher.cc:90
#8  0x00007f96e61d54df in rbd::mirror::PoolWatcher::refresh_images (this=0x7f96efdfb710, reschedule=<optimized out>) at tools/rbd_mirror/PoolWatcher.cc:65
#9  0x00007f96e61b0c9a in operator() (a0=<optimized out>, this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
#10 FunctionContext::finish (this=<optimized out>, r=<optimized out>) at include/Context.h:460
#11 0x00007f96e61aeb89 in Context::complete (this=0x7f954c00d530, r=<optimized out>) at include/Context.h:64
#12 0x00007f96e63ccd24 in SafeTimer::timer_thread (this=0x7f96efdfb730) at common/Timer.cc:105
#13 0x00007f96e63ce75d in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#14 0x00007f96db887dc5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f96da77073d in clone () from /lib64/libc.so.6

#0  0x00007f96db88b6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f96dc6c7ad1 in Wait (mutex=..., this=0x7f9596ffa120) at common/Cond.h:56
#2  librados::IoCtxImpl::operate_read (this=this@entry=0x7f96efe66fb0, oid=..., o=o@entry=0x7f9596ffa340, pbl=pbl@entry=0x7f9596ffa500, flags=flags@entry=0) at librados/IoCtxImpl.cc:725
#3  0x00007f96dc6d25d3 in librados::IoCtxImpl::exec (this=0x7f96efe66fb0, oid=..., cls=cls@entry=0x7f96e649f4c7 "rbd", method=method@entry=0x7f96e64e42c7 "mirror_uuid_get", inbl=..., outbl=...) at librados/IoCtxImpl.cc:1135
#4  0x00007f96dc681a74 in librados::IoCtx::exec (this=this@entry=0x7f96efe2d3f8, oid="rbd_mirroring", cls=cls@entry=0x7f96e649f4c7 "rbd", method=method@entry=0x7f96e64e42c7 "mirror_uuid_get", inbl=..., outbl=...) at librados/librados.cc:1273
#5  0x00007f96e638e8dd in librbd::cls_client::mirror_uuid_get (ioctx=ioctx@entry=0x7f96efe2d3f8, uuid=uuid@entry=0x7f9596ffa650) at cls/rbd/cls_rbd_client.cc:1010
Python Exception <type 'exceptions.ValueError'> Cannot find type const rbd::mirror::Replayer::ImageIds::_Rep_type: 
#6  0x00007f96e61ac49f in rbd::mirror::Replayer::set_sources (this=this@entry=0x7f96efe2d2d0, image_ids=std::set with 4 elements) at tools/rbd_mirror/Replayer.cc:631
#7  0x00007f96e61adc47 in rbd::mirror::Replayer::run (this=0x7f96efe2d2d0) at tools/rbd_mirror/Replayer.cc:453
#8  0x00007f96e61b15fd in rbd::mirror::Replayer::ReplayerThread::entry (this=<optimized out>) at tools/rbd_mirror/Replayer.h:125
#9  0x00007f96db887dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f96da77073d in clone () from /lib64/libc.so.6

#0  0x00007f96db88e1bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f96db889d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f96db889c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96e63c5458 in Mutex::Lock (this=this@entry=0x7f96efdf5ad8, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f96e61a6767 in Locker (m=..., this=<synthetic pointer>) at common/Mutex.h:115
#5  rbd::mirror::Replayer::is_blacklisted (this=0x7f96efdf5ab0) at tools/rbd_mirror/Replayer.cc:263
Python Exception <type 'exceptions.ValueError'> Cannot find type const rbd::mirror::Mirror::PoolPeers::_Rep_type: 
#6  0x00007f96e61a218b in rbd::mirror::Mirror::update_replayers (this=this@entry=0x7f96efdbcbe0, pool_peers=std::map with 3 elements) at tools/rbd_mirror/Mirror.cc:368
#7  0x00007f96e61a2cf6 in rbd::mirror::Mirror::run (this=0x7f96efdbcbe0) at tools/rbd_mirror/Mirror.cc:237
#8  0x00007f96e619a592 in main (argc=<optimized out>, argv=0x7ffe3e072c68) at tools/rbd_mirror/main.cc:74
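
The traces above show the shape of the deadlock: the Replayer thread is parked in a synchronous cls call (mirror_uuid_get) to the unreachable peer inside set_sources(), presumably while holding the replayer's lock; the main thread is therefore stuck in Mutex::Lock() inside is_blacklisted(), and the pool-watcher timer thread is blocked in a synchronous operate_read() of its own. The following is a minimal C++ sketch of that pattern, not rbd-mirror's actual code; all of the names are invented, and the timed-wait variant at the end is only one possible mitigation, not the fix that was merged.

#include <chrono>
#include <condition_variable>
#include <mutex>

// Minimal sketch (invented names) of the hazard visible in the backtraces:
// a worker holds a mutex across an unbounded wait for a reply from an
// unreachable peer, so every other thread that needs the mutex stalls too.

std::mutex state_lock;               // stands in for the replayer's lock
std::condition_variable reply_cond;  // stands in for the Cond inside operate_read()
bool reply_received = false;         // never set while the peer is down

// Worker thread: roughly the role of set_sources() issuing a synchronous
// cls call (mirror_uuid_get) that never returns while the peer is down.
void worker_refresh_remote_state() {
  std::unique_lock<std::mutex> lock(state_lock);
  // Unbounded wait: the peer never answers, and state_lock stays held.
  reply_cond.wait(lock, [] { return reply_received; });
}

// Main thread: roughly the role of update_replayers() -> is_blacklisted();
// it only wants a quick look at the state, but blocks indefinitely because
// the worker still owns state_lock.
bool main_thread_is_blacklisted() {
  std::lock_guard<std::mutex> lock(state_lock);  // wedges here
  return false;
}

// One possible mitigation (an assumption, not the merged fix): bound the
// wait so the lock is released and the caller can notice the peer is gone
// and shut the replayer down anyway.
bool worker_refresh_with_timeout() {
  std::unique_lock<std::mutex> lock(state_lock);
  return reply_cond.wait_for(lock, std::chrono::seconds(30),
                             [] { return reply_received; });
}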

History

#1 Updated by Jason Dillaman 7 months ago

  • Description updated

#2 Updated by Jason Dillaman 7 months ago

The individual ImageReplayers are stuck in the STOPPING state, trying to stop the replay of the remote journal. Due to the loss of connectivity with the remote peer, the journal replay cannot be stopped.
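
In other words, the stop path parks the replayer in STOPPING until the remote journal replay acknowledges its shutdown, and with the peer unreachable that acknowledgement never arrives. The toy model below uses hypothetical names rather than the real ImageReplayer API; it only illustrates why a completion-driven STOPPING -> STOPPED transition wedges when the completion never fires.

#include <functional>
#include <iostream>
#include <mutex>

// Toy model (invented names) of the wedge described above: stop() asks the
// remote journal replay to shut down and only moves to STOPPED from the
// completion callback.  If the peer is unreachable that callback never
// fires, so the replayer sits in STOPPING forever.

enum class State { REPLAYING, STOPPING, STOPPED };

class ToyImageReplayer {
public:
  void stop() {
    std::lock_guard<std::mutex> lock(m_lock);
    m_state = State::STOPPING;
    // Ask the remote journal to stop; completion arrives asynchronously.
    request_remote_journal_stop([this](int r) { handle_stop_complete(r); });
  }

  State state() {
    std::lock_guard<std::mutex> lock(m_lock);
    return m_state;
  }

private:
  void handle_stop_complete(int r) {
    std::lock_guard<std::mutex> lock(m_lock);
    m_state = State::STOPPED;  // never reached while the peer is down
  }

  // Stand-in for the remote request; with an unreachable peer the
  // completion callback is simply never invoked.
  void request_remote_journal_stop(std::function<void(int)> on_finish) {
    (void)on_finish;  // dropped: models a request that never completes
  }

  std::mutex m_lock;
  State m_state = State::REPLAYING;
};

int main() {
  ToyImageReplayer replayer;
  replayer.stop();
  // The replayer reports STOPPING indefinitely; a forced promotion would
  // have to break this wait to make progress.
  std::cout << (replayer.state() == State::STOPPING ? "stuck in STOPPING"
                                                    : "stopped")
            << std::endl;
  return 0;
}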

#4 Updated by Jason Dillaman 4 months ago

  • Status changed from In Progress to Need Review

#5 Updated by Jason Dillaman 4 months ago

  • Backport deleted (kraken,jewel)

#6 Updated by Mykola Golub 4 months ago

  • Status changed from Need Review to Resolved

#7 Updated by Nathan Cutler 24 days ago

@Jason, @Mykola: Is a jewel backport feasible for this fix? Someone is requesting it.

#8 Updated by Jason Dillaman 24 days ago

@Nathan: it's a lot of code to attempt to backport, which is why I yanked the backport label -- it's high risk.
