Bug #18963

rbd-mirror: forced failover does not function when peer is unreachable

Added by Jason Dillaman about 7 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Jason Dillaman
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When a local image is force promoted to primary, the local rbd-mirror daemon should detect that the local images are now primary, shut down the image replayers, and release the exclusive lock. However, if the remote peer is unreachable, this can deadlock and the image replayers do not shut down correctly.

#0  0x00007f96db88b6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f96dc6c7ad1 in Wait (mutex=..., this=0x7f9636ff9da0) at common/Cond.h:56
#2  librados::IoCtxImpl::operate_read (this=this@entry=0x7f96efdfb050, oid=..., o=o@entry=0x7f9636ff9fc0, pbl=pbl@entry=0x7f9636ffa180, flags=flags@entry=0) at librados/IoCtxImpl.cc:725
#3  0x00007f96dc6d25d3 in librados::IoCtxImpl::exec (this=0x7f96efdfb050, oid=..., cls=cls@entry=0x7f96e649f4c7 "rbd", method=method@entry=0x7f96e64e42e7 "mirror_mode_get", inbl=..., outbl=...) at librados/IoCtxImpl.cc:1135
#4  0x00007f96dc681a74 in librados::IoCtx::exec (this=this@entry=0x7f96efdfb710, oid="rbd_mirroring", cls=cls@entry=0x7f96e649f4c7 "rbd", method=method@entry=0x7f96e64e42e7 "mirror_mode_get", inbl=..., outbl=...) at librados/librados.cc:1273
#5  0x00007f96e638ec7d in librbd::cls_client::mirror_mode_get (ioctx=ioctx@entry=0x7f96efdfb710, mirror_mode=mirror_mode@entry=0x7f9636ffa21c) at cls/rbd/cls_rbd_client.cc:1042
#6  0x00007f96e623bf10 in librbd::mirror_mode_get (io_ctx=..., mirror_mode=mirror_mode@entry=0x7f9636ffa3dc) at librbd/internal.cc:3445
#7  0x00007f96e61d471a in rbd::mirror::PoolWatcher::refresh (this=this@entry=0x7f96efdfb710, image_ids=image_ids@entry=0x7f9636ffa680) at tools/rbd_mirror/PoolWatcher.cc:90
#8  0x00007f96e61d54df in rbd::mirror::PoolWatcher::refresh_images (this=0x7f96efdfb710, reschedule=<optimized out>) at tools/rbd_mirror/PoolWatcher.cc:65
#9  0x00007f96e61b0c9a in operator() (a0=<optimized out>, this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
#10 FunctionContext::finish (this=<optimized out>, r=<optimized out>) at include/Context.h:460
#11 0x00007f96e61aeb89 in Context::complete (this=0x7f954c00d530, r=<optimized out>) at include/Context.h:64
#12 0x00007f96e63ccd24 in SafeTimer::timer_thread (this=0x7f96efdfb730) at common/Timer.cc:105
#13 0x00007f96e63ce75d in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#14 0x00007f96db887dc5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f96da77073d in clone () from /lib64/libc.so.6

#0  0x00007f96db88b6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f96dc6c7ad1 in Wait (mutex=..., this=0x7f9596ffa120) at common/Cond.h:56
#2  librados::IoCtxImpl::operate_read (this=this@entry=0x7f96efe66fb0, oid=..., o=o@entry=0x7f9596ffa340, pbl=pbl@entry=0x7f9596ffa500, flags=flags@entry=0) at librados/IoCtxImpl.cc:725
#3  0x00007f96dc6d25d3 in librados::IoCtxImpl::exec (this=0x7f96efe66fb0, oid=..., cls=cls@entry=0x7f96e649f4c7 "rbd", method=method@entry=0x7f96e64e42c7 "mirror_uuid_get", inbl=..., outbl=...) at librados/IoCtxImpl.cc:1135
#4  0x00007f96dc681a74 in librados::IoCtx::exec (this=this@entry=0x7f96efe2d3f8, oid="rbd_mirroring", cls=cls@entry=0x7f96e649f4c7 "rbd", method=method@entry=0x7f96e64e42c7 "mirror_uuid_get", inbl=..., outbl=...) at librados/librados.cc:1273
#5  0x00007f96e638e8dd in librbd::cls_client::mirror_uuid_get (ioctx=ioctx@entry=0x7f96efe2d3f8, uuid=uuid@entry=0x7f9596ffa650) at cls/rbd/cls_rbd_client.cc:1010
Python Exception <type 'exceptions.ValueError'> Cannot find type const rbd::mirror::Replayer::ImageIds::_Rep_type: 
#6  0x00007f96e61ac49f in rbd::mirror::Replayer::set_sources (this=this@entry=0x7f96efe2d2d0, image_ids=std::set with 4 elements) at tools/rbd_mirror/Replayer.cc:631
#7  0x00007f96e61adc47 in rbd::mirror::Replayer::run (this=0x7f96efe2d2d0) at tools/rbd_mirror/Replayer.cc:453
#8  0x00007f96e61b15fd in rbd::mirror::Replayer::ReplayerThread::entry (this=<optimized out>) at tools/rbd_mirror/Replayer.h:125
#9  0x00007f96db887dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f96da77073d in clone () from /lib64/libc.so.6

#0  0x00007f96db88e1bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f96db889d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f96db889c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96e63c5458 in Mutex::Lock (this=this@entry=0x7f96efdf5ad8, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f96e61a6767 in Locker (m=..., this=<synthetic pointer>) at common/Mutex.h:115
#5  rbd::mirror::Replayer::is_blacklisted (this=0x7f96efdf5ab0) at tools/rbd_mirror/Replayer.cc:263
Python Exception <type 'exceptions.ValueError'> Cannot find type const rbd::mirror::Mirror::PoolPeers::_Rep_type: 
#6  0x00007f96e61a218b in rbd::mirror::Mirror::update_replayers (this=this@entry=0x7f96efdbcbe0, pool_peers=std::map with 3 elements) at tools/rbd_mirror/Mirror.cc:368
#7  0x00007f96e61a2cf6 in rbd::mirror::Mirror::run (this=0x7f96efdbcbe0) at tools/rbd_mirror/Mirror.cc:237
#8  0x00007f96e619a592 in main (argc=<optimized out>, argv=0x7ffe3e072c68) at tools/rbd_mirror/main.cc:74
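
The backtraces show where the daemon is wedged: the PoolWatcher and Replayer threads sit in synchronous librados cls calls (mirror_mode_get and mirror_uuid_get on the rbd_mirroring object) waiting for replies that, with the peer unreachable, never arrive, while the main thread is blocked trying to take the replayer lock in Replayer::is_blacklisted(). The following is only a rough standalone sketch of the underlying pattern, not rbd-mirror code; the pool name, client credentials, and 30-second deadline are illustrative assumptions. It contrasts a synchronous IoCtx::exec(), which blocks for as long as the peer stays silent, with the same cls call issued through aio_exec() and polled against a local deadline so the caller can give up:

// sketch.cc -- illustrative only; build with e.g.: g++ sketch.cc -lrados
#include <rados/librados.hpp>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
  librados::Rados cluster;
  if (cluster.init2("client.admin", "ceph", 0) < 0 ||
      cluster.conf_read_file(nullptr) < 0 ||
      cluster.connect() < 0) {
    std::cerr << "cannot connect to peer cluster" << std::endl;
    return 1;
  }

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("rbd", ioctx) < 0) {
    std::cerr << "cannot open pool" << std::endl;
    return 1;
  }

  librados::bufferlist in, out;

  // Synchronous form (what the stuck threads are doing): exec() does not
  // return until the OSD answers, so an unreachable peer blocks the
  // calling thread indefinitely.
  //   ioctx.exec("rbd_mirroring", "rbd", "mirror_mode_get", in, out);

  // Asynchronous form with a local deadline: issue the same cls call via
  // aio_exec() and poll the completion, so the caller can give up and
  // continue with local-only work.
  librados::AioCompletion *comp = librados::Rados::aio_create_completion();
  ioctx.aio_exec("rbd_mirroring", comp, "rbd", "mirror_mode_get", in, &out);

  auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(30);
  while (!comp->is_complete() &&
         std::chrono::steady_clock::now() < deadline) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  if (comp->is_complete()) {
    std::cout << "mirror_mode_get returned " << comp->get_return_value()
              << std::endl;
  } else {
    std::cerr << "peer did not respond within the deadline; giving up"
              << std::endl;
  }
  comp->release();
  ioctx.close();
  cluster.shutdown();
  return 0;
}

Either way the program finishes: it prints the cls return value if the peer answers, or reports that the peer never responded, instead of leaving the calling thread parked in a condition wait as in the traces above.
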
#1

Updated by Jason Dillaman about 7 years ago

  • Description updated (diff)
#2

Updated by Jason Dillaman about 7 years ago

The individual ImageReplayers are stuck in the STOPPING state, trying to stop the replay of the remote journal. Due to the loss of connectivity with the remote peer, the journal replay cannot be stopped.
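
A simplified, hypothetical sketch of that stuck STOPPING state follows; it is not the actual ImageReplayer code, and the class and member names are invented. The point is only that a stop path which waits unconditionally for the remote journal replay to acknowledge shutdown can never finish while the peer is down, whereas a stop path with a local deadline can still force the local side to shut down:

// Hypothetical illustration -- not rbd-mirror code.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

class ImageReplayerSketch {
public:
  // Called when the remote journal replay confirms it has stopped.
  // With the peer unreachable this notification never happens.
  void notify_remote_replay_stopped() {
    std::lock_guard<std::mutex> l(m_lock);
    m_remote_stopped = true;
    m_cond.notify_all();
  }

  // Blocking stop: mirrors the reported behaviour -- it cannot leave the
  // STOPPING state while the peer never responds.
  void stop_blocking() {
    std::unique_lock<std::mutex> l(m_lock);
    m_cond.wait(l, [this] { return m_remote_stopped; });  // stuck forever
  }

  // Stop with a local deadline: gives up waiting for the peer so the
  // local image can still be released and promoted.
  bool stop_with_timeout(std::chrono::seconds timeout) {
    std::unique_lock<std::mutex> l(m_lock);
    return m_cond.wait_for(l, timeout, [this] { return m_remote_stopped; });
  }

private:
  std::mutex m_lock;
  std::condition_variable m_cond;
  bool m_remote_stopped = false;
};

int main() {
  ImageReplayerSketch replayer;
  // The peer is unreachable, so notify_remote_replay_stopped() is never
  // called; only the timed stop returns.
  if (!replayer.stop_with_timeout(std::chrono::seconds(2))) {
    std::cout << "remote replay never stopped; forcing local shut down\n";
  }
  return 0;
}

The blocking variant corresponds to what is described above; the timed variant is only meant to show that some local bail-out (a deadline or an explicit force/interrupt path) is needed once the peer cannot be reached.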

#3

Updated by Jason Dillaman almost 7 years ago

#4

Updated by Jason Dillaman almost 7 years ago

  • Status changed from In Progress to Fix Under Review
#5

Updated by Jason Dillaman almost 7 years ago

  • Backport deleted (kraken,jewel)
#6

Updated by Mykola Golub almost 7 years ago

  • Status changed from Fix Under Review to Resolved
#7

Updated by Nathan Cutler over 6 years ago

@Jason Dillaman, @Mykola Golub: Is a jewel backport feasible for this fix? Someone is requesting it.

#8

Updated by Jason Dillaman over 6 years ago

@Nathan Cutler: it's a lot of code to attempt to backport, which is why I yanked the backport label -- it's high risk.

#9

Updated by liuzhong chen about 6 years ago

@Jason Dillaman: this looks like a big bug. Does this issue mean that if the remote server has gone down and cannot come back up, I cannot promote the local non-primary image to primary?
