Bug #17465

closed

multisite: coroutine deadlock in RGWMetaSyncCR after ECANCELED errors

Added by Casey Bodley over 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
% Done: 0%
Source: other
Tags:
Backport: jewel, kraken
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

An assert in RGWCoroutinesManager::run() that detects deadlocks between coroutines is triggered when many coroutines fail at the same time with an ECANCELED error (ret=-125) from cls. All of the failing cls operations originated from a single OSD (osd.2).
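For readers matching the error name to the log output: the ret=-125 values in the log are negated errno codes, and on Linux errno 125 is ECANCELED. The helper names below (is_ecanceled, describe_ret) are hypothetical and are not part of Ceph. This is only a minimal sketch of how such a return code could be classified.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: true when a negative-errno return code means ECANCELED.
bool is_ecanceled(int ret) {
  return ret == -ECANCELED;
}

// Hypothetical helper: human-readable text for a negative-errno return code.
std::string describe_ret(int ret) {
  assert(ret <= 0);  // this convention reports errors as negative errno values
  return ret == 0 ? std::string("success") : std::string(std::strerror(-ret));
}
```

Comparing against the ECANCELED macro rather than the literal 125 keeps the check portable, since the numeric value differs between platforms.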

2016-09-16 22:21:58.394794 28ec5700  0 meta sync: ERROR: can't store key: bucket.instance:test-client.0-etmo730qle4efwc-271:r0z1.4146.272 ret=-125
2016-09-16 22:21:58.399995 2b6ca700  0 meta sync: ERROR: can't store key: bucket.instance:test-client.0-etmo730qle4efwc-271:r0z1.4146.272 ret=-125
2016-09-16 22:21:58.400907 372e1700 20 cr:s=0x7690e510:op=0x78c40470:20RGWSimpleRadosLockCR: operate()
2016-09-16 22:21:58.401008 372e1700 20 cr:s=0x7690e510:op=0x78c40470:20RGWSimpleRadosLockCR: operate()
2016-09-16 22:21:58.401059 372e1700 20 cr:s=0x7690e510:op=0x78c40470:20RGWSimpleRadosLockCR: operate()
2016-09-16 22:21:58.401107 372e1700 20 cr:s=0x7690e510:op=0x78c40470:20RGWSimpleRadosLockCR: operate()
2016-09-16 22:21:58.401369 372e1700 20 cr:s=0x7690e510:op=0x7690db10:20RGWContinuousLeaseCR: operate()
2016-09-16 22:21:58.401485 372e1700 20 run: stack=0x7690e510 is io blocked
2016-09-16 22:21:58.401731 372e1700 20 cr:s=0x77890f80:op=0x7e3e63b0:19RGWMetaStoreEntryCR: operate()
2016-09-16 22:21:58.401782 372e1700 20 cr:s=0x77890f80:op=0x7e3e63b0:19RGWMetaStoreEntryCR: operate() returned r=-125
2016-09-16 22:21:58.401877 372e1700 20 cr:s=0x77890f80:op=0x77890830:24RGWMetaSyncSingleEntryCR: operate()
2016-09-16 22:21:58.401930 372e1700 20 meta sync: cr:s=0x77890f80:op=0x77890830:24RGWMetaSyncSingleEntryCR: failed to store metadata: bucket.instance:test-client.0-etmo730qle4efwc-271:r0z1.4146.272, got retcode=-125
2016-09-16 22:21:58.402115 372e1700 20 cr:s=0x77ed92d0:op=0x791a8570:19RGWMetaStoreEntryCR: operate()
2016-09-16 22:21:58.402156 372e1700 20 cr:s=0x77ed92d0:op=0x791a8570:19RGWMetaStoreEntryCR: operate() returned r=-125
2016-09-16 22:21:58.402216 372e1700 20 cr:s=0x77890f80:op=0x7a186200:19RGWMetaStoreEntryCR: operate()
2016-09-16 22:21:58.402276 372e1700 20 cr:s=0x77ed92d0:op=0x77ed8c80:24RGWMetaSyncSingleEntryCR: operate()
2016-09-16 22:21:58.402331 372e1700 20 cr:s=0x77ed92d0:op=0x77ed8c80:24RGWMetaSyncSingleEntryCR: operate() returned r=-125
2016-09-16 22:21:58.402407 372e1700 20 stack->operate() returned ret=-125
2016-09-16 22:21:58.402443 372e1700 20 run: stack=0x77ed92d0 is done
2016-09-16 22:21:58.493422 372e1700 -1 /srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.0/src/rgw/rgw_coroutine.cc: In function 'int RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*>&)' thread 372e1700 time 2016-09-16 22:21:58.403580
/srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.0/src/rgw/rgw_coroutine.cc: 590: FAILED assert(context_stacks.empty() || going_down.read())

 ceph version v11.0.0-2288-gceae10b (ceae10b143c9ef54602b51915cd3fb4475b3c3c1)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x5c76e5]
 2: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0xea0) [0x3729e0]
 3: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x372b80]
 4: (RGWRemoteMetaLog::run_sync()+0xf2b) [0x50279b]
 5: (RGWMetaSyncProcessorThread::process()+0xd) [0x4070fd]
 6: (RGWRadosThread::Worker::entry()+0x133) [0x3a9413]
 7: (()+0x7dc5) [0x149addc5]
 8: (clone()+0x6d) [0x16185ced]
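
The failed assertion in the backtrace above, assert(context_stacks.empty() || going_down.read()), fires when the scheduler has nothing left to run but some stacks never completed. The following is a toy sketch of that kind of check, not the Ceph implementation: ToyStack, run_all, and the context set are hypothetical names, and the sketch returns a bool where the real code asserts.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <vector>

// Hypothetical stand-in for a coroutine stack; not Ceph's RGWCoroutinesStack.
struct ToyStack {
  int id;
  int steps_left;      // scheduler turns before the stack reaches its final state
  bool parks_forever;  // true: ends up blocked on a wakeup that never arrives
};

// Returns true if every stack completed, false if the run queue drained while
// some stacks were still registered. The real code asserts on that condition;
// this sketch returns a bool so the outcome can be inspected.
bool run_all(std::vector<ToyStack> stacks) {
  std::deque<ToyStack*> runnable;
  std::set<int> context;  // ids of stacks that have not completed yet
  for (ToyStack& s : stacks) {
    assert(s.steps_left >= 0);
    runnable.push_back(&s);
    context.insert(s.id);
  }
  while (!runnable.empty()) {
    ToyStack* s = runnable.front();
    runnable.pop_front();
    if (s->steps_left > 0) {
      --s->steps_left;
      runnable.push_back(s);  // more work to do: reschedule
    } else if (!s->parks_forever) {
      context.erase(s->id);   // finished: drop from the context
    }
    // a parked stack is neither rescheduled nor erased
  }
  return context.empty();
}
```

In this model, a stack that is left waiting after its peers have failed and exited stays in the context set, so run_all returns false. That loosely corresponds to the situation reported here, where many RGWMetaStoreEntryCR operations return -125 at once.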

http://qa-proxy.ceph.com/teuthology/cbodley-2016-09-16_10:00:36-rgw-wip-cbodley-testing---basic-mira/419161/teuthology.log
http://qa-proxy.ceph.com/teuthology/owasserm-2016-09-26_16:09:01-rgw-wip-orit-testing---basic-mira/438706/teuthology.log


Related issues 3 (1 open, 2 closed)

Related to rgw - Bug #17574: multisite: many duplicate mdlog entries cause race to sync and result in ECANCELED (New, Casey Bodley, 10/13/2016)
Copied to rgw - Backport #18286: jewel: multisite: coroutine deadlock in RGWMetaSyncCR after ECANCELED errors (Resolved, Nathan Cutler)
Copied to rgw - Backport #18287: kraken: multisite: coroutine deadlock in RGWMetaSyncCR after ECANCELED errors (Closed)
