Bug #17571

closed

multisite: coroutine deadlock assertion on error in FetchAllMetaCR

Added by Casey Bodley over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source: other
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

An early error in RGWFetchAllMetaCR triggers the deadlock-detection assertion in RGWCoroutinesManager::run(). The four coroutines logged as 'still running' at that point are the RGWOmapAppend coroutines spawned by RGWShardedOmapCRManager. The error paths in RGWFetchAllMetaCR need to shut these down before exiting.

2016-10-10 10:47:59.034913 7f39e52e9700  1 -- 10.17.151.111:0/2961098911 <== osd.0 10.17.151.111:6800/26012 798 ==== osd_op_reply(791 mdlog.sync-status [call] v19'301 uv301 ondisk = 0) v7 ==== 137+0+0 (2820548376 0 0) 0x559d4d5cc340 con 0x559d4d5a4000
2016-10-10 10:47:59.035005 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4dce8000:22RGWSimpleRadosUnlockCR: operate()
2016-10-10 10:47:59.035014 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4dce8000:22RGWSimpleRadosUnlockCR: operate()
2016-10-10 10:47:59.035015 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4dce8000:22RGWSimpleRadosUnlockCR: operate()
2016-10-10 10:47:59.035016 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4dce8000:22RGWSimpleRadosUnlockCR: operate()
2016-10-10 10:47:59.035033 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4d5ae000:20RGWContinuousLeaseCR: operate()
2016-10-10 10:47:59.035037 7f39ce2bb700 20 run: stack=0x559d4de6adf0 is done
2016-10-10 10:47:59.035039 7f39ce2bb700 20 cr:s=0x559d4de68000:op=0x559d4e23f500:17RGWFetchAllMetaCR: operate()
2016-10-10 10:47:59.035041 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4de6adf0 is complete
2016-10-10 10:47:59.035042 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4de69860 is still running
2016-10-10 10:47:59.035043 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4dc7cfd0 is still running
2016-10-10 10:47:59.035045 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4dc7ca30 is still running
2016-10-10 10:47:59.035045 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4d45a1c0 is still running
2016-10-10 10:47:59.035047 7f39ce2bb700 20 run: stack=0x559d4de68000 is_blocked_by_stack()=0 is_sleeping=0 waiting_for_child()=1
2016-10-10 10:47:59.039864 7f39ce2bb700 -1 /home/cbodley/ceph/src/rgw/rgw_coroutine.cc: In function 'int RGWCoroutinesManager::run(std::__cxx11::list<RGWCoroutinesStack*>&)' thread 7f39ce2bb700 time 2016-10-10 10:47:59.035048
/home/cbodley/ceph/src/rgw/rgw_coroutine.cc: 590: FAILED assert(context_stacks.empty() || going_down.read())

 ceph version Development (no_version)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x559d439f5f08]
 2: (RGWCoroutinesManager::run(std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0xfed) [0x559d436a8475]
 3: (RGWCoroutinesManager::run(RGWCoroutine*)+0xbc) [0x559d436a884e]
 4: (RGWRemoteMetaLog::run_sync()+0x125f) [0x559d438cb669]
 5: (RGWMetaSyncStatusManager::run()+0x1c) [0x559d4376cdea]
 6: (RGWMetaSyncProcessorThread::process()+0x1c) [0x559d4376f0a2]
 7: (RGWRadosThread::Worker::entry()+0xf6) [0x559d437137a2]
 8: (Thread::entry_wrapper()+0xc1) [0x559d43a1a0f9]
 9: (Thread::_entry_func(void*)+0x18) [0x559d43a1a02e]
 10: (()+0x761a) [0x7f39ef5c061a]
 11: (clone()+0x6d) [0x7f39eddb259d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
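
The failure mode described above can be illustrated with a small, self-contained C++ model. This is only a sketch and not Ceph code: ToyAppender, ToyShardedManager and toy_fetch_all are hypothetical names invented for the illustration, and the model does not reproduce the real coroutine scheduler. It shows only that a parent which returns on an early error without finishing its spawned children leaves them all running, which corresponds to the state the assertion rejects.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a spawned child coroutine such as an omap
// appender. It stays "running" until its parent asks it to finish.
struct ToyAppender {
    bool finished = false;
    void finish() { finished = true; }
    bool is_running() const { return !finished; }
};

// Hypothetical stand-in for a manager that spawns one appender per shard.
class ToyShardedManager {
public:
    explicit ToyShardedManager(std::size_t num_shards) {
        for (std::size_t i = 0; i < num_shards; ++i) {
            shards_.push_back(std::make_unique<ToyAppender>());
        }
    }

    // Ask every appender to exit.
    void finish() {
        for (auto& shard : shards_) {
            shard->finish();
        }
    }

    // Number of children a scheduler would still report as running.
    std::size_t still_running() const {
        std::size_t count = 0;
        for (const auto& shard : shards_) {
            if (shard->is_running()) {
                ++count;
            }
        }
        return count;
    }

private:
    std::vector<std::unique_ptr<ToyAppender>> shards_;
};

// Hypothetical parent. Returns how many children are left running when it
// exits; any non-zero value models the state that trips the assertion.
std::size_t toy_fetch_all(bool early_error, bool finish_on_error) {
    ToyShardedManager shards(4);
    if (early_error) {
        if (finish_on_error) {
            shards.finish();  // cleanup that the error path needs
        }
        return shards.still_running();
    }
    // Normal work would happen here before the regular shutdown.
    shards.finish();
    return shards.still_running();
}
```

In this model, toy_fetch_all(true, false) leaves all four children running, while toy_fetch_all(true, true) leaves none, which is the behaviour the description asks the error paths to have.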

Related issues 2 (0 open, 2 closed)

Related to rgw - Bug #17568: multisite: race between ReadSyncStatus and InitSyncStatus leads to EIO errors (Resolved, Casey Bodley, 10/13/2016)

Copied to rgw - Backport #17709: jewel: multisite: coroutine deadlock assertion on error in FetchAllMetaCR (Resolved, Loïc Dachary)
Actions #1

Updated by Casey Bodley over 7 years ago

  • Related to Bug #17568: multisite: race between ReadSyncStatus and InitSyncStatus leads to EIO errors added
Actions #2

Updated by Casey Bodley over 7 years ago

  • Assignee set to Casey Bodley
Actions #3

Updated by Casey Bodley over 7 years ago

  • Status changed from New to Fix Under Review
  • Backport set to jewel
Actions #4

Updated by Yehuda Sadeh over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #5

Updated by Loïc Dachary over 7 years ago

  • Copied to Backport #17709: jewel: multisite: coroutine deadlock assertion on error in FetchAllMetaCR added
Actions #6

Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved