Bug #17571

multisite: coroutine deadlock assertion on error in FetchAllMetaCR

Added by Casey Bodley 11 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
Start date: 10/13/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

An early error in RGWFetchAllMetaCR triggers the deadlock detection assertion in RGWCoroutinesManager::run(). The four coroutines logged as 'still running' at that point are RGWOmapAppend coroutines spawned by RGWShardedOmapCRManager. They are still waiting for input when the parent returns, so the error paths in RGWFetchAllMetaCR need to shut them down before exiting.

2016-10-10 10:47:59.034913 7f39e52e9700  1 -- 10.17.151.111:0/2961098911 <== osd.0 10.17.151.111:6800/26012 798 ==== osd_op_reply(791 mdlog.sync-status [call] v19'301 uv301 ondisk = 0) v7 ==== 137+0+0 (2820548376 0 0) 0x559d4d5cc340 con 0x559d4d5a4000
2016-10-10 10:47:59.035005 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4dce8000:22RGWSimpleRadosUnlockCR: operate()
2016-10-10 10:47:59.035014 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4dce8000:22RGWSimpleRadosUnlockCR: operate()
2016-10-10 10:47:59.035015 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4dce8000:22RGWSimpleRadosUnlockCR: operate()
2016-10-10 10:47:59.035016 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4dce8000:22RGWSimpleRadosUnlockCR: operate()
2016-10-10 10:47:59.035033 7f39ce2bb700 20 cr:s=0x559d4de6adf0:op=0x559d4d5ae000:20RGWContinuousLeaseCR: operate()
2016-10-10 10:47:59.035037 7f39ce2bb700 20 run: stack=0x559d4de6adf0 is done
2016-10-10 10:47:59.035039 7f39ce2bb700 20 cr:s=0x559d4de68000:op=0x559d4e23f500:17RGWFetchAllMetaCR: operate()
2016-10-10 10:47:59.035041 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4de6adf0 is complete
2016-10-10 10:47:59.035042 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4de69860 is still running
2016-10-10 10:47:59.035043 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4dc7cfd0 is still running
2016-10-10 10:47:59.035045 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4dc7ca30 is still running
2016-10-10 10:47:59.035045 7f39ce2bb700 20 collect(): s=0x559d4de68000 stack=0x559d4d45a1c0 is still running
2016-10-10 10:47:59.035047 7f39ce2bb700 20 run: stack=0x559d4de68000 is_blocked_by_stack()=0 is_sleeping=0 waiting_for_child()=1
2016-10-10 10:47:59.039864 7f39ce2bb700 -1 /home/cbodley/ceph/src/rgw/rgw_coroutine.cc: In function 'int RGWCoroutinesManager::run(std::__cxx11::list<RGWCoroutinesStack*>&)' thread 7f39ce2bb700 time 2016-10-10 10:47:59.035048
/home/cbodley/ceph/src/rgw/rgw_coroutine.cc: 590: FAILED assert(context_stacks.empty() || going_down.read())

 ceph version Development (no_version)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x559d439f5f08]
 2: (RGWCoroutinesManager::run(std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0xfed) [0x559d436a8475]
 3: (RGWCoroutinesManager::run(RGWCoroutine*)+0xbc) [0x559d436a884e]
 4: (RGWRemoteMetaLog::run_sync()+0x125f) [0x559d438cb669]
 5: (RGWMetaSyncStatusManager::run()+0x1c) [0x559d4376cdea]
 6: (RGWMetaSyncProcessorThread::process()+0x1c) [0x559d4376f0a2]
 7: (RGWRadosThread::Worker::entry()+0xf6) [0x559d437137a2]
 8: (Thread::entry_wrapper()+0xc1) [0x559d43a1a0f9]
 9: (Thread::_entry_func(void*)+0x18) [0x559d43a1a02e]
 10: (()+0x761a) [0x7f39ef5c061a]
 11: (clone()+0x6d) [0x7f39eddb259d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
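
The failure mode described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the Ceph code or the actual fix: `Appender`, `drain`, and `fetch_all` are hypothetical names, and the scheduler is reduced to a loop that reports a deadlock when unfinished children can no longer make progress. The point is that an error path which returns without telling the spawned children to finish leaves them blocked forever.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Toy model only: hypothetical types, NOT the Ceph RGW classes.

// A child task that waits for entries until it is told to finish,
// loosely modelled on the role RGWOmapAppend plays in this report.
struct Appender {
  std::deque<std::string> pending;
  bool finishing = false;
  bool done = false;
  std::size_t written = 0;

  void append(std::string entry) {
    assert(!finishing);  // appending after finish() would be a caller bug
    pending.push_back(std::move(entry));
  }
  void finish() { finishing = true; }

  // Runs one step; returns true if the child made progress.
  bool step() {
    if (done) return false;
    if (!pending.empty()) { pending.pop_front(); ++written; return true; }
    if (finishing) { done = true; return true; }
    return false;  // blocked: no input and nobody asked it to finish
  }
};

enum class RunResult { Completed, Deadlock };

// Drives the children until all are done. If some are unfinished and none
// can make progress, report a deadlock (the real manager asserts instead).
RunResult drain(std::vector<Appender>& children) {
  for (;;) {
    bool progressed = false;
    bool all_done = true;
    for (auto& c : children) {
      progressed |= c.step();
      all_done &= c.done;
    }
    if (all_done) return RunResult::Completed;
    if (!progressed) return RunResult::Deadlock;
  }
}

// Parent task: spawns `shards` appenders, then either fails early or feeds
// them. `cleanup_on_error` selects whether the error path shuts them down.
RunResult fetch_all(std::size_t shards, bool fail_early, bool cleanup_on_error) {
  std::vector<Appender> children(shards);
  if (fail_early) {
    if (cleanup_on_error) {
      for (auto& c : children) c.finish();  // the step the error path must not skip
    }
    return drain(children);
  }
  for (std::size_t i = 0; i < shards; ++i) {
    children[i].append("key" + std::to_string(i));
  }
  for (auto& c : children) c.finish();
  return drain(children);
}
```

In this sketch, `fetch_all(4, true, false)` leaves four children blocked, which mirrors the four "still running" stacks in the log, while `fetch_all(4, true, true)` lets them exit cleanly.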

Related issues

Related to rgw - Bug #17568: multisite: race between ReadSyncStatus and InitSyncStatus leads to EIO errors Resolved 10/13/2016
Copied to rgw - Backport #17709: jewel: multisite: coroutine deadlock assertion on error in FetchAllMetaCR Resolved

History

#1 Updated by Casey Bodley 11 months ago

  • Related to Bug #17568: multisite: race between ReadSyncStatus and InitSyncStatus leads to EIO errors added

#2 Updated by Casey Bodley 11 months ago

  • Assignee set to Casey Bodley

#3 Updated by Casey Bodley 11 months ago

  • Status changed from New to Need Review
  • Backport set to jewel

#4 Updated by Yehuda Sadeh 11 months ago

  • Status changed from Need Review to Pending Backport

#5 Updated by Loic Dachary 11 months ago

  • Copied to Backport #17709: jewel: multisite: coroutine deadlock assertion on error in FetchAllMetaCR added

#6 Updated by Nathan Cutler 8 months ago

  • Status changed from Pending Backport to Resolved
