Bug #24603 (closed)

rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR

Added by Xinying Song almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags:
Backport: luminous mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With rgw_debug=20, the log shows the following (this is the original log, without any filtering):
~~~~~~~rgw log begin~~~~~~~~~~~~~~~~~
...
2018-06-21 17:37:48.791493 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791518 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791527 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791530 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791533 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791536 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791540 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791542 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791545 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791547 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791551 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791555 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791558 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791562 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791566 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
...
~~~~~~~~~~~~~~rgw log end~~~~~~~~~~~~~~~~~
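
The repeated operate() calls with no progress suggest the coroutine keeps getting rescheduled even though nothing is left that could wake it up. Below is a toy, self-contained C++ model (all names hypothetical, not Ceph code) of how a "wait for a conflicting operation" path can degenerate into a busy loop when the set of spawned child stacks is already empty:
~~~~~~~toy model begin~~~~~~~~~~~
// Toy model (hypothetical names, not Ceph code): the coroutine asks to wait
// for a child stack to complete, but no child stacks are outstanding, so
// nothing can ever wake it and a naive scheduler just re-invokes operate()
// forever -- matching the repeated operate() lines in the log above.
#include <cstdio>
#include <vector>

struct ToyCoroutine {
  std::vector<int> spawned;    // ids of child stacks still running (empty here)
  bool key_conflict = true;    // the key is still registered as "in flight"

  // Returns true when the coroutine wants to suspend until a child finishes.
  bool operate(int iteration) {
    std::printf("operate() #%d\n", iteration);
    if (key_conflict) {
      std::printf("  can't do op on key, need to wait for conflicting op\n");
      return true;
    }
    return false;
  }
};

int main() {
  ToyCoroutine cr;                       // cr.spawned is empty, as in the gdb session below
  for (int i = 0; i < 5; ++i) {          // bounded here; unbounded in the bug
    bool wants_child = cr.operate(i);
    if (wants_child && cr.spawned.empty()) {
      // No child will ever complete, so the coroutine is immediately
      // re-queued instead of parked -> the endless loop.
      continue;
    }
    break;
  }
  return 0;
}
~~~~~~~toy model end~~~~~~~~~~~~~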

Setting a breakpoint at RGWBucketShardIncrementalSyncCR::operate shows that `spawned` is empty:
~~~~~~~gdb begin~~~~~~~~~~~
Breakpoint 1, RGWBucketShardIncrementalSyncCR::operate (this=0x7f18fb936800) at /usr/src/debug/ceph-12.2.5.1_sensetime/src/rgw/rgw_data_sync.cc:2521
2521 {
(gdb) p spawned
$1 = {entries = std::vector of length 0, capacity 0}
(gdb) p marker_tracker.key_to_marker
$2 = std::map with 56 elements = {[{name = "files/docker/registry/v2/blobs/sha256/39/39455bb50b023c88fa9eeabef3e851fd778d4b7444418e7d7335a99b165e8b3a/data", instance = "",
ns = ""}] = "00000009686.21305.5", [{name = "files/d..........
~~~~~~~~~~gdb end~~~~~~~~~~~~~~~~~~

The marker_tracker is not empty, but the spawned stacks have all finished (finished with an error, right?).
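
For illustration only, here is a minimal sketch (hypothetical names, not the actual RGW marker tracker) of why a key that stays registered in key_to_marker blocks every later operation on the same key: the conflict check can only succeed again after the earlier op deregisters the key, which never happens if that op already failed without cleaning up:
~~~~~~~toy model begin~~~~~~~~~~~
// Hypothetical sketch (not the real RGW marker tracker): an op registers its
// key before it runs and must deregister it when it finishes, even on error.
// A key left behind makes every later op on the same key look like a
// conflict, so the waiter can never proceed.
#include <cstdio>
#include <map>
#include <string>

struct ToyMarkerTracker {
  std::map<std::string, std::string> key_to_marker;  // keys of in-flight ops

  // Returns false if another op on the same key is still registered.
  bool start_op(const std::string& key, const std::string& marker) {
    return key_to_marker.emplace(key, marker).second;
  }

  // Must run on every completion path, including failures.
  void finish_op(const std::string& key) { key_to_marker.erase(key); }
};

int main() {
  ToyMarkerTracker t;
  t.start_op("blobs/sha256/44/.../data", "00000009686.21305.5");
  // Suppose that op failed and finish_op() was never called; a retry on the
  // same key now sees a phantom conflict forever:
  if (!t.start_op("blobs/sha256/44/.../data", "00000009686.21305.6")) {
    std::printf("need to wait for conflicting operation to complete\n");
  }
  return 0;
}
~~~~~~~toy model end~~~~~~~~~~~~~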


Related issues (2 total: 0 open, 2 closed)

Copied to rgw - Backport #24690: luminous: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR (Resolved, Nathan Cutler)
Copied to rgw - Backport #24700: mimic: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR (Resolved, Nathan Cutler)
#1

Updated by Xinying Song almost 6 years ago

Fix PR: https://github.com/ceph/ceph/pull/22660

I'm not 100% sure about this fix. The problem occurs in our production environment every time we do a lot of puts/deletes. We can't reproduce it in the test environment, maybe because the machines there are all in the same room rather than geographically far apart.
I don't dare change the production code without verification, so I hope someone can confirm this idea or offer some suggestions.

Thanks.
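
Purely as an illustration of the general idea (hypothetical names, not the actual change in PR 22660), a guard of the following shape avoids suspending on a child completion that can never arrive:
~~~~~~~toy model begin~~~~~~~~~~~
// Illustrative sketch only (hypothetical names, not the code in PR 22660):
// before suspending until "a child stack completes", check that at least one
// child stack is actually outstanding; otherwise handle the stale key
// registration instead of waiting for a wakeup that cannot arrive.
#include <cstdio>
#include <set>
#include <string>
#include <vector>

struct ToyIncSync {
  std::vector<int> spawned;          // outstanding child stacks (ids)
  std::set<std::string> in_flight;   // keys currently registered as conflicting

  // Worth suspending only if some child can still complete and wake us.
  bool should_wait_for_child() const { return !spawned.empty(); }
};

int main() {
  ToyIncSync sync;
  sync.in_flight.insert("blobs/sha256/44/.../data");
  if (!sync.should_wait_for_child()) {
    // Nothing will ever signal completion; drop the stale registration (or
    // abort/retry the shard sync) instead of spinning on wait-for-child.
    std::printf("no children outstanding; clearing stale key registrations\n");
    sync.in_flight.clear();
  }
  return 0;
}
~~~~~~~toy model end~~~~~~~~~~~~~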

#2

Updated by Casey Bodley almost 6 years ago

  • Status changed from New to Pending Backport
  • Backport set to jewel luminous mimic
#3

Updated by Patrick Donnelly almost 6 years ago

  • Copied to Backport #24690: luminous: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR added
#5

Updated by Patrick Donnelly almost 6 years ago

  • Copied to Backport #24700: mimic: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR added
#6

Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved
  • Backport changed from jewel luminous mimic to luminous mimic

Jewel is EOL
