
Bug #24603

rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR

Added by Xinying Song 7 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
Start date:
06/21/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

With debug_rgw=20, the log shows the following (this is the original log, without any filtering):
~~~~~~~~rgw log begin~~~~~~~~~~~~~~~~
...
2018-06-21 17:37:48.791493 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791518 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791527 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791530 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791533 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791536 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791540 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791542 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791545 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791547 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791551 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791555 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791558 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791562 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791566 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
...
~~~~~~~~~~~~~~~rgw log end~~~~~~~~~~~~~~~~

Setting a breakpoint at RGWBucketShardIncrementalSyncCR::operate shows that `spawned` is empty:
~~~~~~~~gdb begin~~~~~~~~~~
Breakpoint 1, RGWBucketShardIncrementalSyncCR::operate (this=0x7f18fb936800) at /usr/src/debug/ceph-12.2.5.1_sensetime/src/rgw/rgw_data_sync.cc:2521
2521 {
(gdb) p spawned
$1 = {entries = std::vector of length 0, capacity 0}
(gdb) p marker_tracker.key_to_marker
$2 = std::map with 56 elements = {[{name = "files/docker/registry/v2/blobs/sha256/39/39455bb50b023c88fa9eeabef3e851fd778d4b7444418e7d7335a99b165e8b3a/data", instance = "",
ns = ""}] = "00000009686.21305.5", [{name = "files/d..........
~~~~~~~~~~~gdb end~~~~~~~~~~~~~~~~~

The marker_tracker is not empty, but the spawned stacks have all finished (presumably finished with an error?). So the coroutine keeps seeing a "conflicting operation" on the key, yet there is no spawned stack left that could ever complete and clear it, and operate() spins forever.


Related issues

Copied to rgw - Backport #24690: luminous: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR Resolved
Copied to rgw - Backport #24700: mimic: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR Resolved

History

#1 Updated by Xinying Song 7 months ago

Fix PR: https://github.com/ceph/ceph/pull/22660

I'm not 100% sure about this fix. The problem occurs in our production environment every time we do a lot of puts/deletes. The test environment can't reproduce it, maybe because the machines are all in the same room rather than geographically far apart.
I don't dare change the production code without verification, so I hope someone can confirm this idea or offer suggestions.

Thanks.

#2 Updated by Casey Bodley 7 months ago

  • Status changed from New to Pending Backport
  • Backport set to jewel luminous mimic

#3 Updated by Patrick Donnelly 7 months ago

  • Copied to Backport #24690: luminous: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR added

#5 Updated by Patrick Donnelly 7 months ago

  • Copied to Backport #24700: mimic: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR added

#6 Updated by Nathan Cutler 3 months ago

  • Status changed from Pending Backport to Resolved
  • Backport changed from jewel luminous mimic to luminous mimic

Jewel is EOL
