Bug #24603 (Closed)
rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR
Description
With `debug rgw = 20`, the log shows the following (this is the original log, without any filtering):
~~~~~~~rgw log begin~~~~~~~~~~~~~~~~~
...
2018-06-21 17:37:48.791493 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791518 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791527 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791530 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791533 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791536 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791540 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791542 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791545 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791547 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791551 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791555 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791558 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
2018-06-21 17:37:48.791562 7f18c5fd5700 5 data sync: cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: [inc sync] can't do op on key=files/docker/registry/v2/blobs/sha256/44/447f7b07a36a437582ce15010302d233bf28e43dfcc78f97ae4cb90c13df49cb/data need to wait for conflicting operation to complete
2018-06-21 17:37:48.791566 7f18c5fd5700 20 cr:s=0x7f18f6555b20:op=0x7f18fb936800:31RGWBucketShardIncrementalSyncCR: operate()
...
~~~~~~~~~~~~~~rgw log end~~~~~~~~~~~~~~~~~
Setting a breakpoint at RGWBucketShardIncrementalSyncCR::operate shows that `spawned` is empty:
~~~~~~~gdb begin~~~~~~~~~~~
Breakpoint 1, RGWBucketShardIncrementalSyncCR::operate (this=0x7f18fb936800) at /usr/src/debug/ceph-12.2.5.1_sensetime/src/rgw/rgw_data_sync.cc:2521
2521 {
(gdb) p spawned
$1 = {entries = std::vector of length 0, capacity 0}
(gdb) p marker_tracker.key_to_marker
$2 = std::map with 56 elements = {[{name = "files/docker/registry/v2/blobs/sha256/39/39455bb50b023c88fa9eeabef3e851fd778d4b7444418e7d7335a99b165e8b3a/data", instance = "",
ns = ""}] = "00000009686.21305.5", [{name = "files/d..........
~~~~~~~~~~gdb end~~~~~~~~~~~~~~~~~~
The marker_tracker is not empty, but the spawned stacks have all finished (finished with an error, presumably?).
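The state above can be illustrated with a small, self-contained model. This is not the actual Ceph code: the names (`KeyMarkerTracker`, `try_sync_op`) and the simplified logic are hypothetical, and only sketch the hazard implied by the gdb output, that a stale `key_to_marker` entry keeps rejecting the op while an empty `spawned` list means no child stack will ever clear it, so each `operate()` call makes no progress:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified model of the situation in the bug report:
// a tracker that records keys with in-flight operations.
struct KeyMarkerTracker {
  std::map<std::string, std::string> key_to_marker;  // key -> marker

  // Refuse to start an op on a key that already has a pending op.
  bool index_key_to_marker(const std::string& key, const std::string& marker) {
    if (key_to_marker.count(key)) return false;
    key_to_marker[key] = marker;
    return true;
  }

  // What a finishing (or erroring) child stack must do to unblock retries.
  void reset_need_retry(const std::string& key) { key_to_marker.erase(key); }
};

// One iteration of the sync loop. Returns true if progress is possible.
// If the key conflicts, waiting is only safe when some spawned stack is
// still running and will eventually erase the stale entry. With `spawned`
// empty (as in the gdb output), the caller re-invokes this forever --
// the endless "need to wait for conflicting operation" loop in the log.
bool try_sync_op(KeyMarkerTracker& tracker, const std::vector<int>& spawned,
                 const std::string& key, const std::string& marker) {
  if (!tracker.index_key_to_marker(key, marker)) {
    return !spawned.empty();
  }
  return true;
}
```

Under this model, a fix must guarantee that every child stack, including error paths, clears its `key_to_marker` entry before the parent loops back to retry the key.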
Updated by Xinying Song almost 6 years ago
Fix PR: https://github.com/ceph/ceph/pull/22660
I'm not 100% sure about this fix. The problem occurs in our production environment every time we do a lot of put/delete operations. We can't reproduce it in the test environment, perhaps because the machines are all in the same room rather than geographically far apart.
I don't dare change the production code without verification, so I hope someone can confirm this idea or offer suggestions.
Thanks.
Updated by Casey Bodley almost 6 years ago
- Status changed from New to Pending Backport
- Backport set to jewel luminous mimic
Updated by Patrick Donnelly almost 6 years ago
- Copied to Backport #24690: luminous: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR added
Updated by Patrick Donnelly almost 6 years ago
- Copied to Backport #24700: mimic: rgw-multisite: endless loop in RGWBucketShardIncrementalSyncCR added
Updated by Nathan Cutler over 5 years ago
- Status changed from Pending Backport to Resolved
- Backport changed from jewel luminous mimic to luminous mimic
Jewel is EOL