Bug #21591
RGW multisite does not sync all objects
Status: Closed
Description
I set up multisite sync between two Luminous clusters, deployed with ceph-ansible. The sync seemed to work fine in both directions when I tested with some bucket operations and small objects. However, once I actually started using it as storage for a docker registry, I noticed that not all objects sync correctly.
master zone:
- s3cmd --config s3cfg_s3_bccl_tda du s3://tda-registry
9090457213 1120 objects s3://tda-registry/
secondary zone:
- s3cmd --config s3cfg_s3_bccm_tda du s3://tda-registry
851591006 943 objects s3://tda-registry/
Although the buckets are clearly not in sync, the sync status keeps reporting that everything is fine and caught up with the source:
master zone:
realm 0f33e8d4-825c-464b-90c5-87a44d99f6fc (tda)
zonegroup 5ce69d1a-097d-4ef7-ae0f-6f356f76de0c (be)
zone 6c82776a-a9c0-46ba-b89a-500958e65b15 (bccl-tda)
metadata sync no sync (zone is master)
data sync source: 2c7a4a95-1922-49fb-bf5f-f550309d611d (bccm-tda)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
secondary zone:
realm 0f33e8d4-825c-464b-90c5-87a44d99f6fc (tda)
zonegroup 5ce69d1a-097d-4ef7-ae0f-6f356f76de0c (be)
zone 2c7a4a95-1922-49fb-bf5f-f550309d611d (bccm-tda)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: 6c82776a-a9c0-46ba-b89a-500958e65b15 (bccl-tda)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
When I push more containers to the registry that uses this bucket, the sync does still pick up new objects, but some seem to be skipped:
master zone:
- s3cmd --config s3cfg_s3_bccl_tda du s3://tda-registry
9585013621 1206 objects s3://tda-registry/
secondary zone:
- s3cmd --config s3cfg_s3_bccm_tda du s3://tda-registry
960073106 1025 objects s3://tda-registry/
Judging by the bucket sizes, it looks to me like the larger objects are the ones not being synced.
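(A quick way to see exactly which objects differ between the zones — the output file names here are arbitrary:)
$ s3cmd --config s3cfg_s3_bccl_tda ls --recursive s3://tda-registry | awk '{print $3, $4}' | sort > master.txt
$ s3cmd --config s3cfg_s3_bccm_tda ls --recursive s3://tda-registry | awk '{print $3, $4}' | sort > secondary.txt
$ diff master.txt secondary.txt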
When I disable and re-enable sync on this bucket, the buckets get back in sync. Sometimes I need to disable/enable the sync two or three times to get all objects in sync.
Any pointers are greatly appreciated.
Updated by Yehuda Sadeh over 6 years ago
Can you provide rgw logs (debug rgw = 20) for the sync process (when it doesn't sync these objects)?
Also, try to look at:
$ radosgw-admin sync error list
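(For reference, one way to get those logs — the rgw instance name below is a placeholder for your deployment — is to set this in ceph.conf on the rgw host:)
[client.rgw.<name>]
debug rgw = 20
(or raise it at runtime via the admin socket; the socket path here is just an example:)
$ ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set debug_rgw 20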
Updated by Anonymous over 6 years ago
The sync error list gives around 70 instances of "failed to sync bucket instance: (16) Device or resource busy", but the most recent one is already a few days old.
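(counted with something like:)
$ radosgw-admin sync error list | grep -c 'Device or resource busy'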
I'll try to provide the debug logs asap.
Updated by Anonymous over 6 years ago
In the meantime I upgraded the clusters to 12.2.1, but the problem persists.
I also see these errors in the secondary site's rgw logs:
meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
When I googled this, I found this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1494059
I don't know if it is related, but I have this problem too:
- radosgw-admin reshard list
[2017-10-09 13:54:18.291188 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000000
2017-10-09 13:54:18.291991 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000001
2017-10-09 13:54:18.292498 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000002
2017-10-09 13:54:18.293024 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000003
2017-10-09 13:54:18.293835 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000004
2017-10-09 13:54:18.294383 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000005
2017-10-09 13:54:18.294918 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000006
2017-10-09 13:54:18.295411 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000007
2017-10-09 13:54:18.295906 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000008
2017-10-09 13:54:18.296457 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000009
2017-10-09 13:54:18.296964 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000010
2017-10-09 13:54:18.297438 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000011
2017-10-09 13:54:18.297900 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000012
2017-10-09 13:54:18.298333 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000013
2017-10-09 13:54:18.298821 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000014
2017-10-09 13:54:18.299378 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000015
]
Still working on the debug logs.
Updated by Anonymous over 6 years ago
Here is part of the secondary site log; I hope this is sufficient.
Updated by Yehuda Sadeh over 6 years ago
This could indicate a connection issue between secondary and master:
2017-10-09 12:22:04.937108 7faa12a2d700 10 meta sync: cr:s=0x7faa3f751560:op=0x7faa40487800:18RGWMetaSyncShardCR: failed to fetch more log entries, retcode=-11
Also, I see a lot of messages saying that a log shard is already leased. Either another radosgw process is running, or a lease is still locked from a previous run (this looks like a fresh restart), so you may need a few more minutes of log for these to clear. Can you provide the name of an object that should have been synced?
Updated by Anonymous over 6 years ago
Thanks for your reply.
There are two RGWs per site, so another RGW was running at that moment. Would you like me to stop it and recapture the debug logs?
Here is the next part of the current debug log; I hope this helps. I will try to find an object that needs syncing.
Updated by Anonymous over 6 years ago
- File secondary-single_rgw.log00.bz2 added
- File secondary-single_rgw.log01.bz2 added
- File secondary-single_rgw.log02.bz2 added
- File secondary-single_rgw.log03.bz2 added
- File secondary-single_rgw.log04.bz2 added
- File secondary-single_rgw.log05.bz2 added
- File secondary-single_rgw.log06.bz2 added
- File secondary-single_rgw_sync.log00.bz2 added
- File secondary-single_rgw_sync.log01.bz2 added
- File secondary-single_rgw_sync.log02.bz2 added
I created some new debug logs with only one RGW running in the secondary site. The first logfile, secondary-single_rgw.log, is a fresh start with debugging turned on; I let it run for a few minutes.
One of the objects that was not being synced was this one:
docker/registry/v2/blobs/sha256/00/00276fc02cc7963b2677f607a414360d0ba6c2d167120167975a8733957bc83e/data
The logfile secondary-single_rgw.log does not contain a single entry for any of the objects that need syncing.
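(For example, a search along these lines over the attached logs comes up empty:)
$ bzgrep 00276fc02cc7963b2677f607a414360d0ba6c2d167120167975a8733957bc83e secondary-single_rgw.log00.bz2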
While the RGW was running in debug mode, I started a new logfile, secondary-single_rgw_sync.log, and re-enabled the sync with:
radosgw-admin bucket sync disable --bucket tda-registry-bug
radosgw-admin bucket sync enable --bucket tda-registry-bug
At that moment the objects started to sync. secondary-single_rgw_sync.log contains the sync.
Updated by Matt Benjamin over 6 years ago
- Assignee set to Yehuda Sadeh
@Yehuda Sadeh, feel free to re-assign; I know you've been working on it on the list.
Updated by Matt Benjamin over 6 years ago
- Status changed from New to In Progress
Updated by Anonymous over 6 years ago
Since I came across some bug reports stating that dynamic resharding should not be used with multisite, I disabled dynamic resharding on both clusters and recreated the bucket. The problem persists.
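(For anyone following along: dynamic resharding can be disabled with this option in ceph.conf on the rgw hosts — where exactly it goes depends on your deployment:)
rgw dynamic resharding = false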
Updated by Anonymous over 6 years ago
Is there an easy way to verify this is the same bug fixed here: https://github.com/ceph/ceph/pull/18271?
Since this only affects bigger objects, could it be an issue with multipart uploads?
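(One way to check might be to upload one object below and one above s3cmd's default 15 MB multipart threshold (--multipart-chunk-size-mb) and see whether only the large one fails to sync — the file names are just examples:)
$ dd if=/dev/urandom of=small.bin bs=1M count=1
$ dd if=/dev/urandom of=big.bin bs=1M count=64
$ s3cmd --config s3cfg_s3_bccl_tda put small.bin big.bin s3://tda-registry/
(then, on the secondary zone:)
$ s3cmd --config s3cfg_s3_bccm_tda ls s3://tda-registry/ | grep -E 'small|big'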
Updated by Casey Bodley over 6 years ago
- Related to Bug #21772: multisite: multipart uploads fail to sync added
Updated by Casey Bodley over 6 years ago
Yeah, there is a known issue with multipart object sync in Luminous. I've added the related issue.
Updated by Yehuda Sadeh over 6 years ago
Could be an issue with sync of resharded buckets. At the moment, sync of a resharded bucket on the secondary zone could lead to the issue we're seeing here.
Updated by Yehuda Sadeh over 6 years ago
Ah, never mind. Casey's last comment is probably correct.
Updated by Yehuda Sadeh over 6 years ago
Can you verify that this is fixed once 12.2.2 is out?
Updated by Anonymous over 6 years ago
I certainly will. Hope it will be out soon.
Updated by Casey Bodley over 6 years ago
- Status changed from In Progress to Need More Info
Updated by Anonymous over 6 years ago
I upgraded the clusters to 12.2.2 and the multisite sync works as expected now.
Thanks a lot for the fix, this issue can be closed.
Updated by Casey Bodley over 6 years ago
- Status changed from Need More Info to Resolved