Bug #21591

closed

RGW multisite does not sync all objects

Added by Anonymous over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I set up multisite sync between two Luminous clusters. The clusters were deployed with ceph-ansible. The sync seemed to work fine in both directions when I tested with some bucket operations and small objects. However, when I actually started using it as storage for a Docker registry, I noticed that not all objects sync correctly.

master zone:

# s3cmd --config s3cfg_s3_bccl_tda du s3://tda-registry
9090457213 1120 objects s3://tda-registry/

secondary zone:

# s3cmd --config s3cfg_s3_bccm_tda du s3://tda-registry
851591006 943 objects s3://tda-registry/
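For completeness, one way to pinpoint exactly which keys are missing on the secondary is to diff recursive listings from both endpoints. A rough sketch using the same s3cmd configs as above (the /tmp file paths are just placeholders):

# list all keys in each zone, then diff (the /tmp paths are placeholders)
s3cmd --config s3cfg_s3_bccl_tda ls --recursive s3://tda-registry | awk '{print $NF}' | sort > /tmp/master.keys
s3cmd --config s3cfg_s3_bccm_tda ls --recursive s3://tda-registry | awk '{print $NF}' | sort > /tmp/secondary.keys
diff /tmp/master.keys /tmp/secondary.keys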

Although the buckets are clearly not in sync, the sync status keeps reporting that everything is fine and caught up with the source:

master zone:

realm 0f33e8d4-825c-464b-90c5-87a44d99f6fc (tda)
zonegroup 5ce69d1a-097d-4ef7-ae0f-6f356f76de0c (be)
zone 6c82776a-a9c0-46ba-b89a-500958e65b15 (bccl-tda)
metadata sync no sync (zone is master)
data sync source: 2c7a4a95-1922-49fb-bf5f-f550309d611d (bccm-tda)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source

secondary zone:
realm 0f33e8d4-825c-464b-90c5-87a44d99f6fc (tda)
zonegroup 5ce69d1a-097d-4ef7-ae0f-6f356f76de0c (be)
zone 2c7a4a95-1922-49fb-bf5f-f550309d611d (bccm-tda)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: 6c82776a-a9c0-46ba-b89a-500958e65b15 (bccl-tda)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source

When I push more containers to the registry that uses this bucket, the sync does still appear to run, but some objects seem to be skipped:

master:

# s3cmd --config s3cfg_s3_bccl_tda du s3://tda-registry
9585013621 1206 objects s3://tda-registry/

secondary:

# s3cmd --config s3cfg_s3_bccm_tda du s3://tda-registry
960073106 1025 objects s3://tda-registry/

Judging by the bucket sizes, it looks to me like the larger objects are the ones not being synced.

When I disable and re-enable the sync on this bucket, the buckets get in sync again. Sometimes I need to disable/enable the sync 2 or 3 times to have all objects in sync.

Any pointers are greatly appreciated.


Files

secondary-rgw-rgw1.log00.bz2 (869 KB) secondary-rgw-rgw1.log00.bz2 secondary site, one of the rgw's Anonymous, 10/09/2017 12:07 PM
secondary-rgw-rgw1.log01.bz2 (868 KB) secondary-rgw-rgw1.log01.bz2 next part of debug log secondary site, first rgw Anonymous, 10/10/2017 05:45 AM
secondary-single_rgw.log00.bz2 (568 KB) secondary-single_rgw.log00.bz2 Anonymous, 10/10/2017 07:53 AM
secondary-single_rgw.log01.bz2 (555 KB) secondary-single_rgw.log01.bz2 Anonymous, 10/10/2017 07:53 AM
secondary-single_rgw.log02.bz2 (551 KB) secondary-single_rgw.log02.bz2 Anonymous, 10/10/2017 07:53 AM
secondary-single_rgw.log03.bz2 (556 KB) secondary-single_rgw.log03.bz2 Anonymous, 10/10/2017 07:53 AM
secondary-single_rgw.log04.bz2 (550 KB) secondary-single_rgw.log04.bz2 Anonymous, 10/10/2017 07:53 AM
secondary-single_rgw.log05.bz2 (554 KB) secondary-single_rgw.log05.bz2 Anonymous, 10/10/2017 07:53 AM
secondary-single_rgw.log06.bz2 (550 KB) secondary-single_rgw.log06.bz2 Anonymous, 10/10/2017 07:53 AM
secondary-single_rgw_sync.log00.bz2 (608 KB) secondary-single_rgw_sync.log00.bz2 Anonymous, 10/10/2017 07:54 AM
secondary-single_rgw_sync.log01.bz2 (640 KB) secondary-single_rgw_sync.log01.bz2 Anonymous, 10/10/2017 07:54 AM
secondary-single_rgw_sync.log02.bz2 (585 KB) secondary-single_rgw_sync.log02.bz2 Anonymous, 10/10/2017 07:54 AM
secondary-single_rgw_sync.log03.bz2 (380 KB) secondary-single_rgw_sync.log03.bz2 Anonymous, 10/10/2017 07:55 AM

Related issues 1 (0 open, 1 closed)

Related to rgw - Bug #21772: multisite: multipart uploads fail to sync (Resolved, Casey Bodley, 10/12/2017)

Actions #1

Updated by Yehuda Sadeh over 6 years ago

Can you provide rgw logs (debug rgw = 20) for the sync process (when it doesn't sync these objects)?
Also, try to look at:

$ radosgw-admin sync error list
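(For reference, the RGW debug level can be raised either in ceph.conf for the RGW instance or at runtime via the admin socket; a sketch, where the instance name client.rgw.rgw1 is a placeholder:)

# in ceph.conf for the RGW instance (instance name is a placeholder), then restart the RGW:
[client.rgw.rgw1]
debug rgw = 20

# or at runtime via the admin socket, without a restart:
ceph daemon client.rgw.rgw1 config set debug_rgw 20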

Actions #2

Updated by Yehuda Sadeh over 6 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Anonymous over 6 years ago

The sync error list gives around 70 instances of "failed to sync bucket instance: (16) Device or resource busy", but the most recent one is already a few days old.

I'll try to provide the debug logs as soon as possible.
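(In case it helps narrow things down, per-bucket and per-source sync state can also be inspected directly; a sketch, assuming these radosgw-admin subcommands are available in this release:)

# per-bucket and per-source sync state (subcommand availability may vary by release)
radosgw-admin bucket sync status --bucket=tda-registry
radosgw-admin data sync status --source-zone=bccl-tda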

Actions #4

Updated by Anonymous over 6 years ago

In the meantime, I updated the clusters to 12.2.1 but the problem persists.

I also see these errors in the secondary site RGW logs:

meta sync: ERROR: failed to read mdlog info with (2) No such file or directory

When I googled this, I found this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1494059

I don't know if it's related, but I also have this problem:

# radosgw-admin reshard list
[2017-10-09 13:54:18.291188 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000000
2017-10-09 13:54:18.291991 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000001
2017-10-09 13:54:18.292498 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000002
2017-10-09 13:54:18.293024 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000003
2017-10-09 13:54:18.293835 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000004
2017-10-09 13:54:18.294383 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000005
2017-10-09 13:54:18.294918 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000006
2017-10-09 13:54:18.295411 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000007
2017-10-09 13:54:18.295906 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000008
2017-10-09 13:54:18.296457 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000009
2017-10-09 13:54:18.296964 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000010
2017-10-09 13:54:18.297438 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000011
2017-10-09 13:54:18.297900 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000012
2017-10-09 13:54:18.298333 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000013
2017-10-09 13:54:18.298821 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000014
]
2017-10-09 13:54:18.299378 7f9a8cbe5c40 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000015
Still working on the debug logs.

Actions #5

Updated by Anonymous over 6 years ago

Here is a part of the secondary site log, I hope this is sufficient.

Actions #6

Updated by Yehuda Sadeh over 6 years ago

This could indicate a connection issue between secondary and master:

2017-10-09 12:22:04.937108 7faa12a2d700 10 meta sync: cr:s=0x7faa3f751560:op=0x7faa40487800:18RGWMetaSyncShardCR: failed to fetch more log entries, retcode=-11

Also, I do see a lot of messages saying that a log shard is already leased. Either another radosgw process is running, or a lease is still locked from a previous run (this looks like a fresh restart), so it may take a few more minutes of log for these to clear. Can you provide the name of an object that should have been synced?

Actions #7

Updated by Anonymous over 6 years ago

Thanks for your reply.

There are two RGWs per site, so at that moment another RGW was running. Would you like me to stop it and recapture the debug logs?

Here is the next part of the current debug log; I hope this helps. I will try to find an object that needs syncing.

Updated by Anonymous over 6 years ago

I created some new debug logs with only 1 rgw running in the secondary site. The first logfile secondary-single_rgw.log is a fresh start with debug turned on. I let it run for a few minutes.

One of the objects that was not being synced was this one:

docker/registry/v2/blobs/sha256/00/00276fc02cc7963b2677f607a414360d0ba6c2d167120167975a8733957bc83e/data

The logfile secondary-single_rgw.log does not contain a single entry of any of the objects that need syncing.

While the rgw was running in debug, I started a new logfile secondary-single_rgw_sync.log and re-enabled the sync with

radosgw-admin bucket sync disable --bucket tda-registry-bug
radosgw-admin bucket sync enable --bucket tda-registry-bug

At that moment the objects started to sync. secondary-single_rgw_sync.log contains the sync.

Actions #10

Updated by Matt Benjamin over 6 years ago

  • Assignee set to Yehuda Sadeh

@Yehuda Sadeh, feel free to re-assign; I know you've been working on it on the list.

Actions #11

Updated by Matt Benjamin over 6 years ago

  • Status changed from New to In Progress
Actions #12

Updated by Anonymous over 6 years ago

Since I came across some bug reports stating that dynamic resharding should not be used with multisite, I disabled dynamic resharding on both clusters and recreated the bucket. The problem persists.
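(For reference, dynamic resharding is controlled by the rgw_dynamic_resharding option; a sketch of disabling it in ceph.conf on the RGW hosts, followed by an RGW restart:)

# in ceph.conf on the RGW hosts, then restart the gateways:
[global]
rgw dynamic resharding = false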

Actions #13

Updated by Anonymous over 6 years ago

Is there an easy way to verify that this is the same bug fixed here: https://github.com/ceph/ceph/pull/18271 ?

Since this only affects bigger objects, it might make sense that it is an issue with multipart uploads?
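(One rough way to check whether an affected object was a multipart upload is to look at its ETag and manifest on the master zone; multipart uploads normally carry an ETag ending in "-<part count>". A sketch, using the blob path from an earlier comment as the example object:)

# the blob path below is the example object from an earlier comment
s3cmd --config s3cfg_s3_bccl_tda info s3://tda-registry/docker/registry/v2/blobs/sha256/00/00276fc02cc7963b2677f607a414360d0ba6c2d167120167975a8733957bc83e/data
radosgw-admin object stat --bucket=tda-registry --object=docker/registry/v2/blobs/sha256/00/00276fc02cc7963b2677f607a414360d0ba6c2d167120167975a8733957bc83e/data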

Actions #14

Updated by Casey Bodley over 6 years ago

  • Related to Bug #21772: multisite: multipart uploads fail to sync added
Actions #15

Updated by Casey Bodley over 6 years ago

Yeah, there is a known issue with multipart object sync in luminous. I've added the related issue.

Actions #16

Updated by Yehuda Sadeh over 6 years ago

It could be an issue with sync of resharded buckets. At the moment, syncing a resharded bucket on the secondary zone could lead to the issue we're seeing here.

Actions #17

Updated by Yehuda Sadeh over 6 years ago

Ah, never mind. Casey's last comment is probably correct.

Actions #18

Updated by Yehuda Sadeh over 6 years ago

Can you make sure that this one is fixed once 12.2.2 is out?

Actions #19

Updated by Anonymous over 6 years ago

I certainly will. Hope it will be out soon.

Actions #20

Updated by Casey Bodley over 6 years ago

  • Status changed from In Progress to Need More Info
Actions #21

Updated by Anonymous over 6 years ago

I upgraded the clusters to 12.2.2 and the multisite sync works as expected now.

Thanks a lot for the fix, this issue can be closed.

Actions #22

Updated by Casey Bodley over 6 years ago

  • Status changed from Need More Info to Resolved
