Bug #39992: Multisite sync corruption for large multipart obj - rgw - Ceph

Bug #39992

closed

Multisite sync corruption for large multipart obj

Added by Xiaoxi Chen almost 5 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Source: Community (dev)
Backport: luminous mimic nautilus
Severity: 3 - minor

Description

We have a two-zone multisite setup with zones lvs and slc. It works fine in general, but we got reports from a customer about data corruption/mismatch between the two zones:

root@host:~# s3cmd -c .s3cfg_lvs ls s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index
2019-05-14 04:30 410444223 s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index
root@host-ump:~# s3cmd -c .s3cfg_slc ls s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index
2019-05-14 04:30 62158776 s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index

Object metadata for both zones can be found at:
https://pastebin.com/a5JNb9vb LVS
https://pastebin.com/1MuPJ0k1 SLC

The SLC copy is a single flat object while the LVS copy is a multipart object, which indicates the object was uploaded by the user in LVS and mirrored to SLC. The SLC object is truncated after 62158776 bytes; those first 62158776 bytes are correct.

root@host:~# cmp -l slc_obj lvs_obj
cmp: EOF on slc_obj after byte 62158776
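
For anyone triaging similar reports, the truncated-versus-corrupt distinction above can be checked mechanically once both copies are downloaded. The following is only a minimal sketch: `classify_copy` is a hypothetical helper name, not part of Ceph or s3cmd, and it assumes a `head` that supports `-c` (GNU and BSD both do).

```shell
#!/bin/sh
# classify_copy REPLICA SRC   (hypothetical helper, not a Ceph tool)
# Prints: identical | truncated after byte N | larger | corrupt
classify_copy() {
    replica=$1
    src=$2
    # wc -c pads its output on some platforms, so strip spaces.
    replica_size=$(wc -c < "$replica" | tr -d ' ')
    src_size=$(wc -c < "$src" | tr -d ' ')

    if [ "$replica_size" -gt "$src_size" ]; then
        echo "larger"
    elif head -c "$replica_size" "$src" | cmp -s - "$replica"; then
        # Every byte the replica holds matches the source prefix.
        if [ "$replica_size" -eq "$src_size" ]; then
            echo "identical"
        else
            echo "truncated after byte $replica_size"
        fi
    else
        echo "corrupt"
    fi
}
```

With the two files from this report it would be called as `classify_copy slc_obj lvs_obj`. A "truncated" result points at an interrupted transfer, while "corrupt" would suggest a different failure mode.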

Both bucket sync status and overall sync status look healthy, and the object was created 5 days ago. It looks as though the transfer was terminated partway through while pulling the object content from the source zone (LVS), leaving an incomplete object. It also seems we don't have checksum verification in sync_agent, so the corrupted object stayed in place and was treated as a successful sync.

root@host:~# radosgw-admin --cluster slc_ceph_ump bucket sync status --bucket=ms-nsn-prod-48
realm 2305f95c-9ec9-429b-a455-77265585ef68 (metrics)
zonegroup 9dad103a-3c3c-4f3b-87a0-a15e17b40dae (ebay)
zone 6205e53d-6ce4-4e25-a175-9420d6257345 (slc)
bucket ms-nsn-prod-48[017a0848-cf64-4879-b37d-251f72ff9750.432063.48]

source zone 017a0848-cf64-4879-b37d-251f72ff9750 (lvs)
                full sync: 0/16 shards
                incremental sync: 16/16 shards
                bucket is caught up with source

Related issues 3 (0 open — 3 closed)

Updated by Xiaoxi Chen almost 5 years ago

Re-syncing the bucket does not resolve the inconsistency:

radosgw-admin bucket sync init --source-zone lvs --bucket=ms-nsn-prod-48

root@host:~# radosgw-admin bucket sync status --bucket=ms-nsn-prod-48
realm 2305f95c-9ec9-429b-a455-77265585ef68 (metrics)
zonegroup 9dad103a-3c3c-4f3b-87a0-a15e17b40dae (ebay)
zone 6205e53d-6ce4-4e25-a175-9420d6257345 (slc)
bucket ms-nsn-prod-48[017a0848-cf64-4879-b37d-251f72ff9750.432063.48]

source zone 017a0848-cf64-4879-b37d-251f72ff9750 (lvs)
                full sync: 0/16 shards
                incremental sync: 16/16 shards
                bucket is caught up with source

root@lvscephmon01-ump:~# s3cmd -c .s3cfg_slc ls s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index
2019-05-14 04:30 62158776 s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index

Updated by Casey Bodley almost 5 years ago

Status changed from New to 7
Backport set to luminous mimic nautilus

two fixes to backport:
https://github.com/ceph/ceph/pull/28303
https://github.com/ceph/ceph/pull/28345

Updated by Casey Bodley almost 5 years ago

Status changed from 7 to Pending Backport

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #40144: mimic: Multisite sync corruption for large multipart obj added

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #40145: nautilus: Multisite sync corruption for large multipart obj added

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #40146: luminous: Multisite sync corruption for large multipart obj added

Updated by Vladimir Bashkirtsev almost 5 years ago

The patches seem to fix the issue, but the question now is: how do we resync the invalid files? Any ideas?
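
This thread does not answer the resync question, but a first step is at least finding the affected objects. Since sync status reports the bucket as caught up, a size mismatch between zones is the only signal shown here. The sketch below assumes each zone's listing was saved beforehand with something like `s3cmd -c .s3cfg_lvs ls --recursive s3://ms-nsn-prod-48 > lvs.txt`, and that keys contain no spaces. `size_mismatches` is a hypothetical helper name. It only reports candidates and does not repair anything.

```shell
#!/bin/sh
# size_mismatches SOURCE_LISTING REPLICA_LISTING   (hypothetical helper)
# Each listing line is expected to look like the s3cmd output above:
#   2019-05-14 04:30 410444223 s3://bucket/key
size_mismatches() {
    awk '
        # First file: remember the size recorded in the source zone.
        NR == FNR { src[$4] = $3; next }
        # Second file: report keys present in both zones with different sizes.
        ($4 in src) && src[$4] != $3 {
            print $4, "source=" src[$4], "replica=" $3
        }
    ' "$1" "$2"
}
```

Objects missing entirely from the replica are not reported by this sketch, and same-size corruption would need a content comparison instead.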

Updated by Nathan Cutler about 3 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
