Bug #55390

closed

rgw-ms/resharding: Observing sync inconsistencies: ~50K out of 20M objects did not sync.

Added by Vidushi Mishra about 2 years ago. Updated almost 2 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
multisite-reshard
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Files

55390_ sync error list - sec (6.08 KB) - sync error list on the secondary. Vidushi Mishra, 04/21/2022 07:09 AM
Actions #1

Updated by Vidushi Mishra about 2 years ago

1. ceph version 17.0.0-10783-ge38464a1 (e38464a10ae9e8c7b43bae5a9a7395eb2cbb2444) quincy (dev)

2. Steps to reproduce:

i. Create a multi-site configuration with 14 RGWs on each site [4 for multisite sync and 10 for client I/O.]
ii. The 4 RGWs dedicated to multisite sync are not behind any load balancer.
iii. Create a bucket 'test-sync-no-lb-1' and upload 20M objects [10M from each site.]
iv. Wait for the workload to complete.
v. Monitor sync and wait for it to complete.
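The monitoring in step v can be sketched with the usual radosgw-admin subcommands. The bucket name comes from this report; which zone each command is run against is an assumption:

```shell
# Overall multisite sync status, run on the secondary zone.
radosgw-admin sync status

# Per-bucket sync status for the bucket under test.
radosgw-admin bucket sync status --bucket=test-sync-no-lb-1

# Compare object counts across sites; num_objects should converge
# to 20M on both zones once sync completes.
radosgw-admin bucket stats --bucket=test-sync-no-lb-1
```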

3. Result:

i. We observe 411672 objects not synced to the secondary.
ii. 'radosgw-admin sync status' on the secondary site reports 128 shards recovering.
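The "128 shards recovering" figure is read from the sync status output. As a minimal sketch, a parser for that line; the sample excerpt is an assumption modeled on the usual 'radosgw-admin sync status' output format, not taken from this cluster:

```python
import re

def count_recovering_shards(sync_status_output: str) -> int:
    """Return the recovering-shard count reported by
    'radosgw-admin sync status', or 0 if no such line appears."""
    m = re.search(r"(\d+)\s+shards\s+are\s+recovering", sync_status_output)
    return int(m.group(1)) if m else 0

# Hypothetical excerpt in the usual output format:
sample = """
  data sync source: 1234abcd (secondary)
        syncing
        full sync: 0/128 shards
        incremental sync: 128/128 shards
        data is behind on 128 shards
        128 shards are recovering
        recovering shards: [0,1,2,3]
"""
print(count_recovering_shards(sample))  # -> 128
```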

4. Additional info:

i. On both sites, 'ceph -s' shows PGs in the backfilling state.
ii. logs:
- period get: http://magna002.ceph.redhat.com/ceph-qe-logs/vidushi/upstream-dbr-2022/55390/period-get
- primary bucket stats: http://magna002.ceph.redhat.com/ceph-qe-logs/vidushi/upstream-dbr-2022/55390/bucket-stats-pri
- secondary bucket stats: http://magna002.ceph.redhat.com/ceph-qe-logs/vidushi/upstream-dbr-2022/55390/bucket-stats-sec
- ceph status (secondary): http://magna002.ceph.redhat.com/ceph-qe-logs/vidushi/upstream-dbr-2022/55390/ceph-s_sec

Actions #3

Updated by Vidushi Mishra about 2 years ago

We see errors like "failed to sync object (2300) Unknown error 2300" in the 'radosgw-admin sync error list' output on the secondary site.
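For reference, a sketch of inspecting the sync error log on the secondary; both subcommands exist in radosgw-admin, though whether trimming is appropriate here depends on the investigation:

```shell
# Dump the sync error log on the secondary site.
radosgw-admin sync error list

# After the underlying issue is resolved, old entries can be cleared.
radosgw-admin sync error trim
```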

Actions #4

Updated by Casey Bodley almost 2 years ago

  • Status changed from New to Can't reproduce

If the cluster isn't healthy, we can't really treat this as an RGW bug.
