Bug #24775

rgw-multisite: radosgw-admin misreport sync status

Added by Xinying Song over 5 years ago. Updated over 4 years ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport: jewel luminous mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently we encountered a situation where data in different zones was far from consistent, but `radosgw-admin sync status` reported that each zone had caught up with the others.

We upload data to zone A, and the other zones only sync data from zone A. After all data has been uploaded to zone A successfully, running `radosgw-admin bucket stats --bucket=<bucket>` in the other zones shows that the data size in those zones is still increasing. However, `radosgw-admin sync status` in these zones reports that they have caught up with zone A.
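The symptom is consistent with a status check that compares sync markers rather than object data. The sketch below is a simplified model under that assumption, not the radosgw-admin implementation: `shards_behind`, `remote_pos` and `local_marker` are hypothetical names, and real RGW markers are strings rather than integers. If this zone's marker for every shard has reached the source zone's log position, the check reports "caught up" regardless of whether the objects behind those log entries were fully applied, which would let `bucket stats` keep growing after `sync status` says sync is complete.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a marker-based status check (not radosgw-admin code).
// remote_pos[i]   : newest position in the source zone's data log shard i.
// local_marker[i] : position this zone's data sync has recorded for shard i.
// Returns how many shards are behind; 0 would be reported as "caught up".
// Note: nothing here inspects objects, only marker positions.
std::size_t shards_behind(const std::vector<std::uint64_t>& remote_pos,
                          const std::vector<std::uint64_t>& local_marker) {
  std::size_t behind = 0;
  for (std::size_t i = 0; i < remote_pos.size(); ++i) {
    // A shard with no recorded marker, or a marker short of the remote
    // position, counts as behind.
    if (i >= local_marker.size() || local_marker[i] < remote_pos[i]) {
      ++behind;
    }
  }
  return behind;
}
```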

History

#2 Updated by Casey Bodley over 5 years ago

  • Assignee set to Casey Bodley

#3 Updated by Casey Bodley over 5 years ago

  • Status changed from New to 7

#4 Updated by Casey Bodley over 5 years ago

  • Backport set to jewel luminous mimic

#5 Updated by Xinying Song over 5 years ago

This problem seems harder to fix than the PR above suggests.
RGWDataSyncSingleEntryCR is spawned in three ways:
1. out-of-band data notifications
2. keys recorded in error_repo
3. normal log records read from the peer zone
Only the third way has the opportunity to update the sync marker. However, because of the bucket lock, an RGWDataSyncSingleEntryCR spawned in the third way will fail with return code -16 (-EBUSY), meaning it failed to acquire the lock. So if we update the sync marker only when the return code is 0, there will be many misreports as well.
The root cause of the misreport is that we don't know whether an out-of-band notify has successfully finished the sync job. I think we should find another way to fix it.
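The dilemma described in this comment can be illustrated with a toy model. Everything in it is hypothetical (`LogEntry`, the two marker policies, and the ground-truth helper are invented for illustration and are not RGW code). It only mirrors the description: a normal-log sync attempt returns -EBUSY when an out-of-band or error_repo sync already holds the bucket lock, and only the normal-log path moves the marker. Ignoring the return code lets the marker run past entries whose concurrent sync never finished, while advancing only on 0 stalls the marker at the first lock conflict even when the concurrent sync succeeded.

```cpp
#include <cerrno>
#include <cstddef>
#include <vector>

// Hypothetical model of one bucket entry read from the peer zone's data log.
struct LogEntry {
  bool lock_held_elsewhere;  // an out-of-band notify or error_repo retry holds the bucket lock
  bool elsewhere_succeeds;   // whether that concurrent sync eventually completes successfully
};

// Return code of the normal-log sync attempt in this model:
// 0 if it got the bucket lock and synced, -EBUSY (-16 on Linux) otherwise.
int sync_from_log(const LogEntry& e) {
  return e.lock_held_elsewhere ? -EBUSY : 0;
}

// Policy A: advance the marker past every entry, ignoring the return code.
std::size_t marker_ignore_retcode(const std::vector<LogEntry>& entries) {
  std::size_t marker = 0;
  for (const LogEntry& e : entries) {
    (void)sync_from_log(e);
    ++marker;
  }
  return marker;
}

// Policy B: advance the marker only while sync attempts return 0.
std::size_t marker_only_on_success(const std::vector<LogEntry>& entries) {
  std::size_t marker = 0;
  for (const LogEntry& e : entries) {
    if (sync_from_log(e) != 0) break;  // the first -EBUSY stalls the marker
    ++marker;
  }
  return marker;
}

// What the marker should be: the longest prefix of entries whose data
// actually reached this zone, by either path. Computing it requires knowing
// the outcome of the concurrent sync, which the normal-log path does not see.
std::size_t marker_ground_truth(const std::vector<LogEntry>& entries) {
  std::size_t marker = 0;
  for (const LogEntry& e : entries) {
    const bool applied = !e.lock_held_elsewhere || e.elsewhere_succeeds;
    if (!applied) break;
    ++marker;
  }
  return marker;
}
```

For example, with the entries `{false, false}, {true, true}, {false, false}, {true, false}`, policy A yields a marker of 4 (reporting "caught up" although the last entry was never applied) and policy B yields 1 (reporting "behind" although the second entry was synced out of band), while the true value is 3.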

#6 Updated by Patrick Donnelly over 4 years ago

  • Status changed from 7 to Fix Under Review
