Bug #58939
open
multisite: Versioned objects left after a deletion on secondary site
Description
While testing a versioned warp benchmark on a Reef multisite cluster, we noticed that the master side was missing object versions that were present on the secondary. The sync had apparently completed, as reported by radosgw-admin sync status and radosgw-admin bucket sync status. The issue does not reproduce reliably, but it usually shows up after a few runs of:
warp versioned --host="${endpoint}" \
--objects 25000 \
--access-key="${access_key}" \
--secret-key="${secret_key}" \
--noclear --concurrent 10 \
--duration 600s --obj.size=1M \
--bucket="${bucket_name}" \
--put-distrib 50 \
--get-distrib 0 \
--stat-distrib 0 \
--delete-distrib 50
Note: in this case the secondary side has object versions that are not present on the primary side. The reverse can also happen, with the primary side holding objects that are not present on the secondary, as reported in #58911.
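One way to confirm which side holds the extra versions is to dump the version listing from each zone and diff the two. The following is a minimal sketch, assuming the AWS CLI is installed and has credentials valid on both zones. The function names (list_versions, only_in_second), the variables primary_endpoint and secondary_endpoint, and the file names are hypothetical and are not part of the original report.

```shell
# Sketch only: compare object versions between two zones.
# Assumes the AWS CLI is available; endpoints and bucket are placeholders.

# Print "key<TAB>version-id" for every object version in a bucket, sorted.
# Delete markers are not included; an empty bucket may print "None".
list_versions() {
    local endpoint="$1" bucket="$2"
    aws s3api list-object-versions \
        --endpoint-url "$endpoint" \
        --bucket "$bucket" \
        --query 'Versions[].[Key,VersionId]' \
        --output text | LC_ALL=C sort
}

# Print lines that appear in the second sorted listing but not in the first.
only_in_second() {
    LC_ALL=C comm -13 "$1" "$2"
}

# Example (not run here):
#   list_versions "$primary_endpoint"   "$bucket_name" > primary.txt
#   list_versions "$secondary_endpoint" "$bucket_name" > secondary.txt
#   only_in_second primary.txt secondary.txt   # versions left on the secondary
```

Swapping the two arguments to only_in_second would show the opposite case, where the primary holds versions the secondary lacks.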
I've been looking at the logs produced by this issue, but I haven't tracked it down yet. From what I have gathered so far, the mechanics of the issue are as follows:
1. Object is created on primary side
2. Object is replicated primary -> secondary during full sync
3. Object is deleted on primary side during full sync
4. Secondary switches to incremental sync, but misses the DELETE from the bilog
At the moment I'm not sure of the mechanism by which the DELETE is missed during replication. The logs indicate that incremental sync starts from the markers assigned during InitBucketFullSyncStatusCR, which to my understanding means the DELETE entry should be picked up, but that doesn't seem to happen. I have also noticed that some bilog shards don't seem to be syncing incrementally, but I haven't yet followed this track in depth.
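To make the marker reasoning concrete, here is a toy model of a single bilog shard. It is not RGW code and does not claim to show the actual cause. It only assumes that an incremental pass replays every entry logged after its start marker (whether RGW treats the marker as exclusive is an assumption here). The log format and the replay_after function are hypothetical.

```shell
# Toy model only, NOT RGW code. A bilog shard is modelled as lines of
# "<marker> <op> <key>", with markers increasing in write order.

# Print "<op> <key>" for every entry whose marker is strictly greater than
# the start marker, i.e. what an incremental pass from that marker replays.
replay_after() {
    local start="$1" log="$2"
    awk -v start="$start" '$1 + 0 > start + 0 { print $2, $3 }' "$log"
}

# Example:
#   printf '1 write obj\n2 del obj\n' > shard.log
#   replay_after 1 shard.log   # marker recorded before the delete: prints "del obj"
#   replay_after 2 shard.log   # marker recorded after the delete: prints nothing
```

In this model the DELETE is only lost if the start marker already sits at or past the delete entry, or if the shard is never replayed at all. Both possibilities are consistent with the observations above, but neither is confirmed.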