Bug #58911
multisite: Race condition in replication causes objects that should be deleted to persist
Status: Closed
Description
A race condition in multisite replication can allow objects that should have been deleted to be copied back from another site, leaving the zones in an inconsistent state. The end result is that the zone receiving the workload is left with some objects that should have been deleted. I've tested this with (active-active) multisite replication between two zones.
The most reliable method of reproducing this I've found so far is the following warp command:
warp mixed --host=<endpoint> --access-key=<access key> --secret-key=<secret key> --noclear --objects 250 --put-distrib 50 --delete-distrib 50 --get-distrib 0 --stat-distrib 0 --concurrent 10 --duration 60s
As you can see, this restricts the workload to only PUTs and DELETEs on a single bucket. After running this and waiting for the bucket sync to finish, the zone receiving the workload has (on this example run) 384 objects, versus 232 in the zone acting as a secondary. Call the zone targeted by the workload A and its peer B. Tracing the logs for a single object present in A but missing from B, the following sequence of events occurred:
1. Object was PUT into A
2. Replication begins A->B
3. Object is replicated A->B by Full Sync
4. Object is deleted in A
5. Replication begins B->A
6. Object is replicated back B->A by Full Sync
7. Delete is replicated A->B by Incremental Sync
8. Delete tries to replicate B->A, but is skipped as A is present in the zone trace
At this point, A contains an object that should have been deleted. Admittedly, the kind of workload where this can happen is a little unusual, in that creation and deletion must happen in quick succession, but I think it's still a real problem. Note that steps (4) and (5) may be swapped with the same result: we can still hit the problem if a full sync B->A is already in progress when the object is deleted, provided the delete happens before we attempt to replicate the object back over.
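The sequence above can be reproduced in a small toy model. This is purely illustrative, not RGW's actual sync code: the `Zone` class, the `full_sync`/`incremental_sync` functions, and the set-based zone trace are my own simplifications of the behaviour described in the steps, where a change is skipped at any zone already present in its trace.

```python
# Toy model of the race: full sync copies live objects without any
# trace, while incremental sync replays logged ops with trace-based
# loop prevention. (Illustrative only; not the real RGW data model.)

class Zone:
    def __init__(self, name):
        self.name = name
        self.objects = {}   # key -> value (live objects)
        self.log = []       # (op, key, value, zone_trace)

    def put(self, key, value):
        self.objects[key] = value
        self.log.append(("put", key, value, {self.name}))

    def delete(self, key):
        self.objects.pop(key, None)
        self.log.append(("delete", key, None, {self.name}))

def full_sync(src, dst):
    # Copies everything currently live in src; a delete that already
    # happened is invisible to a full sync.
    for key, value in src.objects.items():
        dst.objects[key] = value

def incremental_sync(src, dst):
    # Replay src's log at dst, skipping any entry whose trace already
    # contains dst (loop prevention), and extending the trace.
    for op, key, value, trace in src.log:
        if dst.name in trace:
            continue
        if op == "put":
            dst.objects[key] = value
        else:
            dst.objects.pop(key, None)
        dst.log.append((op, key, value, trace | {dst.name}))

a, b = Zone("A"), Zone("B")
a.put("obj", "data")        # 1. object PUT into A
full_sync(a, b)             # 2-3. full sync A->B copies it over
a.delete("obj")             # 4. object deleted in A
full_sync(b, a)             # 5-6. full sync B->A copies it back
incremental_sync(a, b)      # 7. delete replicated A->B
incremental_sync(b, a)      # 8. delete skipped B->A: A is in the trace
# A retains "obj" even though it was deleted; B does not have it.
```

The key asymmetry is that the full sync in steps 5-6 carries no trace information about the delete, while the loop-prevention check in step 8 stops the delete from ever reaching the resurrected copy in A.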
I've only tested this on the main branch, but I don't see any reason it would be limited to it.