Bug #58911 (closed)

multisite: Race condition in replication causes objects that should be deleted to persist

Added by Tom Coldrick about 1 year ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags: multisite backport_processed
Backport: reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A race condition in multisite replication can allow objects that should be deleted to be copied back from another site, resulting in inconsistent state between zones. The final behaviour is that the zone which receives the workload ends up with some objects still present that should have been deleted. I've tested this with (active-active) multisite replication between two zones.

The most reliable method of reproducing this I've found thus far is the following warp command:

warp mixed --host=<endpoint> --access-key=<access key> --secret-key=<secret key> --noclear --objects 250 --put-distrib 50 --delete-distrib 50 --get-distrib 0 --stat-distrib 0 --concurrent 10 --duration 60s

As you can see, this restricts the workload to only PUTs and DELETEs on a single bucket. After running this, and waiting for the bucket sync to finish, I can see that the zone receiving the workload has (on this example run) 384 objects, as opposed to 232 for the zone acting as a secondary. Let's call the zone targeted by the workload A, and its peer B. Looking through the logs for a single object present in A but missing in B, we can see the following events occurred:

1. Object was PUT into A
2. Replication begins A->B
3. Object is replicated A->B by Full Sync
4. Object is deleted in A
5. Replication begins B->A
6. Object is replicated back B->A by Full Sync
7. Delete is replicated A->B by Incremental Sync
8. Delete tries to replicate B->A, but is skipped as A is present in the zone trace

At this point, A contains an object that should be deleted. Of course, the type of workload where this could happen is a little strange, in that creation and deletion must happen in quick succession, but I think this is still a problem. Note that in the above steps, (4) and (5) may be reversed with the same result -- we can still hit the problem even if there's already a full sync going on B->A when the object is deleted, provided the delete happens before we attempt to replicate the object back over.

I've only tested this on the main branch, but I don't think there's any reason the issue would be limited to it.


Related issues (1 total: 0 open, 1 closed)

Copied to rgw - Backport #61630: reef: multisite: Race condition in replication causes objects that should be deleted to persist (Resolved)
Actions #1

Updated by Tom Coldrick 12 months ago

I've spent some time trying to properly understand the mechanics of what's going on here. In particular, I began to wonder why deletion is a problem specifically, and why updates don't suffer the same re-replication issue. It turns out that the reason lies in how write conflicts are resolved -- should A and B both have an update at similar times, then the highest timestamp wins. This works fine for updates (ignoring clock skew), but we can't check the timestamp at which a delete operation happened -- we get the timestamp from the object, which no longer exists. As a result, a later delete is always overwritten by an older put should a conflict arise.
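
To make that concrete, here is a minimal, self-contained sketch of the failure mode (not the actual RGW code; the ObjectState type and should_apply_remote_put() are invented for illustration): a "newest write wins" rule can only be applied when the local object still exists to supply a timestamp, so a replicated PUT that races with a local DELETE sails through.

#include <chrono>
#include <iostream>
#include <optional>

using Clock = std::chrono::system_clock;

// Hypothetical stand-in for what a zone can inspect locally about an object.
// Once the object is deleted, there is nothing left to read an mtime from.
struct ObjectState {
  std::optional<Clock::time_point> mtime;  // empty => object no longer exists
};

// "Newest write wins": only skip the incoming replicated write if the local
// copy is provably newer. A deleted object has no mtime, so the comparison
// can never be made and the stale write goes through.
bool should_apply_remote_put(const ObjectState& local,
                             Clock::time_point remote_mtime) {
  if (local.mtime && *local.mtime >= remote_mtime) {
    return false;  // local copy is newer; drop the replicated write
  }
  // Either the local copy is older, or it was deleted and left no timestamp
  // to compare against -- the older PUT is (wrongly) applied.
  return true;
}

int main() {
  const auto remote_put_time = Clock::now();
  ObjectState updated{remote_put_time + std::chrono::seconds(10)};  // local update is newer
  ObjectState deleted{std::nullopt};                                // local delete is newer, but unrecorded

  std::cout << "older remote PUT applied over newer update? "
            << should_apply_remote_put(updated, remote_put_time) << '\n';  // 0: conflict resolved
  std::cout << "older remote PUT applied over newer delete? "
            << should_apply_remote_put(deleted, remote_put_time) << '\n';  // 1: the delete is lost
}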

Now, conflicts such as this are uncommon -- first, you need a workload with heavy PUT/DELETE traffic, such that the replication and deletion can race to begin with. Secondly, in the setup described above, the race has to occur during full sync -- incremental sync uses the zone trace to skip the re-replications -- but if both sides were accepting traffic, then such a conflict could occur at any time. Thirdly, there already exists a mechanism in the RGW to resolve these conflicts, but it has limitations, which I'll discuss now.

In the RGWRados class, we have an LRU map (the tombstone cache) of recently deleted objects and the times of their deletes, which is checked during writes. If we find a later delete timestamp for our object, then we don't bother doing the write, since the write is older than the delete. This would solve the issue! However, the LRU map is limited (a rough sketch of the check follows the list below):

1. (unless I'm missing something) it's instance-local -- if a cluster has multiple RGWs accepting traffic, then there's no guarantee that we will hit the tombstone cache even if there's been a more recent delete than our write, if the delete went to a different RGW. In the cluster where we found this, we have some RGW instances dedicated to replication only, not serving client requests. These will never hit the tombstone cache in the event of a conflict!

2. the size of the map is finite. This size is tunable already, but it is possible to overload the map and end up in the sticky situation of non-matching zones.
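
For illustration, here is the general shape of that check, modelled as a tiny instance-local LRU cache (the names here are invented; the real one lives in RGWRados as an LRU map, as described above), with comments marking where the two limitations above bite:

#include <chrono>
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>

using Clock = std::chrono::system_clock;

// Invented names; the real cache is the LRU map of recently deleted objects
// in RGWRados, consulted on the write path as described above.
class TombstoneCache {
  size_t max_size;
  std::list<std::string> lru;  // front = most recently recorded delete
  struct Entry {
    Clock::time_point deleted_at;
    std::list<std::string>::iterator pos;
  };
  std::unordered_map<std::string, Entry> entries;

 public:
  explicit TombstoneCache(size_t max) : max_size(max) {}

  void record_delete(const std::string& key, Clock::time_point when) {
    auto it = entries.find(key);
    if (it != entries.end()) {
      lru.erase(it->second.pos);
      entries.erase(it);
    }
    lru.push_front(key);
    entries[key] = Entry{when, lru.begin()};
    if (entries.size() > max_size) {  // limitation 2: the map is finite, so
      entries.erase(lru.back());      // old tombstones silently fall out
      lru.pop_back();
    }
  }

  // Write path: skip the write if we remember a delete at least as new as it.
  bool write_is_stale(const std::string& key, Clock::time_point write_time) const {
    auto it = entries.find(key);
    return it != entries.end() && it->second.deleted_at >= write_time;
  }
};

// Limitation 1: the cache is per radosgw instance. A replication-only gateway
// that never served the client DELETE has no tombstone to find, and will apply
// the stale replicated PUT regardless.
int main() {
  TombstoneCache cache(2);
  const auto t = Clock::now();
  cache.record_delete("bucket1/obj1", t);
  std::cout << cache.write_is_stale("bucket1/obj1", t - std::chrono::seconds(5))
            << '\n';  // 1: the replicated write is older than the delete, skip it
  std::cout << cache.write_is_stale("bucket1/obj2", t) << '\n';  // 0: no tombstone recorded
}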

In short, the issue is that we can't reliably tell when a delete of an object happened, so we will always replicate the object creation, even if the delete is later. There are a couple of ways we could go about resolving this, I think.

1. Use tombstones instead of completely deleting objects
2. Keep a cache of tombstones in RADOS rather than the RGW

Of the two, I think (2) is less work and less of a departure from the current state of affairs, so is probably the best course to investigate first.
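
Purely to illustrate what option (2) might look like (a sketch under assumptions, not a worked design: the shared object name "tombstone_cache", the pool name, and the key/value layout are all invented), delete timestamps could be stored as omap entries on a RADOS object that every gateway consults on the write path, rather than in per-instance memory:

#include <rados/librados.hpp>

#include <iostream>
#include <map>
#include <set>
#include <string>

// Record a tombstone for "bucket/object" as an omap entry on a shared RADOS
// object, with the delete time as the value. Object and pool names invented.
static int record_tombstone(librados::IoCtx& ioctx, const std::string& key,
                            const std::string& delete_time) {
  std::map<std::string, librados::bufferlist> kv;
  kv[key].append(delete_time);
  return ioctx.omap_set("tombstone_cache", kv);
}

// Write path: look the key up; the caller compares timestamps and skips the
// replicated write if the recorded delete is newer than the incoming object.
static bool lookup_tombstone(librados::IoCtx& ioctx, const std::string& key,
                             std::string* delete_time) {
  std::set<std::string> keys = {key};
  std::map<std::string, librados::bufferlist> vals;
  int r = ioctx.omap_get_vals_by_keys("tombstone_cache", keys, &vals);
  auto it = vals.find(key);
  if (r < 0 || it == vals.end()) {
    return false;  // no tombstone recorded (or the lookup failed)
  }
  *delete_time = it->second.to_str();
  return true;
}

int main() {
  librados::Rados cluster;
  cluster.init2("client.admin", "ceph", 0);  // client and cluster names assumed
  cluster.conf_read_file(nullptr);           // default ceph.conf search path
  if (cluster.connect() < 0) {
    std::cerr << "failed to connect to cluster\n";
    return 1;
  }
  librados::IoCtx ioctx;
  if (cluster.ioctx_create("rgw-tombstones", ioctx) < 0) {  // pool name invented
    std::cerr << "failed to open pool\n";
    return 1;
  }

  record_tombstone(ioctx, "bucket1/obj1", "2023-03-02T12:00:00.000Z");
  std::string when;
  if (lookup_tombstone(ioctx, "bucket1/obj1", &when)) {
    std::cout << "bucket1/obj1 was deleted at " << when << "\n";
  }
  return 0;
}

A single shared omap object would obviously be a hot spot under heavy DELETE traffic, so this is only meant to show the direction, not a concrete design.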

I'm happy to work on a fix to this, but don't have the power to assign this to myself!

Actions #2

Updated by Casey Bodley 12 months ago

  • Status changed from New to In Progress
  • Assignee set to Tom Coldrick
  • Tags set to multisite
  • Backport set to reef

Tom, thanks for the analysis and for joining the discussion about alternate solutions. For anyone who missed this discussion, a recording should be available soon under https://www.youtube.com/@Cephstorage/videos

To outline the proposed design:

1) when writing a replicated object, fetch_remote_obj() adds an object attribute RGW_ATTR_OBJ_REPLICATION_TRACE containing the zone trace. this work is partially completed in https://github.com/ceph/ceph/pull/49767 but needs more testing

2) when issuing a GET request to fetch an object, include a custom header like 'x-rgw-destination-zone-trace' with the destination zone's trace string

3) when handling that GetObject request, the source zone will return ERR_NOT_MODIFIED (304 Not Modified) if it finds that string in the local object's RGW_ATTR_OBJ_REPLICATION_TRACE attribute

this should prevent bucket full sync from overwriting a deleted object, because the source zone would already have the destination in its trace
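
A rough sketch of the source-zone check in step 3 (the header name and the 304/ERR_NOT_MODIFIED code come from the outline above; the types and the function itself are simplified stand-ins rather than the actual RGW code):

#include <iostream>
#include <string>
#include <vector>

// Simplified stand-ins; the real code works on the object's
// RGW_ATTR_OBJ_REPLICATION_TRACE xattr and RGW's zone trace type.
constexpr int ERR_NOT_MODIFIED = 304;  // "304 Not Modified", as in the outline

struct ObjectAttrs {
  // Zone trace recorded on the object when it was written by replication
  // (step 1 of the outline: added by fetch_remote_obj()).
  std::vector<std::string> replication_trace;
};

// Step 3 of the outline: when serving the sync GET, the source zone compares
// the destination's trace string (sent via 'x-rgw-destination-zone-trace',
// step 2) against the trace stored on the local object. A match means the
// destination already received this object, so return 304 and let it keep
// (or keep deleted) what it has instead of re-fetching.
int check_replication_trace(const ObjectAttrs& attrs,
                            const std::string& destination_trace) {
  for (const auto& zone : attrs.replication_trace) {
    if (zone == destination_trace) {
      return ERR_NOT_MODIFIED;
    }
  }
  return 0;  // not in the trace: serve the object as normal
}

int main() {
  ObjectAttrs attrs{{"zone-a", "zone-b"}};  // object already traced through both zones
  std::cout << check_replication_trace(attrs, "zone-b") << '\n';  // 304: skip the copy
  std::cout << check_replication_trace(attrs, "zone-c") << '\n';  // 0: fetch normally
}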

Actions #3

Updated by Casey Bodley 12 months ago

Casey Bodley wrote:

1) when writing a replicated object, fetch_remote_obj() adds an object attribute RGW_ATTR_OBJ_REPLICATION_TRACE containing the zone trace. this work is partially completed in https://github.com/ceph/ceph/pull/49767 but needs more testing

https://github.com/ceph/ceph/pull/49767 is ready for review and testing

Actions #4

Updated by Tom Coldrick 11 months ago

We've opened https://github.com/ceph/ceph/pull/51715 to address this, which should be ready for review and testing. I've hammered these changes with the workload above for several hours without hitting the issue (without the fix, I can reliably reproduce within a few runs).

Actions #5

Updated by Casey Bodley 11 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 51715

reef backports will need to include the commits from https://github.com/ceph/ceph/pull/49767

Actions #6

Updated by Casey Bodley 11 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Backport Bot 11 months ago

  • Copied to Backport #61630: reef: multisite: Race condition in replication causes objects that should be deleted to persist added
Actions #8

Updated by Backport Bot 11 months ago

  • Tags changed from multisite to multisite backport_processed
Actions #9

Updated by Mark Kogan 10 months ago

  • Status changed from Pending Backport to Resolved