Feature #64436

open

rgw: add remaining x-amz-replication-status options

Added by Alex Wojno 3 months ago. Updated 8 days ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
rgw
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

The "REPLICA" option for the x-amz-replication-status feature has been implemented as a part of another tracker (https://tracker.ceph.com/issues/58565). This tracker is to implement the remaining options of "PENDING", "FAILED", and "COMPLETED", which are the header options for the original object. The requirements from as can be found here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication-status.html#replication-status-overview.


Related issues (1 open, 0 closed)

Related to rgw - Feature #20496: Support x-amz-replication-status for multisite (New)

Actions #1

Updated by Casey Bodley 3 months ago

  • Assignee set to Alex Wojno
Actions #2

Updated by Casey Bodley 3 months ago

when we evaluated this feature in the past, i wrote up this design for it:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication-status.html

The replication status of a replica will return REPLICA.

this is easy enough, objects written by sync can set
x-amz-replication-status=REPLICA

The replication status of a source object will return either PENDING, COMPLETED, or FAILED.

when an object is initially written, add
x-amz-replication-status=PENDING if the object name matches a
replication policy on its bucket, or if bucket sync is enabled by
default
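
a minimal sketch of that write-path check (the helper and bucket fields below are
illustrative placeholders, not actual RGW code):

# Illustrative pseudocode: decide the initial replication status at write time.
# bucket.sync_enabled and matches_replication_rule() are assumed helpers, not RGW APIs.
def initial_replication_status(bucket, object_name):
    if bucket.sync_enabled or matches_replication_rule(bucket.replication_policy, object_name):
        return "PENDING"  # surfaced later as x-amz-replication-status: PENDING
    return None           # replication does not apply; omit the header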

the source zone tracks these object writes/deletes in its bucket index
log (bilog). destination zones read through those bilog entries and
fetch the latest copy of each object. these destination zones store
'bucket sync status' objects that track their current position in the
source zone's log

as part of its background bilog trimming process, the source zone
regularly polls the destination zones for this bucket sync status.
bilog entries are eligible for trimming once all destination zones
report a more recent log position

so before trimming eligible bilog entries, the source zone could read
through those entries and try to overwrite each corresponding head
object with x-amz-replication-status=COMPLETED

this is inherently racy, because the source zone may have overwritten
that object several times before log trimming sees it. so this
overwrite must be conditional on RGW_ATTR_ID_TAG - we can use
cmpxattr() to assert that it still matches the rgw_bi_log_entry::tag
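
a rough sketch of that trim-time pass (the helper names here are invented for
illustration; the real version would be librados/cls operations driven from the
trim logic, not Python):

# Illustrative pseudocode: before trimming, try to mark source objects COMPLETED.
# read_bilog_entries(), head_object(), conditional_set_xattr() and trim_bilog_entries()
# are assumed helpers, not existing RGW functions.
def mark_completed_then_trim(bucket_shard, trim_end_marker):
    for entry in read_bilog_entries(bucket_shard, end_marker=trim_end_marker):
        head = head_object(entry.object_name)
        if head is None or head.replication_status != "PENDING":
            continue
        # Conditional write (cmpxattr-style): only apply the new status if the head
        # object's RGW_ATTR_ID_TAG still equals the bilog entry's tag, i.e. the
        # object has not been overwritten since this log entry was recorded.
        conditional_set_xattr(
            entry.object_name,
            expect_id_tag=entry.tag,
            attr_name="x-amz-replication-status",
            attr_value="COMPLETED",
        )
    trim_bilog_entries(bucket_shard, end_marker=trim_end_marker)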

this algorithm would have linear complexity with respect to the number
of bilog entries, which could be very large. that makes it difficult
to schedule this with trimming work. we had a lot of trouble keeping
up with bilog trimming before we switched to
cls_cxx_map_remove_range(), and we should be careful not to compromise
that

i don't think this model can support a FAILED status. if replication
fails with 404 Not Found, that implies there's no source object to set
replication-status on. if replication fails with any other error, the
destination zone just won't advance its sync status position until a
retry succeeds

in terms of consistency, this log trimming could lag behind the actual
replication and leave the replication-status in PENDING. if there are
a lot of buckets in the system, it might be hours or even days before
a given bucket gets visited by trim

i wasn't satisfied with this design because it complicates log trimming (which has always been a sensitive subject) and doesn't guarantee timely updates to the source object's replication status

Actions #3

Updated by Matt Benjamin 3 months ago

well, not clear on linear complexity? this isn't search, it's queuing; strictly, we need to handle everything that comes in, in the order it comes in, and everything that comes in (unless there was a local overwrite, as you state) turns into an attribute update; note that our original problems keeping up with trimming had to do with the fact that OMAP model turns linear (and maybe I mean amortized constant?) time complexity into logarithmic--and it still does, now that I recall, until we can land bucket-index OMAP offload. so particularly in light of that, this is a rather compelling argument. (it might be surmountable by completing the offload work, though?)

I think the argument from the requesting folks is, it would be providing polling avoidance for applications that need to perform further work once their replications are stabilized. it's ideal for applications to pay for what they use (most applications don't care when their objects replicated), but it's problematic that the ones which do would be reduced to polling endpoints, and that has all the race problems you mentioned. going back to 2017/8(?) or something Brett, Yuval, and I worked out a model where we let applications subscribe to some kind of completion; that begs some questions too, and I didn't really want to implement it either, but it ameliorates some of the problems above, maybe

Actions #4

Updated by Casey Bodley 3 months ago

Matt Benjamin wrote:

well, not clear on linear complexity? this isn't search, it's queuing; strictly, we need to handle everything that comes in, in the order it comes in, and everything that comes in (unless there was a local overwrite, as you state) turns into an attribute update; note that our original problems keeping up with trimming had to do with the fact that OMAP model turns linear (and maybe I mean amortized constant?) time complexity into logarithmic--and it still does, now that I recall, until we can land bucket-index OMAP offload. so particularly in light of that, this is a rather compelling argument. (it might be surmountable by completing the offload work, though?)

sorry, the part about linear complexity in log trimming was a comparison with our current use of cls_cxx_map_remove_range() which deletes a range of keys in a single osd write operation. in that sense, it has constant-time complexity

with this design for replication status, we'd first have to list all the keys in that range, read each corresponding head object to see if its status changed, and possibly write an xattr update

one of the fundamental issues we've had with omap trimming has been that deletes are more expensive than the writes, and this adds a ton of extra work to the deletes for bilog trim
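
to make that cost difference concrete, a sketch of the two trim shapes per bucket
index shard (function names are invented for illustration):

# Illustrative comparison only; helper names are invented.
def trim_today(shard, start_marker, end_marker):
    # one OSD write op removes the whole key range, regardless of how many
    # entries it contains (the cls_cxx_map_remove_range() behaviour)
    remove_omap_key_range(shard, start_marker, end_marker)

def trim_with_status_updates(shard, start_marker, end_marker):
    # listing the range, reading each head object, and possibly writing an
    # xattr back makes the trim work linear in the number of bilog entries
    for key in list_omap_keys(shard, start_marker, end_marker):
        maybe_update_replication_status(key)
    remove_omap_key_range(shard, start_marker, end_marker)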

Actions #5

Updated by Casey Bodley 3 months ago

Alex and i had a good discussion in slack DMs, sharing here:

2:23 PM Alex: Initial thoughts were to use the bilog of the destination to determine that the object has been replicated from source
2:23 PM Alex: Would very much appreciate any design notes you'd be willing to share
2:26 PM Alex: Not too clear on the details of how bilog reading from the destination would be affected given the markers. Theoretically, we need some bidirectional communication to achieve this and leveraging the bilog seems like the obvious candidate
2:41 PM Casey: that could work for bidirectional configurations, but we support unidirectional ones too. in that case, the destination zone doesn't write replication log entries, and the source zone doesn't try to read/process replication logs from the destination
2:43 PM Casey: also have to think about zonegroups with more than two zones. if a given object needs to replicate to two or more other zones, you'd have to wait for all of them to complete before updating the status
2:54 PM Alex: For unidirectional configurations, we would need to enable bilog writing for the destination zone and have the source read, but not apply, those bilog changes (which it can determine from the flow configuration, as that is zonegroup-level information I believe). This is an unfortunate side effect. For multiple zones, we can store the destination zones we have replicated to as a list attribute on the original object. On each update of that list, we can check it against all the destinations we expect the object to be replicated to (which can be derived from global information) and, once they all match, update the status to COMPLETED.
2:56 PM Casey: yeah
2:56 PM Alex: I saw the tracker update and I see how log trimming could potentially be an issue there
2:59 PM Alex: I think the timeliness of reporting is less important given that if clients want to be alerted once replication happens, they can subscribe to notifications
2:59 PM Casey: your suggestion to feed this through the bilog and existing bucket sync mechanism would effectively multiply the total sync workload by N=replica count though
3:00 PM Casey: not counting object data transferred, at least
3:01 PM Casey: "the timeliness of reporting is less important" see, this is why i'd like to hear more about your intended use. to me it just doesn't sound useful if it's not timely
3:12 PM Alex: Wouldn't increasing the overall workload by N=replica count be a necessary condition of this feature, given that the destinations would need to transmit replication success information back to the source somehow, regardless of mechanism? If the issue is moving that workload to a mechanism other than the existing bucket sync in order to not overload sync, then that could be an option I could brainstorm on. For "the timeliness of reporting is less important", I meant that if the status was updated at a similar rate as replication, that seems acceptable. So if replication to the destination took 1 minute from upload, the status would be updated within 2 minutes of the source upload.
3:18 PM Casey: ok, thanks
3:18 PM Casey: i assume you just want your users to be able to scan their objects and be sure that they've replicated?
3:19 PM Casey: tracking this at a per-object level just has a really high cost, so i've been looking for better ways to make this observable
3:28 PM Casey: we've talked about some kind of bucket-level request that would report on its replication status. for example, an 'oldest change not applied' timestamp. that would be a coarse-grained view that doesn't require per-object tracking. the user could then look at object timestamps to infer whether or not they replicated yet
3:31 PM Alex: Agreed that per-object is costly, but moving up to a bucket-level abstraction for replication status seems inconsistent in an active-active setup. If that did work, however, in the GET response the timestamp can be compared against the bucket replication info to add the COMPLETED header
3:32 PM Casey: right, if rgw had efficient access to the per-bucket or per-bucket-shard status it needs, it could make that inference itself to return the appropriate replication-status
3:35 PM Casey: for example, the timestamp of the oldest untrimmed entry in our local bilog shard
3:36 PM Casey: but that only advances at the speed of bilog trimming
3:36 PM Casey: so consistency isn't good, but it only costs us one local rados read op per head/get request
3:46 PM Alex: hmm yeah, bilog trimming pace might be too slow but I get the concept we are going for. I'll have to sleep on this one a bit

Actions #6

Updated by Alex Wojno 2 months ago

The pace of bilog trimming seems acceptable, given that the value of this header is to provide persistent information about replication status (rather than ephemeral information like notifications on replication) for bookkeeping purposes; clients currently have no visibility into replication. The logic would follow something like this crude pseudocode:

# On a GET or HEAD request to object o:

if o.replication_status == "PENDING":
    # get_earliest_marker_time() would look at the min_marker across
    # all bucket shards and get the time. To make this more efficient,
    # this time could probably be stored as an attr somewhere during log trim?
    replicated_time = get_earliest_marker_time(bucket)

    if replicated_time > o.mtime:
        o.replication_status = "COMPLETED"
    else:
        # Do nothing: the bilog entries for this object have not been trimmed yet,
        # so replication may still be in progress.
        pass

Actions #7

Updated by Casey Bodley 23 days ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56460
Actions #8

Updated by Casey Bodley 14 days ago

  • Related to Feature #20496: Support x-amz-replication-status for multisite added
Actions #9

Updated by Casey Bodley 8 days ago

  • Pull request ID changed from 56460 to 57060