Bug #63799
open
multisite: lc expiration action on versioned buckets generates delete-marker with different version ids on different zones
0%
Description
In a multisite setup, lifecycle (lc) on each zone generates a delete marker with its own version id if the lc process runs before the delete marker is replicated.
This causes problems if either zone later deletes its delete marker. When another zone tries to replicate that deletion, it fails to find that version id and so leaves its own delete marker intact. At this point, the zones can respond differently to GET requests for the object name. And if the source zone goes on to delete its now-empty bucket, the other zones end up orphaning the corresponding rados object.
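The sequence above can be sketched with a toy model. This is not Ceph code: the `Zone` class, its method names, and the status codes are hypothetical stand-ins, and the only behaviour borrowed from this ticket is that linking a delete marker is refused with -ENOENT when the current version is already a delete marker.

```python
import errno
import uuid


class Zone:
    """Toy model of one zone's view of a single versioned object (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.versions = []  # oldest first; entries are (version_id, is_delete_marker)

    def current_is_delete_marker(self):
        return bool(self.versions) and self.versions[-1][1]

    def link_delete_marker(self, version_id):
        # Assumption taken from the ticket: a second delete marker is refused.
        if self.current_is_delete_marker():
            return -errno.ENOENT
        self.versions.append((version_id, True))
        return 0

    def lc_expire(self):
        # Lifecycle runs locally, so each zone generates its own version id.
        version_id = uuid.uuid4().hex
        return version_id, self.link_delete_marker(version_id)

    def remove_version(self, version_id):
        for i, (vid, _) in enumerate(self.versions):
            if vid == version_id:
                del self.versions[i]
                return 0
        return -errno.ENOENT

    def get(self):
        # 404 if there is nothing, or if the current version is a delete marker.
        if not self.versions or self.current_is_delete_marker():
            return 404
        return 200


def simulate_divergence():
    a, b = Zone("a"), Zone("b")
    for zone in (a, b):
        zone.versions.append(("v1", False))  # object version already in sync

    dm_a, _ = a.lc_expire()
    dm_b, _ = b.lc_expire()  # runs before a's delete marker is replicated

    ret_create = b.link_delete_marker(dm_a)  # replicating a's marker to b
    a.remove_version(dm_a)                   # zone a deletes its own marker
    ret_delete = b.remove_version(dm_a)      # replaying that deletion on b
    return dm_a != dm_b, ret_create, ret_delete, a.get(), b.get()
```

Under these assumptions, `simulate_divergence()` returns `(True, -2, -2, 200, 404)`: the two markers have different ids, both the replicated create and the replicated delete fail with -ENOENT on zone b, and the two zones answer the same GET differently.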
Updated by Matt Benjamin 5 months ago
I think this issue is at least partially related to one being worked on by Kalpesh and Shilpa. It makes sense to discuss it in the refactoring meeting; I'll alert them.
Updated by Jane Zhu 5 months ago
Some discussion from a PR https://github.com/ceph/ceph/pull/54759#discussion_r1424894258
smanjara: in this scenario, I'd have expected two delete marker versions on either zones, one from its own delete op, and the second syncing from the other zone. but I don't think we allow multiple delete markers for an object.
smanjara: the second delete marker creation will fail because we return an -ENOENT here in rgw.bucket_link_olh() if we already have a delete marker in https://github.com/ceph/ceph/blob/main/src/cls/rgw/cls_rgw.cc#L1695-L1702
jzhu116-bloomberg: Yes, this is exactly what I observed from my testing. The replication failed to create the delete-marker with the following error:
2023-12-13T01:57:17.297-0500 7f393fa49700 0 rgw async rados processor: ERROR: bucket shard callback failed. obj=file_4k[WFpr4YYECQ6z9qtSFkOdbS6v.1Cjg6.]. ret=(2) No such file or directory
Updated by Shilpa MJ 5 months ago
Hi Jane,
Following up on our conversation, this block (https://github.com/ceph/ceph/blob/main/src/cls/rgw/cls_rgw.cc#L1695-L1702) to prevent rgw from creating multiple delete markers was introduced as a fix to an LC expiration issue as described in https://tracker.ceph.com/issues/51249.
But it was more of an LC bug than a delete marker one, and changing the delete marker behaviour was unnecessary. There is more conversation about this in https://github.com/ceph/ceph/pull/45754.
For this PR about multisite in particular, if we allow multiple delete markers to exist, then we could let the zones take care of syncing their creation and deletion without needing to add any special handling, because the same versions would be maintained on both zones. So, I'm proposing that we revert the changes made in https://github.com/ceph/ceph/pull/41897, test scenarios involving multiple delete markers, and see how they behave with LC and multisite in the picture.
Would you be interested in helping with testing this?
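A minimal sketch of the idea behind this proposal, for illustration only. It is not Ceph code and does not confirm the outcome of the proposed testing: the dict-per-zone model, the ids `dm-a` and `dm-b`, and the function name are hypothetical, and version ordering is ignored. It only shows why, if a zone accepts a second delete marker, every marker id exists on both zones and a later deletion can be replayed.

```python
def simulate_with_multiple_markers():
    # Hypothetical model: zone name -> {version_id: is_delete_marker}.
    zones = {"a": {"v1": False}, "b": {"v1": False}}

    # Each zone's lifecycle run creates its own delete marker.
    zones["a"]["dm-a"] = True
    zones["b"]["dm-b"] = True

    # Replication of the creates: assumed to succeed because multiple
    # delete markers are allowed in this model.
    zones["b"].setdefault("dm-a", True)
    zones["a"].setdefault("dm-b", True)
    in_sync_after_create = zones["a"] == zones["b"]

    # Zone a deletes its marker; the deletion is replayed on zone b.
    del zones["a"]["dm-a"]
    found_on_b = zones["b"].pop("dm-a", None) is not None

    return in_sync_after_create, found_on_b, zones["a"] == zones["b"]
```

In this model the replayed deletion finds its target on zone b and both zones end with the same set of versions. Whether real LC and multisite sync behave this way is what the proposed testing would need to establish.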
Updated by Jane Zhu 5 months ago
Shilpa MJ wrote:
Hi Jane,
Following up on our conversation, this block (https://github.com/ceph/ceph/blob/main/src/cls/rgw/cls_rgw.cc#L1695-L1702) to prevent rgw from creating multiple delete markers was introduced as a fix to an LC expiration issue as described in https://tracker.ceph.com/issues/51249.
But it was more of an LC bug than a delete marker one, and changing the delete marker behaviour was unnecessary. There is more conversation about this in https://github.com/ceph/ceph/pull/45754.
For this PR about multisite in particular, if we allow multiple delete markers to exist, then we could let the zones take care of syncing their creation and deletion without needing to add any special handling, because the same versions would be maintained on both zones. So, I'm proposing that we revert the changes made in https://github.com/ceph/ceph/pull/41897, test scenarios involving multiple delete markers, and see how they behave with LC and multisite in the picture.
Would you be interested in helping with testing this?
Sure thing. I can test this.
Just to clarify: the plan is that if this works well with both lc and multisite, we will go this route instead of overwriting the delete marker, during multisite replication, with the one that has the later timestamp, as we talked about in the refactoring meeting, right?
Thanks! Yes, that's right. If this works, then we don't need any of the changes we talked about in the refactoring meeting.