Bug #58673
closed
When bucket index ops are cancelled it can leave behind zombie index entries
100%
Description
We discovered that there were a significant number of extra bucket index entries for some of our buckets and found that these entries all pointed to objects which no longer existed. In our case, we traced this back to a scenario where a particular client commonly issues multiple simultaneous delete requests for the same object keys. The first racing delete request succeeds, but the second one results in an ECANCELED error due to a failed cmpxattr check [1] set by a prepare_atomic_modification call [2]. The ECANCELED error causes the index op to be canceled [3], but the OSD cls logic for index op cancellation doesn't remove the index entry, so the zombie index entry is never cleaned up. It looks like this could possibly manifest itself in other scenarios as well, whenever an index op is canceled for an index entry that otherwise shouldn't exist and has no other pending modifications.
[1] https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_rados.cc#L5833
[2] https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_rados.cc#L5254
[3] https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_rados.cc#L5293
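To make the sequence above easier to follow, here is a toy Python model of the race. It is only a sketch and not the actual RGW or cls_rgw code: `ToyBucketIndex`, `delete_object`, `racing_deletes`, and the `remove_empty_on_cancel` flag are all hypothetical names invented for illustration, and the index is reduced to a dict of entries with an `exists` flag and a set of pending op tags. The flag shows one plausible cleanup on cancel, which is not necessarily what the real fix does.

```python
import errno


class ToyBucketIndex:
    """Toy in-memory stand-in for a bucket index shard (hypothetical, not cls_rgw)."""

    def __init__(self, remove_empty_on_cancel):
        # key -> {"exists": bool, "pending": set of op tags}
        self.entries = {}
        self.remove_empty_on_cancel = remove_empty_on_cancel

    def prepare(self, key, tag):
        # Record a pending op, creating the entry if it is not there yet.
        entry = self.entries.setdefault(key, {"exists": False, "pending": set()})
        entry["pending"].add(tag)

    def complete_delete(self, key, tag):
        # The delete succeeded: mark the object gone and drop the entry
        # once no other op is still pending on it.
        entry = self.entries[key]
        entry["pending"].discard(tag)
        entry["exists"] = False
        if not entry["pending"]:
            del self.entries[key]

    def cancel(self, key, tag):
        # The op failed: drop the pending tag. Without the extra cleanup,
        # an entry with exists=False and no pending ops is left behind.
        entry = self.entries[key]
        entry["pending"].discard(tag)
        if (self.remove_empty_on_cancel
                and not entry["exists"] and not entry["pending"]):
            del self.entries[key]


def delete_object(objects, key, expected_tag):
    # Stand-in for the guarded delete: fail with ECANCELED when the
    # object's tag no longer matches what the caller observed earlier.
    if objects.get(key) != expected_tag:
        return -errno.ECANCELED
    del objects[key]
    return 0


def racing_deletes(remove_empty_on_cancel):
    objects = {"obj": "tag-v1"}
    index = ToyBucketIndex(remove_empty_on_cancel)
    index.entries["obj"] = {"exists": True, "pending": set()}

    # Both requests read the same object state and prepare their index op
    # before either one performs the delete.
    seen = {"op-a": objects["obj"], "op-b": objects["obj"]}
    for tag in seen:
        index.prepare("obj", tag)

    for tag, expected in seen.items():
        if delete_object(objects, "obj", expected) == 0:
            index.complete_delete("obj", tag)
        else:
            index.cancel("obj", tag)

    # Keys still present in the index after both requests finish.
    return sorted(index.entries)


if __name__ == "__main__":
    print("without cleanup on cancel:", racing_deletes(False))
    print("with cleanup on cancel:   ", racing_deletes(True))
```

In this model the object is gone after the first delete, yet with `remove_empty_on_cancel=False` the key `obj` is still listed in the index after the second request is cancelled, which matches the zombie entries described above.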
Updated by Casey Bodley about 1 year ago
- Status changed from New to Fix Under Review
- Tags set to cls_rgw
Updated by J. Eric Ivancich about 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 1 year ago
- Copied to Backport #58767: pacific: When bucket index ops are cancelled it can leave behind zombie index entries added
Updated by Backport Bot about 1 year ago
- Copied to Backport #58768: quincy: When bucket index ops are cancelled it can leave behind zombie index entries added
Updated by Backport Bot about 1 year ago
- Tags changed from cls_rgw to cls_rgw backport_processed
Updated by Cory Snyder 11 months ago
- Related to Bug #59164: LC rules cause latency spikes added
Updated by Konstantin Shalygin 9 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100