Bug #61359
Consistency bugs with OLH objects (closed)
100%
Description
When a PUT request is waiting on a reshard, it does not properly update its bucket reference after the reshard completes, and it fails after storing the object instance but before linking it into the bucket index. The object is therefore stored on disk and accounted for in the bucket stats, but is not visible in bucket listings. The request also initializes the OLH RADOS object but never adds the user.rgw.olh.info xattr (which informs the is_olh() predicate). As a result, future GET requests for that key return a 200 with an empty object, because the OLH is treated as a plain unversioned object. This can wreak havoc on clients that use well-known keys to store formatted data and fail to parse an unexpectedly empty object.
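The failure sequence described above can be illustrated with a small toy model. This is only a sketch and not RGW code: ToyBucket, RadosObject, the instance naming scheme, and the fail_before_link flag are hypothetical names made up for illustration. Only the xattr name and the idea of an is_olh() predicate keyed on it come from the description.

```python
from dataclasses import dataclass, field

OLH_INFO_XATTR = "user.rgw.olh.info"  # xattr name taken from the description


@dataclass
class RadosObject:
    data: bytes = b""
    xattrs: dict = field(default_factory=dict)


def is_olh(obj):
    """Toy stand-in for is_olh(): true only if the OLH info xattr is present."""
    return OLH_INFO_XATTR in obj.xattrs


class ToyBucket:
    """Hypothetical in-memory model of a versioned bucket (not the RGW design)."""

    def __init__(self):
        self.pool = {}             # RADOS object name -> RadosObject
        self.index = {}            # key -> linked instance object name
        self.stats_num_objects = 0

    def put_versioned(self, key, version_id, data, fail_before_link=False):
        instance = f"{key}__{version_id}"           # hypothetical naming scheme
        self.pool[instance] = RadosObject(data=data)  # instance stored on disk
        self.stats_num_objects += 1                 # accounted for in stats
        self.pool.setdefault(key, RadosObject())    # OLH object initialized
        if fail_before_link:
            # Models the PUT failing after the instance write, before linking.
            raise FileNotFoundError("stale bucket reference after reshard")
        self.pool[key].xattrs[OLH_INFO_XATTR] = instance.encode()
        self.index[key] = instance

    def get(self, key):
        head = self.pool[key]
        if is_olh(head):
            target = head.xattrs[OLH_INFO_XATTR].decode()
            return 200, self.pool[target].data
        # Without the xattr the head is served as a plain unversioned object.
        return 200, head.data

    def list_keys(self):
        return sorted(self.index)


bucket = ToyBucket()
try:
    bucket.put_versioned("config.json", "v1", b'{"a": 1}', fail_before_link=True)
except FileNotFoundError:
    pass

print(bucket.list_keys())          # key is missing from the listing
print(bucket.stats_num_objects)    # but it is counted in the stats
print(bucket.get("config.json"))   # and GET returns 200 with an empty body
```

In this model the instance object still exists in the pool after the failed PUT, which corresponds to the dark data mentioned in the report.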
This was fixed on master and in Reef as part of the multi-site changes [1], but we could use a test case to ensure there are no future regressions on those branches. We need backports of [1] for Quincy and Pacific.
There is also a need for index cleanup tooling since buckets affected by this issue have inconsistent stats, inconsistent OLH RADOS objects, and dark data instance objects.
[1] https://github.com/ceph/ceph/commit/f57973725feeaa84321884c8eebc048989446572
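As a rough illustration of what the cleanup tooling mentioned above would need to detect, the sketch below classifies two of the inconsistencies from the description: instance objects that are not linked in the bucket index, and OLH objects that lack the user.rgw.olh.info xattr. The function name, its arguments, and the input shapes are all hypothetical; this is not how any radosgw-admin command works.

```python
OLH_INFO_XATTR = "user.rgw.olh.info"  # xattr name taken from the description


def find_inconsistencies(heads, instances, index):
    """Hypothetical consistency check over pre-collected bucket state.

    heads:     dict of key -> set of xattr names on the head (OLH) object
    instances: dict of instance object name -> key it belongs to
    index:     dict of key -> set of instance object names linked in the index
    """
    linked = set()
    for names in index.values():
        linked |= set(names)

    # Instance objects present in the pool but absent from the index.
    unlinked = sorted(name for name in instances if name not in linked)

    # Head objects for keys that have instances but no OLH info xattr.
    keys_with_instances = set(instances.values())
    broken_olh = sorted(
        key
        for key, xattrs in heads.items()
        if key in keys_with_instances and OLH_INFO_XATTR not in xattrs
    )
    return {"unlinked_instances": unlinked, "broken_olh": broken_olh}


report = find_inconsistencies(
    heads={"good.json": {OLH_INFO_XATTR}, "bad.json": set()},
    instances={"good.json__v1": "good.json", "bad.json__v1": "bad.json"},
    index={"good.json": {"good.json__v1"}},
)
print(report)
```

A real tool would also have to reconcile the bucket stats, which this sketch does not attempt.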
Updated by Cory Snyder 12 months ago
- Affected Versions v16.0.0, v16.0.1, v16.1.0, v16.1.1, v16.2.0, v16.2.1, v16.2.10, v16.2.11, v16.2.12, v16.2.13, v16.2.2, v16.2.3, v16.2.4, v16.2.5, v16.2.6, v16.2.7, v16.2.8, v16.2.9, v17.0.0, v17.2.1, v17.2.2, v17.2.3, v17.2.4, v17.2.5 added
Updated by Casey Bodley 12 months ago
- Related to Bug #50552: rgw: set_olh return -2 when resharding added
Updated by Cory Snyder 12 months ago
- Related to Bug #59663: rgw: expired delete markers created by deleting non-existant object multiple times are not being removed from data pool after deletion from bucket added
Updated by Cory Snyder 12 months ago
- Related to Bug #59164: LC rules cause latency spikes added
Updated by Casey Bodley 12 months ago
- Status changed from New to Fix Under Review
- Backport changed from pacific quincy to pacific quincy reef
tagged for reef since we'll at least want the recovery command there
Updated by Cory Snyder 11 months ago
With further investigation, I found that the previously referenced commit [1] was not responsible for fixing this scenario on main/reef. In fact, that commit was actually resolving a different sort of PUT 404 scenario that did not affect earlier releases.
The actual reason that this issue isn't observed on main/reef is [2]. Because the bucket instance ID no longer changes during resharding, there is no bucket instance metadata object to remove. The ENOENT error was caused by an attempt to retrieve the bucket instance metadata object associated with the old bucket instance.
[1] https://github.com/ceph/ceph/commit/f57973725feeaa84321884c8eebc048989446572
[2] https://github.com/ceph/ceph/pull/39002/commits/7348a8397af99752fd64ce0a44a95a405c6b9e3e
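The difference described in this update can be sketched with a toy metadata store. The class, method names, and instance ID strings below are hypothetical and only model the behavior stated above: a request holding the old bucket instance ID hits ENOENT when a reshard replaces the instance metadata object, and does not when the instance ID is kept.

```python
import errno


class ToyMetadataStore:
    """Hypothetical model of bucket instance metadata, keyed by instance ID."""

    def __init__(self):
        self.instances = {"bucket:inst-1": {"num_shards": 11}}
        self.current = "bucket:inst-1"

    def reshard(self, num_shards, new_instance_id=None):
        if new_instance_id is None:
            # Behavior attributed to [2]: the instance ID is kept, so no
            # metadata object is removed.
            self.instances[self.current]["num_shards"] = num_shards
            return
        # Older behavior: a new instance is created and the old one removed.
        self.instances[new_instance_id] = {"num_shards": num_shards}
        del self.instances[self.current]
        self.current = new_instance_id

    def get_instance_info(self, instance_id):
        try:
            return self.instances[instance_id]
        except KeyError:
            raise OSError(
                errno.ENOENT, "no bucket instance metadata", instance_id
            ) from None


# An in-flight PUT captured the instance ID before the reshard.
old_style = ToyMetadataStore()
held_id = old_style.current
old_style.reshard(23, new_instance_id="bucket:inst-2")
try:
    old_style.get_instance_info(held_id)
except OSError as e:
    print("old-style reshard:", errno.errorcode[e.errno])

new_style = ToyMetadataStore()
held_id = new_style.current
new_style.reshard(23)
print("same-ID reshard:", new_style.get_instance_info(held_id))
```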
Updated by Cory Snyder 11 months ago
- Subject changed from "PUT requests during reshard of versioned bucket fail with 404 and leave behind dark data" to "Consistency bugs with OLH objects"
Updated by Cory Snyder 11 months ago
- Related to Bug #61710: quincy/pacific: PUT requests during reshard of versioned bucket fail with 404 and leave behind dark data added
Updated by Casey Bodley 10 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 10 months ago
- Copied to Backport #62064: pacific: Consistency bugs with OLH objects added
Updated by Backport Bot 10 months ago
- Copied to Backport #62065: reef: Consistency bugs with OLH objects added
Updated by Backport Bot 10 months ago
- Copied to Backport #62066: quincy: Consistency bugs with OLH objects added
Updated by Cory Snyder 10 months ago
- Related to Bug #62075: New radosgw-admin commands to cleanup leftover OLH index entries and unlinked instance objects added
Updated by Konstantin Shalygin about 1 month ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100