Ceph » rgw

% Done: 100%
Source: Community (dev)
Tags: backport_processed
Backport: pacific quincy reef
Severity: 1 - critical
Pull request ID: 51700

Description

When a PUT request is waiting on a reshard of a versioned bucket, it does not properly update its bucket reference after the reshard completes, and it fails after storing the object instance but before linking it into the bucket index. As a result, the object is stored on disk and counted in the bucket stats, but is not visible in bucket listings. The failed PUT also initializes the OLH RADOS object but never adds the user.rgw.olh.info xattr (which informs the is_olh() predicate). Future GET requests for that key therefore return a 200 with an empty object, because the OLH is treated as a plain unversioned object. This can wreak havoc on clients that use well-known keys to store formatted data and fail to parse an unexpectedly empty object.
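
To make the empty-object symptom concrete, here is a minimal toy model in Python. It is not RGW code: `RadosObject`, `is_olh`, and `read_key` are hypothetical names, and storing the target instance name directly in the xattr is a simplifying assumption (the real xattr holds an encoded structure). It only illustrates why a head object without user.rgw.olh.info is served as an empty plain object.

```python
from dataclasses import dataclass, field

# xattr name taken from the description above
OLH_INFO_XATTR = "user.rgw.olh.info"


@dataclass
class RadosObject:
    """Toy stand-in for a RADOS object: a payload plus xattrs."""
    data: bytes = b""
    xattrs: dict = field(default_factory=dict)


def is_olh(obj: RadosObject) -> bool:
    # Hypothetical model of the predicate: the object counts as an
    # OLH only if the olh.info xattr is present.
    return OLH_INFO_XATTR in obj.xattrs


def read_key(head: RadosObject, instances: dict) -> bytes:
    # If the head is an OLH, follow it to the current object instance
    # (assumption: the xattr value is simply the instance name).
    if is_olh(head):
        return instances[head.xattrs[OLH_INFO_XATTR]].data
    # Otherwise the head is read as a plain unversioned object.
    return head.data


instances = {"v1": RadosObject(data=b'{"ok": true}')}
healthy = RadosObject(xattrs={OLH_INFO_XATTR: "v1"})
broken = RadosObject()  # OLH initialized, but the xattr was never written

print(read_key(healthy, instances))
print(read_key(broken, instances))
```

In this model the instance data for the broken key still exists in `instances`, which mirrors the dark data described above: it is stored but unreachable through the head object.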

This was fixed on master and in Reef as part of the multi-site changes [1], but we could use a test case to ensure there are no future regressions on those branches. We need backports of [1] for Quincy and Pacific.

There is also a need for index cleanup tooling since buckets affected by this issue have inconsistent stats, inconsistent OLH RADOS objects, and dark data instance objects.

Related issues 8 (1 open — 7 closed)

Related to rgw - Bug #50552: rgw: set_olh return -2 when resharding (Triaged, Mark Kogan)
Related to rgw - Bug #59663: rgw: expired delete markers created by deleting non-existant object multiple times are not being removed from data pool after deletion from bucket (Resolved)
Related to rgw - Bug #59164: LC rules cause latency spikes (Can't reproduce)
Related to rgw - Bug #61710: quincy/pacific: PUT requests during reshard of versioned bucket fail with 404 and leave behind dark data (Won't Fix)
Related to rgw - Bug #62075: New radosgw-admin commands to cleanup leftover OLH index entries and unlinked instance objects (Resolved)
Copied to rgw - Backport #62064: pacific: Consistency bugs with OLH objects (Resolved)
Copied to rgw - Backport #62065: reef: Consistency bugs with OLH objects (Resolved)
Copied to rgw - Backport #62066: quincy: Consistency bugs with OLH objects (Resolved)

Updated by Cory Snyder 12 months ago

Affected Versions v16.0.0, v16.0.1, v16.1.0, v16.1.1, v16.2.0, v16.2.1, v16.2.10, v16.2.11, v16.2.12, v16.2.13, v16.2.2, v16.2.3, v16.2.4, v16.2.5, v16.2.6, v16.2.7, v16.2.8, v16.2.9, v17.0.0, v17.2.1, v17.2.2, v17.2.3, v17.2.4, v17.2.5 added

Updated by Cory Snyder 12 months ago

Pull request ID set to 51700

Updated by Casey Bodley 12 months ago

Related to Bug #50552: rgw: set_olh return -2 when resharding added

Updated by Cory Snyder 12 months ago

Related to Bug #59663: rgw: expired delete markers created by deleting non-existant object multiple times are not being removed from data pool after deletion from bucket added

Updated by Cory Snyder 12 months ago

Related to Bug #59164: LC rules cause latency spikes added

Updated by Casey Bodley 12 months ago

Status changed from New to Fix Under Review
Backport changed from pacific quincy to pacific quincy reef

tagged for reef since we'll at least want the recovery command there

[1] https://github.com/ceph/ceph/commit/f57973725feeaa84321884c8eebc048989446572
[2] https://github.com/ceph/ceph/pull/39002/commits/7348a8397af99752fd64ce0a44a95a405c6b9e3e

Updated by Cory Snyder 11 months ago

With further investigation, I found that the previously referenced commit [1] was not responsible for fixing this scenario on main/reef. In fact, that commit was actually resolving a different sort of PUT 404 scenario that did not affect earlier releases.

The actual reason this issue isn't observed on main/reef is [2]. Because the bucket instance ID doesn't change during resharding on those branches, there is no bucket instance metadata object to remove. The ENOENT error was caused by an attempt to retrieve the bucket instance metadata object associated with the old bucket instance.
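
The difference can be sketched with a small Python toy model. This is not RGW code: `BucketInstanceStore` and `reshard` are hypothetical, and the assumption that the older behavior creates a new instance ID and removes the old metadata object is inferred from the explanation above. The sketch only shows why a request holding a stale instance ID hits ENOENT in one case and not the other.

```python
import errno


class BucketInstanceStore:
    """Toy metadata store keyed by bucket instance ID (hypothetical)."""

    def __init__(self):
        self._meta = {}

    def put(self, instance_id, meta):
        self._meta[instance_id] = meta

    def get(self, instance_id):
        if instance_id not in self._meta:
            raise OSError(errno.ENOENT, "no such bucket instance")
        return self._meta[instance_id]

    def remove(self, instance_id):
        self._meta.pop(instance_id, None)


def reshard(store, old_id, num_shards, keep_instance_id):
    meta = dict(store.get(old_id), num_shards=num_shards)
    if keep_instance_id:
        # Behavior attributed to [2]: the instance ID is unchanged,
        # so nothing is removed.
        store.put(old_id, meta)
        return old_id
    # Assumed older behavior: new instance ID, old metadata removed.
    new_id = old_id + ".resharded"
    store.put(new_id, meta)
    store.remove(old_id)
    return new_id


for keep in (False, True):
    store = BucketInstanceStore()
    store.put("bucket.1", {"num_shards": 11})
    stale_id = "bucket.1"  # reference held by a PUT waiting on reshard
    reshard(store, "bucket.1", 23, keep_instance_id=keep)
    try:
        store.get(stale_id)
        print(f"keep_instance_id={keep}: stale reference still resolves")
    except OSError as e:
        print(f"keep_instance_id={keep}: errno {e.errno} (ENOENT)")
```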

Updated by Cory Snyder 11 months ago

Subject changed from PUT requests during reshard of versioned bucket fail with 404 and leave behind dark data to Consistency bugs with OLH objects
