Bug #61359 (closed)

Consistency bugs with OLH objects

Added by Cory Snyder 12 months ago. Updated 29 days ago.

Status: Resolved
Priority: Urgent
Assignee: -
Target version: -
% Done: 100%
Source: Community (dev)
Tags: backport_processed
Backport: pacific quincy reef
Regression: No
Severity: 1 - critical
Reviewed: -
Description

When a PUT request is waiting on a reshard, it does not properly update its bucket reference once the reshard completes, and it fails after storing the object instance but before linking it into the bucket index. As a result, the object is stored on disk and accounted for in the bucket stats, but is not visible in bucket listings. Additionally, the request initializes the OLH RADOS object but never adds the user.rgw.olh.info xattr (which informs the is_olh() predicate). This means that future GET requests for that key return a 200 with an empty body, since the OLH is treated as a plain unversioned object. This can wreak havoc on clients that use well-known keys to store formatted data and fail to parse an unexpectedly empty object.

This was fixed on master and in Reef as part of the multi-site changes [1], but we could use a test case to ensure there are no future regressions on those branches. We need backports of [1] for Quincy and Pacific.

There is also a need for index cleanup tooling since buckets affected by this issue have inconsistent stats, inconsistent OLH RADOS objects, and dark data instance objects.

[1] https://github.com/ceph/ceph/commit/f57973725feeaa84321884c8eebc048989446572
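For illustration only (not part of the fix or the linked PRs), here is a minimal Python/boto3 sketch of what a client-side probe for the symptom described above might look like. The endpoint, credentials, bucket name, and key list are placeholders: a key that returns 200 with an empty body, or that is readable but absent from listings, matches the inconsistency reported here.

    # Hedged sketch: probe for the symptom described above (GET returns 200 with
    # an empty body, or a readable key is missing from listings). The endpoint,
    # credentials, bucket, and keys below are placeholders, not values from this
    # ticket.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:8080",  # placeholder RGW endpoint
        aws_access_key_id="ACCESS_KEY",              # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

    bucket = "example-bucket"                 # placeholder bucket name
    expected_keys = ["config/settings.json"]  # keys the app expects to be non-empty

    # Collect the keys currently visible in bucket listings.
    listed = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            listed.add(obj["Key"])

    for key in expected_keys:
        try:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            print(f"{key}: not readable (NoSuchKey)")
            continue
        if not body:
            print(f"{key}: GET returned 200 with an empty body (possible broken OLH)")
        if key not in listed:
            print(f"{key}: readable but missing from bucket listing (possible unlinked instance)")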


Related issues: 8 (1 open, 7 closed)

  • Related to rgw - Bug #50552: rgw: set_olh return -2 when resharding (Triaged, Mark Kogan)
  • Related to rgw - Bug #59663: rgw: expired delete markers created by deleting non-existant object multiple times are not being removed from data pool after deletion from bucket (Resolved, Cory Snyder)
  • Related to rgw - Bug #59164: LC rules cause latency spikes (Can't reproduce)
  • Related to rgw - Bug #61710: quincy/pacific: PUT requests during reshard of versioned bucket fail with 404 and leave behind dark data (Won't Fix, Cory Snyder)
  • Related to rgw - Bug #62075: New radosgw-admin commands to cleanup leftover OLH index entries and unlinked instance objects (Resolved, Cory Snyder)
  • Copied to rgw - Backport #62064: pacific: Consistency bugs with OLH objects (Resolved, Cory Snyder)
  • Copied to rgw - Backport #62065: reef: Consistency bugs with OLH objects (Resolved, Cory Snyder)
  • Copied to rgw - Backport #62066: quincy: Consistency bugs with OLH objects (Resolved, Cory Snyder)
Actions #1

Updated by Cory Snyder 12 months ago

  • Affected Versions v16.0.0, v16.0.1, v16.1.0, v16.1.1, v16.2.0, v16.2.1, v16.2.10, v16.2.11, v16.2.12, v16.2.13, v16.2.2, v16.2.3, v16.2.4, v16.2.5, v16.2.6, v16.2.7, v16.2.8, v16.2.9, v17.0.0, v17.2.1, v17.2.2, v17.2.3, v17.2.4, v17.2.5 added
Actions #2

Updated by Cory Snyder 12 months ago

  • Pull request ID set to 51700
Actions #3

Updated by Casey Bodley 12 months ago

  • Related to Bug #50552: rgw: set_olh return -2 when resharding added
Actions #4

Updated by Cory Snyder 11 months ago

  • Related to Bug #59663: rgw: expired delete markers created by deleting non-existant object multiple times are not being removed from data pool after deletion from bucket added
Actions #5

Updated by Cory Snyder 11 months ago

  • Related to Bug #59164: LC rules cause latency spikes added
Actions #6

Updated by Casey Bodley 11 months ago

  • Status changed from New to Fix Under Review
  • Backport changed from pacific quincy to pacific quincy reef

tagged for reef since we'll at least want the recovery command there

Actions #7

Updated by Cory Snyder 11 months ago

Upon further investigation, I found that the previously referenced commit [1] was not responsible for fixing this scenario on main/reef. That commit actually resolved a different PUT 404 scenario that did not affect earlier releases.

The actual reason this issue isn't observed on main/reef is [2]. Because the bucket instance ID no longer changes during resharding, there is no old bucket instance metadata object to remove; it was the attempt to retrieve the bucket instance metadata object associated with the old bucket instance that caused the ENOENT error.

[1] https://github.com/ceph/ceph/commit/f57973725feeaa84321884c8eebc048989446572
[2] https://github.com/ceph/ceph/pull/39002/commits/7348a8397af99752fd64ce0a44a95a405c6b9e3e
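As an operator-side illustration of the point above (a sketch under the assumption that the "id" field in the JSON output of radosgw-admin bucket stats is the bucket instance ID, with a placeholder bucket name), one can compare the instance ID before and after a reshard; per [2], it remains unchanged on main/reef, whereas it changes on earlier releases.

    # Hedged operator-side sketch. Assumptions: the "id" field in the JSON output
    # of `radosgw-admin bucket stats` is the bucket instance ID, and the bucket
    # name is a placeholder. The reshard itself is triggered separately (e.g.
    # `radosgw-admin bucket reshard`).
    import json
    import subprocess

    def bucket_instance_id(bucket: str) -> str:
        out = subprocess.run(
            ["radosgw-admin", "bucket", "stats", f"--bucket={bucket}"],
            check=True, capture_output=True, text=True,
        ).stdout
        return json.loads(out)["id"]

    bucket = "example-bucket"  # placeholder
    before = bucket_instance_id(bucket)
    input("Reshard the bucket now, then press Enter to re-check... ")
    after = bucket_instance_id(bucket)

    if before == after:
        print(f"bucket instance id unchanged across reshard ({before}) -- behavior from [2]")
    else:
        print(f"bucket instance id changed: {before} -> {after} -- pre-[2] behavior")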

Actions #8

Updated by Cory Snyder 11 months ago

  • Subject changed from PUT requests during reshard of versioned bucket fail with 404 and leave behind dark data to Consistency bugs with OLH objects
Actions #9

Updated by Cory Snyder 11 months ago

  • Related to Bug #61710: quincy/pacific: PUT requests during reshard of versioned bucket fail with 404 and leave behind dark data added
Actions #10

Updated by Casey Bodley 10 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #11

Updated by Backport Bot 10 months ago

  • Copied to Backport #62064: pacific: Consistency bugs with OLH objects added
Actions #12

Updated by Backport Bot 10 months ago

  • Copied to Backport #62065: reef: Consistency bugs with OLH objects added
Actions #13

Updated by Backport Bot 10 months ago

  • Copied to Backport #62066: quincy: Consistency bugs with OLH objects added
Actions #14

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
Actions #15

Updated by Cory Snyder 10 months ago

  • Related to Bug #62075: New radosgw-admin commands to cleanup leftover OLH index entries and unlinked instance objects added
Actions #16

Updated by Konstantin Shalygin 29 days ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100