Bug #57770: RGW (pacific) misplaces index entries after dynamically resharding bucket - rgw - Ceph

Actions

Copy link

Bug #57770

closed

RGW (pacific) misplaces index entries after dynamically resharding bucket

Added by Nick Janus over 1 year ago. Updated about 1 year ago.

Status:

Resolved

Priority:

Normal

Assignee:

J. Eric Ivancich

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

Ceph - v14.2.18, Ceph - v16.2.9

ceph-qa-suite:

Pull request ID:

48663

Crash signature (v1):

Crash signature (v2):

Description

When RGW reshards buckets with ~250k index entries*, I've noticed some s3:PutObject requests that return 200 end up with index entries under the old index shard oids, presumably recreated after dynamically resharding deleted the original shards. These objects can be retrieved via s3:GetObject, since that doesn't typically look for an index entry, but s3:ListObjects doesn't show any information about the successfully written object.

This does not happen reliably. I've been able to recreate this in a staging environment in 1-4 of every 30 buckets, with and without rgw caching enabled on the rgw nodes serving the s3:PutObject requests. mtime on the misplaced index entries shows a burst of writes (all within a few seconds) going to the wrong index after resharding completes, with the majority of writes going to the correct index. This burst can happen multiple times after resharding, from 20s after resharding completes to 12m. I haven't spotted any useful error debug logging in the client logs, but have also not turned up logging very high at all, in part due the volume of requests required for recreating this issue. Let me know if there's any debug information that would be useful. Thanks for reading!

I don't think 250k is significant in any way, it's just the threshold at which we tend to reshard the most buckets.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Matt Benjamin over 1 year ago

Assignee set to J. Eric Ivancich

Actions

Copy link

Updated by J. Eric Ivancich over 1 year ago

Status changed from New to Need More Info

So I looked at the code in 16.2.9 to try to understand how this might happen. The final step in adding an object to the bucket index is a call to rgw_bucket_complete_op() in src/cls/rgw/cls_rgw.cc.

It matches the completion op to the corresponding op in pending_map via a matching tag. If it fails to find matching op in the pending_map, then the operation fails and the item is never added to the bucket index.

The theory is that the bucket index shard does not exist at this moment, as it was deleted due to resharding and then recreated as part of this process. But if it's deleted then it wouldn't have been able to find the matching tag.

So are OSDs caching these objects despite removal?

Actions

Copy link

Updated by J. Eric Ivancich over 1 year ago

Here is the code that does this:

  if (op.tag.size()) {                                                                                                                                              
    auto pinter = entry.pending_map.find(op.tag);                                                                                                                   
    if (pinter == entry.pending_map.end()) {                                                                                                                        
      CLS_LOG(1, "ERROR: couldn't find tag for pending operation\n");                                                                                               
      return -EINVAL;                                                                                                                                               
    }                                                                                                                                                               
    entry.pending_map.erase(pinter);                                                                                                                                
  }

Actions

Copy link

Updated by Nick Janus over 1 year ago

J. Eric Ivancich wrote:

The theory is that the bucket index shard does not exist at this moment, as it was deleted due to resharding and then recreated as part of this process. But if it's deleted then it wouldn't have been able to find the matching tag.

So are OSDs caching these objects despite removal?

Hi Eric, I'm not sure if this is a rhetorical question, but we don't have any special caching mechanism associated with our osds. Is there a way I could test for this? Alternatively, is is possible this is is some kind of race between rgw threads or being caused by reuse of memory within a RGW thread?

Actions

Copy link

Updated by J. Eric Ivancich over 1 year ago

Nick Janus wrote:

J. Eric Ivancich wrote:

The theory is that the bucket index shard does not exist at this moment, as it was deleted due to resharding and then recreated as part of this process. But if it's deleted then it wouldn't have been able to find the matching tag.

So are OSDs caching these objects despite removal?

Hi Eric, I'm not sure if this is a rhetorical question, but we don't have any special caching mechanism associated with our osds. Is there a way I could test for this? Alternatively, is is possible this is is some kind of race between rgw threads or being caused by reuse of memory within a RGW thread?

I think the more likely case, which Casey came up with, is that we're re-creating the shard when we prepare the op, so that when we complete the op the shard exists along with the pending op. So I'm looking to see how best to recognize that situation and transitioning to the new bucket index shard.

Actions

Copy link

Updated by Casey Bodley over 1 year ago

Status changed from Need More Info to New

Actions

Copy link

Updated by Casey Bodley over 1 year ago

Status changed from New to Triaged

Actions

Copy link

Updated by J. Eric Ivancich over 1 year ago

Status changed from Triaged to Fix Under Review
Pull request ID set to 48663

Actions

Copy link

Updated by J. Eric Ivancich over 1 year ago

Nick,

I don't know that I have a cluster at my fingertips that might be necessary to test this potential fix. How small a reproducer are you able to make? And are you able to test a version for which a PR exists?

Thanks,

Eric

Actions

Copy link

#10

Updated by Nick Janus over 1 year ago

J. Eric Ivancich wrote:

Nick,

I don't know that I have a cluster at my fingertips that might be necessary to test this potential fix. How small a reproducer are you able to make? And are you able to test a version for which a PR exists?

Thanks,

Eric

Hi Eric,

I've tried reproducing with smaller buckets and rgw_debug enabled, but haven't managed to repro this with less than ~275k object buckets. If it backports ok to Pacific, I can test on one of our staging clusters next week. I'll give reproducing with smaller buckets another go too. Thank you for working on this!

Nick

Actions

Copy link

#11

Updated by J. Eric Ivancich over 1 year ago

The code on the PR seems to address the issue. My colleague Mark Kogan ran it through a test at scale and it behaved well. Unfortunately our testing lab has been down for a while, so it may not merge for a bit. Is it something you'd be able to build and try out, Nick?

Actions

Copy link

#12

Updated by J. Eric Ivancich over 1 year ago

Copied to Bug #58034: RGW misplaces index entries after dynamically resharding bucket added

Actions

Copy link

#13

Updated by J. Eric Ivancich over 1 year ago

Subject changed from RGW misplaces index entries after dynamically resharding bucket to RGW (pacific) misplaces index entries after dynamically resharding bucket

Actions

Copy link

#14

Updated by Nick Janus over 1 year ago

J. Eric Ivancich wrote:

The code on the PR seems to address the issue. My colleague Mark Kogan ran it through a test at scale and it behaved well. Unfortunately our testing lab has been down for a while, so it may not merge for a bit. Is it something you'd be able to build and try out, Nick?

Hi Eric! Sorry it's taken so long, but we've built this patch into our 16.2.9 branch, and it seems to fix the issue using our repro case! Thank you so much for looking into this.

Actions

Copy link

#15