
Bug #57942

rgw leaks rados objects when a part is submitted multiple times in a multipart upload

Added by Peter Goron 3 months ago. Updated about 2 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

The issue presented below affects all Ceph versions since at least 14.2 (reproducer tested on 14.2, 15.2, 16.2, and 17.2.5).

S3 clients can use a retry policy when uploading individual parts to mitigate transient network issues (https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html).
When a part has been uploaded multiple times (typically after retries) and the initial attempts reached Ceph, those earlier attempts are never purged, leaking index entries and rados objects.
It leads to:
  • an increase in bucket space usage that users cannot explain when they list their bucket's objects via the S3 REST API (leaked part uploads are not reported there, which is expected, but it generates confusion when users try to understand their quota usage)
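For reference, the client-side retry behavior that triggers this is just a botocore configuration; a minimal sketch (the exact attempt count and mode are illustrative, not taken from this report):

```python
from botocore.config import Config

# A retry policy like this (standard mode, up to 3 attempts) makes the client
# resend a part upload after a timeout -- the scenario that triggers the leak.
retrying_config = Config(retries={'max_attempts': 3, 'mode': 'standard'})

# Then passed to the client: boto3.client('s3', ..., config=retrying_config)
```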

In our case, this issue is recurrent on a cluster with many concurrent multipart uploads every 5 minutes, where AWS S3 clients can time out before the RGWs have had time to process the load.

Reproducer:

  • Execute the following Python script. It creates a test bucket, uploads into it an object with two 6 MiB parts, and simulates a retry of part 1.
    import boto3
    import logging
    import time
    from botocore.config import Config as boto_config
    
    boto3.set_stream_logger('botocore', logging.INFO)
    
    # nautilus
    client = boto3.client('s3',
                            endpoint_url='<ceph_url>',
                            aws_access_key_id='<access_key>',
                            aws_secret_access_key='<secret_key>',
                            config=boto_config(
                              signature_version='s3v4',
                              s3={'addressing_style': 'path'})  # set path to do path style access
                          )
    
    # create bucket
    bucket='test'
    print(client.create_bucket(Bucket=bucket))
    
    # initiate a multipart upload
    key='object-00001'
    mpu = client.create_multipart_upload(Bucket=bucket, Key=key)
    mpu_id = mpu["UploadId"]
    parts=[]
    print(mpu)
    
    part_data = "a" * 1024 * 1024 * 6
    # upload part 1
    part=client.upload_part(Body=part_data, Bucket=bucket, Key=key, UploadId=mpu_id, PartNumber=1)
    print(part)
    
    time.sleep(5)
    
    # reupload part 1 (simulating a retry)
    part=client.upload_part(Body=part_data, Bucket=bucket, Key=key, UploadId=mpu_id, PartNumber=1)
    print(part)
    parts.append({"PartNumber": 1, "ETag": part["ETag"]})
    
    time.sleep(5)
    
    # upload part 2
    part=client.upload_part(Body=part_data, Bucket=bucket, Key=key, UploadId=mpu_id, PartNumber=2)
    print(part)
    parts.append({"PartNumber": 2, "ETag": part["ETag"]})
    
    time.sleep(5)
    
    # complete multipart upload
    print(client.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=mpu_id,
            MultipartUpload={"Parts": parts}))
    

The above code generates the following rados objects:

# rados -p default.rgw.buckets.data ls | xargs -n 1 rados -p default.rgw.buckets.data stat
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__multipart_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.1 mtime 2022-10-28T11:29:05.000000+0000, size 4194304 <- part1 head, first attempt
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__shadow_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.1_1 mtime 2022-10-28T11:29:05.000000+0000, size 2097152 <- part1 tail, first attempt

default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__multipart_object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F.1 mtime 2022-10-28T11:29:10.000000+0000, size 4194304 <- part1 head, second attempt
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__shadow_object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F.1_1 mtime 2022-10-28T11:29:10.000000+0000, size 2097152 <- part1 tail, second attempt

default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__multipart_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.2 mtime 2022-10-28T11:29:15.000000+0000, size 4194304 <- part2 head
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__shadow_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.2_1 mtime 2022-10-28T11:29:15.000000+0000, size 2097152 <- part2 tail

default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5_object-00001 mtime 2022-10-28T11:29:20.000000+0000, size 0 <- obj head

And here are details from the index, which reports two objects (whereas only one is expected):

radosgw-admin bucket stats --bucket test
{
    "bucket": "test",
    "num_shards": 11,
    "tenant": "",
    "zonegroup": "c06c47b1-cbbd-4c30-bc1c-fe2d2a62f16d",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": "" 
    },
    "id": "f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5",
    "marker": "f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5",
    "index_type": "Normal",
    "owner": "nano",
    "ver": "0#1,1#1,2#1,3#1,4#1,5#1,6#1,7#1,8#7,9#1,10#1",
    "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0",
    "mtime": "0.000000",
    "creation_time": "2022-10-28T11:29:04.929873Z",
    "max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#",
    "usage": {
        "rgw.main": {
            "size": 18874368,
            "size_actual": 18874368,
            "size_utilized": 18874368,
            "size_kb": 18432,
            "size_kb_actual": 18432,
            "size_kb_utilized": 18432,
            "num_objects": 2
        },
        "rgw.multimeta": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 0
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}

# radosgw-admin bi list --bucket test | grep idx
        "idx": "_multipart_object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F.1", <- index entry for head of part1 SECOND attempt
        "idx": "object-00001", <- index entry for head of final object

On the client side, a single object is reported (12 MiB vs the 18 MiB reported by radosgw-admin bucket stats):

# ./mc ls -r ceph-nano test
[2022-10-28 13:29:20 CEST]  12MiB STANDARD test/object-00001
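The 6 MiB discrepancy matches exactly one leaked part, which a quick calculation confirms (values taken from the outputs above):

```python
# Sanity check of the space leak: bucket stats reports 18 MiB while the
# client sees a single 12 MiB object, so the difference is one 6 MiB part.

PART_SIZE = 6 * 1024 * 1024           # size of each part in the reproducer

stats_size = 18874368                 # "size" from radosgw-admin bucket stats
visible_size = 2 * PART_SIZE          # the 12 MiB object listed by the client

leaked = stats_size - visible_size
print(leaked, leaked // PART_SIZE)    # leaked bytes and leaked part count
```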

If we delete the object, the index still contains the SECOND attempt of part 1:

# ./mc rm ceph-nano/test/object-00001
Removed `ceph-nano/test/object-00001`.

# radosgw-admin bucket stats --bucket test
{
    "bucket": "test",
    "num_shards": 11,
    "tenant": "",
    "zonegroup": "c06c47b1-cbbd-4c30-bc1c-fe2d2a62f16d",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": "" 
    },
    "id": "f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5",
    "marker": "f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5",
    "index_type": "Normal",
    "owner": "nano",
    "ver": "0#1,1#1,2#1,3#1,4#1,5#1,6#1,7#1,8#8,9#1,10#1",
    "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0",
    "mtime": "0.000000",
    "creation_time": "2022-10-28T11:29:04.929873Z",
    "max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#",
    "usage": {
        "rgw.main": {
            "size": 6291456,
            "size_actual": 6291456,
            "size_utilized": 6291456,
            "size_kb": 6144,
            "size_kb_actual": 6144,
            "size_kb_utilized": 6144,
            "num_objects": 1
        },
        "rgw.multimeta": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 0
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}
# radosgw-admin bi list --bucket test | grep idx
        "idx": "_multipart_object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F.1", <- index entry for head of part1 SECOND attempt

On the rados side, the final object's head has been removed, but all parts and their shadow objects remain until we force GC:

# rados -p default.rgw.buckets.data ls | xargs -n 1 rados -p default.rgw.buckets.data stat
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__shadow_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.1_1 mtime 2022-10-28T11:29:05.000000+0000, size 2097152
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__multipart_object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F.1 mtime 2022-10-28T11:29:10.000000+0000, size 4194304
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__multipart_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.2 mtime 2022-10-28T11:29:15.000000+0000, size 4194304
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__shadow_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.2_1 mtime 2022-10-28T11:29:15.000000+0000, size 2097152
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__multipart_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.1 mtime 2022-10-28T11:29:05.000000+0000, size 4194304
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__shadow_object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F.1_1 mtime 2022-10-28T11:29:10.000000+0000, size 2097152

# radosgw-admin gc process --include-all

After GC, only the head & shadow of part 1's FIRST attempt remain:

# rados -p default.rgw.buckets.data ls | xargs -n 1 rados -p default.rgw.buckets.data stat
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__shadow_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.1_1 mtime 2022-10-28T11:29:05.000000+0000, size 2097152
default.rgw.buckets.data/f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__multipart_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.1 mtime 2022-10-28T11:29:05.000000+0000, size 4194304

# radosgw-admin bucket radoslist --bucket test
(nothing returned)
# rgw-orphan-list
Available pools:
    .rgw.root
    default.rgw.log
    .mgr
    default.rgw.control
    default.rgw.meta
    default.rgw.buckets.index
    default.rgw.buckets.non-ec
    default.rgw.buckets.data
Which pool do you want to search for orphans (for multiple, use space-separated list)? default.rgw.buckets.data
Pool is "default.rgw.buckets.data".
Note: output files produced will be tagged with the current timestamp -- 20221028120023.
running 'rados ls' at Fri Oct 28 12:00:31 UTC 2022
running 'rados ls' on pool default.rgw.buckets.data.
running 'radosgw-admin bucket radoslist' at Fri Oct 28 12:00:31 UTC 2022
computing delta at Fri Oct 28 12:00:31 UTC 2022
2 potential orphans found out of a possible 2 (100%).
The results can be found in './orphan-list-20221028120023.out'.
    Intermediate files are './rados-20221028120023.intermediate' and './radosgw-admin-20221028120023.intermediate'.
***
*** WARNING: This is EXPERIMENTAL code and the results should be used
***          only with CAUTION!
***
Done at Fri Oct 28 12:00:31 UTC 2022.

# cat ./orphan-list-20221028120023.out
f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__multipart_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.1
f8a4d551-0e1f-49c2-b19c-2b81a6b0423e.14117.5__shadow_object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU.1_1

At this stage, there are two issues:
1. the RGW index contains an orphan entry (i.e. no associated rados object): _multipart_object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F.1, corresponding to the SECOND upload of part 1
2. the data pool contains 2 orphan rados objects, referenced neither in the index nor in any manifest, corresponding to the FIRST upload attempt of part 1
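The first kind of leak can be spotted from the `radosgw-admin bi list` output shown above: after a completed (or deleted) upload, no `_multipart_` entries should remain. A small sketch of such a check, fed with sample data mimicking the two entries above (the entry structure is assumed from the grep output; a real run would pipe in the actual JSON):

```python
import json

# Sample mimicking `radosgw-admin bi list --bucket test` after completion.
bi_list_output = json.dumps([
    {"type": "plain", "idx": "_multipart_object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F.1"},
    {"type": "plain", "idx": "object-00001"},
])

# Multipart index entries surviving after the upload is complete are leaks.
leftovers = [entry["idx"] for entry in json.loads(bi_list_output)
             if entry["idx"].startswith("_multipart_")]
print(leftovers)
```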

For the first issue, I haven't yet been able to track it down completely, but I'm pretty sure the issue lies in RadosMultipartUpload::complete (which computes the list of index entries to remove when completing a multipart upload).

For the second issue (orphan rados objects), the culprit seems to be MultipartObjectProcessor::complete (completion of a part upload). When it handles a re-uploaded part, it updates the multipart manifest with the head of the new part without marking any existing objects for that part number for GC.


Related issues

Duplicates rgw - Bug #16767: RadosGW Multipart Cleanup Failure Fix Under Review

History

#1 Updated by Peter Goron 3 months ago

After digging more into the issue, I think the root cause is linked to the following code:

https://github.com/ceph/ceph/blob/main/src/rgw/rgw_putobj_processor.cc#L363

int MultipartObjectProcessor::process_first_chunk(bufferlist&& data,
                                                  DataProcessor **processor)
{
  // write the first chunk of the head object as part of an exclusive create,
  // then drain to wait for the result in case of EEXIST
  int r = writer.write_exclusive(data);
  if (r == -EEXIST) {
    // randomize the oid prefix and reprepare the head/manifest
    std::string oid_rand = gen_rand_alphanumeric(store->ctx(), 32);

    mp.init(target_obj->get_name(), upload_id, oid_rand);
    manifest.set_prefix(target_obj->get_name() + "." + oid_rand);

    r = prepare_head();
    if (r < 0) {
      return r;
    }
    // resubmit the write op on the new head object
    r = writer.write_exclusive(data);
  }
  if (r < 0) {
    return r;
  }
  *processor = &stripe;
  return 0;
}

When a part is re-uploaded, a new prefix is computed for the part object.

In the above example:
initial prefix: object-00001.2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU
new prefix for part 1 after retry: object-00001.5QfKe-yJL2cr_CDV2muU27LcCgAiIi5F

During multipart upload completion, RGW deletes the meta and part index entries and creates an entry for the final object.
The part keys to remove are rebuilt using the initial prefix (target object name + "." + initial upload id):
https://github.com/ceph/ceph/blob/main/src/rgw/rgw_sal_rados.cc#L2549

      string oid = mp_obj.get_part(part->info.num);
      rgw_obj src_obj;
      src_obj.init_ns(bucket->get_key(), oid, mp_ns);

...

      rgw_obj_index_key remove_key;
      src_obj.key.get_index_key(&remove_key);

      remove_objs.push_back(remove_key);

For the re-uploaded parts, this deletes the index entries of the first attempt (because the initial prefix matches) but not those of subsequent attempts. This explains why the index still contains re-uploaded parts after multipart upload completion.

In the rados pool, all objects corresponding to intermediate part retries stay orphaned indefinitely (not referenced in the manifest of the final object).
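The index-entry leak can be modeled in a few lines. This is a toy model of the mechanism described above, not RGW internals: each attempt writes a head object whose index key embeds a prefix; the first attempt uses the upload id, a retry gets a fresh random prefix (mirroring the EEXIST path in process_first_chunk), and completion removes only keys rebuilt from the initial prefix.

```python
import random
import string

upload_id = "2~flIvzohLSlJ4V3PoVYadiuuNO1IRhjU"  # initial prefix source
obj = "object-00001"

def part_key(prefix, num):
    # Illustrative index-key shape, matching the `bi list` entries above.
    return f"_multipart_{obj}.{prefix}.{num}"

# The retry path randomizes the oid prefix (as gen_rand_alphanumeric does).
rand_prefix = "".join(random.choices(string.ascii_letters + string.digits, k=32))

index = {
    part_key(upload_id, 1),    # part 1, first attempt
    part_key(rand_prefix, 1),  # part 1, retried under a new prefix
    part_key(upload_id, 2),    # part 2
}

# complete_multipart_upload: keys to remove are rebuilt from the initial
# prefix only, so the retried part's entry is never matched.
for num in (1, 2):
    index.discard(part_key(upload_id, num))

print(index)  # only the retried part's entry survives -> leaked index entry
```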

#2 Updated by Peter Goron 3 months ago

FYI, I'm working on https://github.com/pgoron/ceph/commits/fix_rgw_rados_leaks_57942 to fix both issues (index entry leaks and rados object leaks). It is validated on 17.2.5; I'm now preparing a PR against the main branch.

#3 Updated by Casey Bodley 3 months ago

  • Duplicates Bug #16767: RadosGW Multipart Cleanup Failure added

#5 Updated by Casey Bodley about 2 months ago

  • Status changed from New to Duplicate

closing this in favor of https://tracker.ceph.com/issues/16767
