Bug #16767: RadosGW Multipart Cleanup Failure
Status: Closed
% Done: 100%
Description
My current setup is a Ceph Hammer cluster running 0.94.6. The rest of the cluster details are irrelevant to this issue.
I've stumbled upon an issue whereby RGW does not clean up properly after a multipart upload finishes (whether aborted or completed). If a client re-uploads a part during a multipart upload, Ceph stores both the original and the new part, but only the latter part is valid when POSTing the CompleteMultipartUpload XML payload. When the multipart upload is completed (through either abort or complete), only the initial parts are removed from the system. The remaining parts are orphaned and are not (easily) removable.
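As a rough model of the behavior described above (an illustrative sketch, not RGW source code; the naming scheme is inferred from the bucket listings later in this report): the first upload of each part is stored under a name embedding the upload id, while a re-upload of the same part number gets a fresh random tag. Completion removes only the id-tagged names, and abort removes only the others, so each path leaks one set of parts.

```python
# Illustrative model of the leak (not RGW source code).  The first upload of
# each part is stored as "<key>.<uploadId>.<n>"; a re-upload of the same part
# number is stored under a fresh random tag instead ("<key>.<randomTag>.<n>").

def after_complete(stored, upload_id):
    """Complete removes only names that embed the upload id,
    so re-uploaded parts survive as orphans."""
    return [n for n in stored if f".{upload_id}." not in n]

def after_abort(stored, upload_id):
    """Abort does the reverse: it removes the randomly tagged re-uploads
    and leaves the id-tagged originals behind."""
    return [n for n in stored if f".{upload_id}." in n]

stored = [
    "mptest.2~whateverid.1",  # first upload of part 1
    "mptest.2~whateverid.2",  # first upload of part 2
    "mptest.randomtagA.1",    # re-upload of part 1 (random tag)
    "mptest.randomtagB.2",    # re-upload of part 2 (random tag)
]

print(after_complete(stored, "2~whateverid"))  # re-uploads leak on complete
print(after_abort(stored, "2~whateverid"))     # originals leak on abort
```

Either way, the leaked objects remain in the data pool but are no longer reachable through the S3 API, which is what the reproduction below demonstrates.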
To reproduce:
First, create four 5MiB files with unique md5 sums:
dd if=/dev/urandom of=/tmp/part1.1 bs=1M count=5
dd if=/dev/urandom of=/tmp/part1.2 bs=1M count=5
dd if=/dev/urandom of=/tmp/part2.1 bs=1M count=5
dd if=/dev/urandom of=/tmp/part2.2 bs=1M count=5
Next, initiate a multipart upload:
s3curl --id test -- -X POST 'http://ceph.cluster/bucket/mptest?uploads'
Upload the parts:
s3curl --id test --put /tmp/part1.1 -- 'http://ceph.cluster/bucket/mptest?partNumber=1&uploadId=2~whateverid'
s3curl --id test --put /tmp/part1.2 -- 'http://ceph.cluster/bucket/mptest?partNumber=2&uploadId=2~whateverid'
s3curl --id test --put /tmp/part2.1 -- 'http://ceph.cluster/bucket/mptest?partNumber=1&uploadId=2~whateverid'
s3curl --id test --put /tmp/part2.2 -- 'http://ceph.cluster/bucket/mptest?partNumber=2&uploadId=2~whateverid'
Now, let's take a look at what RGW says about the bucket:
radosgw-admin bucket list --bucket=bucket | grep -A7 mptest | grep -v owner | grep -v instance
        "name": "mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:15.000000Z",
        "etag": "785dec7eeb68366cca5c19cec86c508b",
--
        "name": "mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:24.000000Z",
        "etag": "b11c15f456f17ba763d0fb900d22376c",
--
        "name": "mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.meta",
        "namespace": "multipart",
        "size": 0,
        "mtime": "2016-07-21 18:43:00.000000Z",
        "etag": "",
--
        "name": "mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:39.000000Z",
        "etag": "2d26aa403bc759305d0ea61d29f17cd0",
--
        "name": "mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:31.000000Z",
        "etag": "a9fdb9efe0722f6e61d5d4ff3dfe0e81",
So we now have a meta object containing the upload id, the first two uploaded parts with the upload id embedded in their names, and the two re-uploaded parts without it.
Now, let's list the available parts associated with the id:
./s3curl --id test -- 'http://ceph.cluster/bucket/mptest?uploadId=2~whateverid' | xmlstarlet fo
<?xml version="1.0" encoding="UTF-8"?>
<ListPartsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Bucket>bucket</Bucket>
  <Key>mptest</Key>
  <UploadId>2~whateverid</UploadId>
  ...
  <Owner>
    <ID>7e1af43925cbef79334d2da290d602d586d04d7dd9aeb970c95ab93c0641c1f4</ID>
    <DisplayName>t3os_test</DisplayName>
  </Owner>
  <Part>
    <LastModified>2016-07-21T18:43:31.000Z</LastModified>
    <PartNumber>1</PartNumber>
    <ETag>a9fdb9efe0722f6e61d5d4ff3dfe0e81</ETag>
    <Size>5242880</Size>
  </Part>
  <Part>
    <LastModified>2016-07-21T18:43:39.000Z</LastModified>
    <PartNumber>2</PartNumber>
    <ETag>2d26aa403bc759305d0ea61d29f17cd0</ETag>
    <Size>5242880</Size>
  </Part>
</ListPartsResult>
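The ListParts response can also be cross-checked programmatically. A minimal sketch using Python's standard XML parser, with sample data trimmed from the response above (element names follow the S3 ListParts response schema):

```python
import xml.etree.ElementTree as ET

# Sample trimmed from the ListPartsResult shown above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<ListPartsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Part>
    <PartNumber>1</PartNumber>
    <ETag>a9fdb9efe0722f6e61d5d4ff3dfe0e81</ETag>
    <Size>5242880</Size>
  </Part>
  <Part>
    <PartNumber>2</PartNumber>
    <ETag>2d26aa403bc759305d0ea61d29f17cd0</ETag>
    <Size>5242880</Size>
  </Part>
</ListPartsResult>"""

NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

def live_parts(xml_text):
    """Map each part number to the ETag that ListParts reports as current."""
    root = ET.fromstring(xml_text)
    return {int(p.find("s3:PartNumber", NS).text): p.find("s3:ETag", NS).text
            for p in root.findall("s3:Part", NS)}

print(live_parts(SAMPLE))
```

Note that ListParts reports exactly one entry per part number (the last upload wins), even though the bucket index above shows both attempts stored.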
We see here that the available parts are the last two uploaded. So far, so good.
Now, let's go ahead and complete this thing.
{build a valid CompleteMultipartUpload document, saved as mp.test}
./s3curl --id test --post mp.test -- 'http://ceph.cluster/bucket/mptest?uploadId=2~whateverid'
Great success! I can now download the object, and it is the valid combination of the last two parts I uploaded.
Now, however, let's take a look at our bucket:
radosgw-admin bucket list --bucket=bucket | grep -A7 mptest | grep -v owner | grep -v instance
        "name": "mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:39.000000Z",
        "etag": "2d26aa403bc759305d0ea61d29f17cd0",
--
        "name": "mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:31.000000Z",
        "etag": "a9fdb9efe0722f6e61d5d4ff3dfe0e81",
--
        "name": "mptest",
        "namespace": "",
        "size": 10485760,
        "mtime": "2016-07-21 18:52:23.000000Z",
        "etag": "39967388ccf40f9570e7f3154549e589-2",
Upon completing the request, only the two parts tagged with the upload id are removed from the system. If I list out the .rgw.buckets pool, I can confirm that all of the parts are still present:
rados -p .rgw.buckets ls | grep mptest
default.7754.6__shadow_mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2_1
default.7754.6_mptest
default.7754.6__multipart_mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2
default.7754.6__multipart_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2
default.7754.6__shadow_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2_1
default.7754.6__multipart_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1
default.7754.6__shadow_mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1_1
default.7754.6__shadow_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1_1
default.7754.6__multipart_mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1
Aborting the upload yields similar results, except in reverse: the files that contain the upload id in the name are retained, while the other files are properly removed.
For small multipart uploads like this, the additional space used is trivial. But in our actual cluster, we have clients uploading considerably larger files, and they are noticing that their bucket utilization is tens of TB larger than the sum of the objects they can list. The files are not removed by garbage collection, and are generally only removable through a very slow process of listing the omap contents of the bucket shards in .rgw.buckets.index and removing the omap keys whose objects cannot be found.
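The cross-referencing step above can be sketched as follows. This is a hypothetical helper, not an existing tool; the object names are taken from the listings in this report, and `live_prefixes` stands in for the part names the completed object actually references:

```python
# Hypothetical helper: flag multipart/shadow rados objects that belong to no
# live part.  live_prefixes are the "<key>.<tag>.<partnum>" names of the parts
# the completed object still references.

def orphan_candidates(rados_names, live_prefixes):
    return sorted(
        n for n in rados_names
        if ("__multipart_" in n or "__shadow_" in n)
        and not any(p in n for p in live_prefixes)
    )

# rados ls output from the reproduction above
rados_names = [
    "default.7754.6__shadow_mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2_1",
    "default.7754.6_mptest",
    "default.7754.6__multipart_mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2",
    "default.7754.6__multipart_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2",
    "default.7754.6__shadow_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2_1",
    "default.7754.6__multipart_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1",
    "default.7754.6__shadow_mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1_1",
    "default.7754.6__shadow_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1_1",
    "default.7754.6__multipart_mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1",
]
# parts the completed object references, per the ListParts response
live_prefixes = [
    "mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2",
    "mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1",
]

for name in orphan_candidates(rados_names, live_prefixes):
    print(name)  # the 2~-tagged objects left behind after completion
```

In practice the live prefixes would be derived from the object's manifest, and deletion would go through `rados rm` on the data pool; treat this purely as a sketch of the matching logic.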
Updated by Brian Felton over 7 years ago
With apologies for pestering: is anyone looking into this? This bug affects the viability of Ceph as a backing store for a commercial product, since Ceph cannot be relied upon as a canonical source of truth for reporting storage utilization.
Also, what I reported earlier about being able to clean .rgw.buckets.index as a workaround was premature. While I can manually remove entries from .rgw.buckets and clean up orphans in .rgw.buckets.index, I have not found the secret sauce for actually getting the bucket's utilization to reflect the changes.
Updated by William Schroeder over 6 years ago
https://github.com/ceph/ceph/pull/17349 addresses the cleanup aspect of this bug; we wrote a tool that deletes the leaked multipart objects. The code is awaiting review, in case our change has unintended side-effects.
Updated by George Mihaiescu over 6 years ago
It would be great if this tool could be reviewed and backported to Jewel.
Updated by Chris Jones about 4 years ago
This is still an issue in Luminous 12.2.12 and in the Nautilus versions we tested. radosgw-admin orphans find is impractical to run on large clusters. Has there been any work to address this issue?
Updated by Vicki Good over 3 years ago
I've encountered this bug in Ceph 14 and 15 and it's a pretty big problem for us for the same reason Brian Felton mentioned. It affects our storage utilization reporting.
I have been unable to manually remove these left-behind objects from the pools, but running
radosgw-admin bucket check --fix --check-objects
does clean them up for buckets that are not sharded. That command does not work on sharded buckets. Even if it did work for all buckets, we would have to run it constantly for every bucket, which is not at all practical.
Is it possible to increase the priority of this bug?
Updated by Casey Bodley about 3 years ago
- Related to Bug #44660: Multipart re-uploads cause orphan data added
Updated by Rok Jaklic over 1 year ago
Vicki Good wrote:
I've encountered this bug in Ceph 14 and 15 and it's a pretty big problem for us for the same reason Brian Felton mentioned. It affects our storage utilization reporting.
I have been unable to manually remove these left-behind objects from the pools, but running
radosgw-admin bucket check --fix --check-objects
does clean them up for buckets that are not sharded. That command does not work on sharded buckets. Even if it did work for all buckets, we would have to run it constantly for all buckets--not at all practical.
Is it possible to increase the priority of this bug?
We've encountered this bug in Ceph 16 as well.
It is a pretty big problem for us too, since we do provisioning for users based on size_actual.
Updated by Matt Benjamin over 1 year ago
- Status changed from New to In Progress
- Assignee changed from Orit Wasserman to Matt Benjamin
Updated by Casey Bodley over 1 year ago
- Has duplicate Bug #57942: rgw leaks rados objects when a part is submitted multiple times in a multipart upload added
Updated by Aleksandr Rudenko over 1 year ago
This is a very big problem for us.
We have many large buckets with orphaned parts consuming hundreds of TB of space.
A second problem is that bucket check can't fix this on a sharded bucket.
We have to reshard big buckets to 0 shards and then fix them. But we can't reshard very large buckets (200-500M objects) to 0 shards, because doing so can lead to other problems such as OSD crashes, and the fix consumes a lot of memory.
Updated by Matt Benjamin over 1 year ago
- Status changed from In Progress to Fix Under Review
- Backport set to quincy
Updated by Casey Bodley over 1 year ago
- Related to Bug #58369: When uploading parts in multipart upload, use the "AbortMultipartUpload" interface to end the upload, and there will be data that cannot be cleaned added
Updated by J. Eric Ivancich about 1 year ago
- Related to Bug #58780: scan for orphaned rados objects and index entries in rgw suite added
Updated by Konstantin Shalygin about 1 year ago
- Status changed from Fix Under Review to Pending Backport
- Assignee deleted (Matt Benjamin)
- Target version set to v18.0.0
- % Done changed from 0 to 80
- Backport changed from quincy to quincy pacific reef
- Pull request ID changed from 37260 to 49709
Updated by Backport Bot about 1 year ago
- Copied to Backport #59064: reef: RadosGW Multipart Cleanup Failure added
Updated by Backport Bot about 1 year ago
- Copied to Backport #59065: quincy: RadosGW Multipart Cleanup Failure added
Updated by Backport Bot about 1 year ago
- Copied to Backport #59066: pacific: RadosGW Multipart Cleanup Failure added
Updated by Backport Bot about 1 year ago
- Tags changed from rgw multipart to rgw multipart backport_processed
Updated by Konstantin Shalygin 4 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 80 to 100