Bug #16767: RadosGW Multipart Cleanup Failure
Status: Closed
% Done: 100%
Description
My current setup is a Ceph Hammer cluster running 0.94.6. The rest of the cluster details are irrelevant to this issue.
I've stumbled upon an issue whereby RGW does not clean up properly after a multipart upload finishes (whether aborted or completed). If a client re-uploads a part during a multipart upload, Ceph stores both the original and the new part, but only the latter part is valid when POSTing the CompleteMultipartUpload XML payload. When the multipart upload is completed (through either abort or complete), only the initial parts are removed from the system. The remaining parts are orphaned and are not (easily) removable.
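As a rough model of the behavior described above (an illustrative sketch, not RGW source code; the naming scheme is inferred from the bucket listings later in this report): the first upload of each part is stored under a name embedding the upload id, while a re-upload of the same part number gets a fresh random tag. Completion removes only the id-tagged names, and abort removes only the others, so each path leaks one set of parts.

```python
# Illustrative model of the leak (not RGW source code).  The first upload of
# each part is stored as "<key>.<uploadId>.<n>"; a re-upload of the same part
# number is stored under a fresh random tag instead ("<key>.<randomTag>.<n>").

def after_complete(stored, upload_id):
    """Complete removes only names that embed the upload id,
    so re-uploaded parts survive as orphans."""
    return [n for n in stored if f".{upload_id}." not in n]

def after_abort(stored, upload_id):
    """Abort does the reverse: it removes the randomly tagged re-uploads
    and leaves the id-tagged originals behind."""
    return [n for n in stored if f".{upload_id}." in n]

stored = [
    "mptest.2~whateverid.1",  # first upload of part 1
    "mptest.2~whateverid.2",  # first upload of part 2
    "mptest.randomtagA.1",    # re-upload of part 1 (random tag)
    "mptest.randomtagB.2",    # re-upload of part 2 (random tag)
]

print(after_complete(stored, "2~whateverid"))  # re-uploads leak on complete
print(after_abort(stored, "2~whateverid"))     # originals leak on abort
```

Either way, the leaked objects remain in the data pool but are no longer reachable through the S3 API, which is what the reproduction below demonstrates.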
To reproduce:
First, create four 5MiB files with unique md5 sums:
dd if=/dev/urandom of=/tmp/part1.1 bs=1M count=5
dd if=/dev/urandom of=/tmp/part1.2 bs=1M count=5
dd if=/dev/urandom of=/tmp/part2.1 bs=1M count=5
dd if=/dev/urandom of=/tmp/part2.2 bs=1M count=5
Next, initiate a multipart upload:
s3curl --id test -- -X POST 'http://ceph.cluster/bucket/mptest?uploads'
Upload the parts:
s3curl --id test --put /tmp/part1.1 -- 'http://ceph.cluster/bucket/mptest?partNumber=1&uploadId=2~whateverid'
s3curl --id test --put /tmp/part1.2 -- 'http://ceph.cluster/bucket/mptest?partNumber=2&uploadId=2~whateverid'
s3curl --id test --put /tmp/part2.1 -- 'http://ceph.cluster/bucket/mptest?partNumber=1&uploadId=2~whateverid'
s3curl --id test --put /tmp/part2.2 -- 'http://ceph.cluster/bucket/mptest?partNumber=2&uploadId=2~whateverid'
Now, let's take a look at what RGW says about the bucket:
radosgw-admin bucket list --bucket=bucket | grep -A7 mptest | grep -v owner | grep -v instance
        "name": "mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:15.000000Z",
        "etag": "785dec7eeb68366cca5c19cec86c508b",
--
        "name": "mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:24.000000Z",
        "etag": "b11c15f456f17ba763d0fb900d22376c",
--
        "name": "mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.meta",
        "namespace": "multipart",
        "size": 0,
        "mtime": "2016-07-21 18:43:00.000000Z",
        "etag": "",
--
        "name": "mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:39.000000Z",
        "etag": "2d26aa403bc759305d0ea61d29f17cd0",
--
        "name": "mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:31.000000Z",
        "etag": "a9fdb9efe0722f6e61d5d4ff3dfe0e81",
So we now have a meta object containing the upload id, the first two uploaded parts with the upload id embedded in their names, and the two re-uploaded parts without it.
Now, let's list the available parts associated with the id:
./s3curl --id test -- 'http://ceph.cluster/bucket/mptest?uploadId=2~whateverid' | xmlstarlet fo
<?xml version="1.0" encoding="UTF-8"?>
<ListPartsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Bucket>bucket</Bucket>
  <Key>mptest</Key>
  <UploadId>2~whateverid</UploadId>
  ...
  <Owner>
    <ID>7e1af43925cbef79334d2da290d602d586d04d7dd9aeb970c95ab93c0641c1f4</ID>
    <DisplayName>t3os_test</DisplayName>
  </Owner>
  <Part>
    <LastModified>2016-07-21T18:43:31.000Z</LastModified>
    <PartNumber>1</PartNumber>
    <ETag>a9fdb9efe0722f6e61d5d4ff3dfe0e81</ETag>
    <Size>5242880</Size>
  </Part>
  <Part>
    <LastModified>2016-07-21T18:43:39.000Z</LastModified>
    <PartNumber>2</PartNumber>
    <ETag>2d26aa403bc759305d0ea61d29f17cd0</ETag>
    <Size>5242880</Size>
  </Part>
</ListPartsResult>
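The ListParts response can also be cross-checked programmatically. A minimal sketch using Python's standard XML parser, with sample data trimmed from the response above (element names follow the S3 ListParts response schema):

```python
import xml.etree.ElementTree as ET

# Sample trimmed from the ListPartsResult shown above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<ListPartsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Part>
    <PartNumber>1</PartNumber>
    <ETag>a9fdb9efe0722f6e61d5d4ff3dfe0e81</ETag>
    <Size>5242880</Size>
  </Part>
  <Part>
    <PartNumber>2</PartNumber>
    <ETag>2d26aa403bc759305d0ea61d29f17cd0</ETag>
    <Size>5242880</Size>
  </Part>
</ListPartsResult>"""

NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

def live_parts(xml_text):
    """Map each part number to the ETag that ListParts reports as current."""
    root = ET.fromstring(xml_text)
    return {int(p.find("s3:PartNumber", NS).text): p.find("s3:ETag", NS).text
            for p in root.findall("s3:Part", NS)}

print(live_parts(SAMPLE))
```

Note that ListParts reports exactly one entry per part number (the last upload wins), even though the bucket index above shows both attempts stored.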
We see here that the available parts are the last two uploaded. So far, so good.
Now, let's go ahead and complete this thing.
{build a valid CompleteMultipartUpload document, saved as mp.test}
./s3curl --id test --post mp.test -- 'http://ceph.cluster/bucket/mptest?uploadId=2~whateverid'
Great success! I can now download the object, and it is the valid combination of the last two parts I uploaded.
Now, however, let's take a look at our bucket:
radosgw-admin bucket list --bucket=bucket | grep -A7 mptest | grep -v owner | grep -v instance
        "name": "mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:39.000000Z",
        "etag": "2d26aa403bc759305d0ea61d29f17cd0",
--
        "name": "mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1",
        "namespace": "multipart",
        "size": 5242880,
        "mtime": "2016-07-21 18:43:31.000000Z",
        "etag": "a9fdb9efe0722f6e61d5d4ff3dfe0e81",
--
        "name": "mptest",
        "namespace": "",
        "size": 10485760,
        "mtime": "2016-07-21 18:52:23.000000Z",
        "etag": "39967388ccf40f9570e7f3154549e589-2",
Upon completing the request, only the two parts tagged with the upload id are removed from the system. If I list out the .rgw.buckets pool, I can confirm that all of the parts are still present:
rados -p .rgw.buckets ls | grep mptest
default.7754.6__shadow_mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2_1
default.7754.6_mptest
default.7754.6__multipart_mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2
default.7754.6__multipart_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2
default.7754.6__shadow_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2_1
default.7754.6__multipart_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1
default.7754.6__shadow_mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1_1
default.7754.6__shadow_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1_1
default.7754.6__multipart_mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1
Aborting the upload yields similar results, except in reverse: the files that contain the upload id in the name are retained, while the other files are properly removed.
For small multipart uploads like this, the additional space used is trivial. But in our actual cluster, we have clients uploading considerably larger files, and they are noticing that their bucket utilization is tens of TB larger than the sum of the objects they can list. The files are not removed by garbage collection, and are generally only removable through a very slow process of listing the omap contents of the bucket shards in .rgw.buckets.index and removing the omap keys whose objects cannot be found.
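The cross-referencing step above can be sketched as follows. This is a hypothetical helper, not an existing tool; the object names are taken from the listings in this report, and `live_prefixes` stands in for the part names the completed object actually references:

```python
# Hypothetical helper: flag multipart/shadow rados objects that belong to no
# live part.  live_prefixes are the "<key>.<tag>.<partnum>" names of the parts
# the completed object still references.

def orphan_candidates(rados_names, live_prefixes):
    return sorted(
        n for n in rados_names
        if ("__multipart_" in n or "__shadow_" in n)
        and not any(p in n for p in live_prefixes)
    )

# rados ls output from the reproduction above
rados_names = [
    "default.7754.6__shadow_mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2_1",
    "default.7754.6_mptest",
    "default.7754.6__multipart_mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2",
    "default.7754.6__multipart_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2",
    "default.7754.6__shadow_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.2_1",
    "default.7754.6__multipart_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1",
    "default.7754.6__shadow_mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1_1",
    "default.7754.6__shadow_mptest.2~o2LrKVtYqA_cwHAypOprHT-ANmTeH4S.1_1",
    "default.7754.6__multipart_mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1",
]
# parts the completed object references, per the ListParts response
live_prefixes = [
    "mptest.feXQAxbcmjR1WdN_-b-jj1BKcObJ3Q6.2",
    "mptest.i0q6uZ-do4mYoW7z5z8JDAQitcGJ5No.1",
]

for name in orphan_candidates(rados_names, live_prefixes):
    print(name)  # the 2~-tagged objects left behind after completion
```

In practice the live prefixes would be derived from the object's manifest, and deletion would go through `rados rm` on the data pool; treat this purely as a sketch of the matching logic.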
Updated by Brian Felton over 7 years ago
With apologies for pestering: is anyone looking into this? This bug affects the viability of Ceph as a backing store for a commercial product, since Ceph cannot be relied upon as a canonical source of truth for reporting storage utilization.
Also, what I reported earlier about being able to clean .rgw.buckets.index as a workaround was premature. While I can manually remove entries from .rgw.buckets and clean up orphans in .rgw.buckets.index, I have not found the secret sauce for actually getting the bucket's utilization to reflect the changes.
Updated by William Schroeder over 6 years ago
https://github.com/ceph/ceph/pull/17349 addresses the cleanup aspect of this bug; we wrote a tool that deletes the leaked multipart objects. The code is awaiting review, in case our change has unintended side-effects.
Updated by George Mihaiescu over 6 years ago
It would be great if this tool could be reviewed and backported to Jewel.
Updated by Chris Jones about 4 years ago
This is still an issue in Luminous 12.2.12 and in the Nautilus versions we tested. radosgw-admin orphans find is impractical to run on large clusters. Has there been any work to address this issue?
Updated by Vicki Good over 3 years ago
I've encountered this bug in Ceph 14 and 15 and it's a pretty big problem for us for the same reason Brian Felton mentioned. It affects our storage utilization reporting.
I have been unable to manually remove these left-behind objects from the pools, but running
radosgw-admin bucket check --fix --check-objects
does clean them up for buckets that are not sharded. That command does not work on sharded buckets. Even if it did work for all buckets, we would have to run it constantly for every bucket, which is not at all practical.
Is it possible to increase the priority of this bug?
Updated by Casey Bodley about 3 years ago
- Related to Bug #44660: Multipart re-uploads cause orphan data added
Updated by Rok Jaklic over 1 year ago
Vicki Good wrote:
I've encountered this bug in Ceph 14 and 15 and it's a pretty big problem for us for the same reason Brian Felton mentioned. It affects our storage utilization reporting.
I have been unable to manually remove these left-behind objects from the pools, but running
radosgw-admin bucket check --fix --check-objects
does clean them up for buckets that are not sharded. That command does not work on sharded buckets. Even if it did work for all buckets, we would have to run it constantly for all buckets--not at all practical.
Is it possible to increase the priority of this bug?
We've encountered this bug in Ceph 16 as well.
It is a pretty big problem for us too, since we do provisioning for users based on size_actual.
Updated by Matt Benjamin over 1 year ago
- Status changed from New to In Progress
- Assignee changed from Orit Wasserman to Matt Benjamin
Updated by Casey Bodley over 1 year ago
- Has duplicate Bug #57942: rgw leaks rados objects when a part is submitted multiple times in a multipart upload added
Updated by Aleksandr Rudenko over 1 year ago
This is a very big problem for us.
We have many large buckets with orphaned parts consuming hundreds of TB of space.
A second problem is that bucket check can't fix this on a sharded bucket.
We have to reshard big buckets to 0 shards and then fix them. But we can't reshard very large buckets (200-500M objects) to 0 shards, because doing so can lead to other problems such as OSD crashes, and the fix consumes a lot of memory.
Updated by Matt Benjamin over 1 year ago
- Status changed from In Progress to Fix Under Review
- Backport set to quincy
Updated by Casey Bodley over 1 year ago
- Related to Bug #58369: When uploading parts in multipart upload, use the "AbortMultipartUpload" interface to end the upload, and there will be data that cannot be cleaned added
Updated by J. Eric Ivancich about 1 year ago
- Related to Bug #58780: scan for orphaned rados objects and index entries in rgw suite added
Updated by Konstantin Shalygin about 1 year ago
- Status changed from Fix Under Review to Pending Backport
- Assignee deleted (Matt Benjamin)
- Target version set to v18.0.0
- % Done changed from 0 to 80
- Backport changed from quincy to quincy pacific reef
- Pull request ID changed from 37260 to 49709
Updated by Backport Bot about 1 year ago
- Copied to Backport #59064: reef: RadosGW Multipart Cleanup Failure added
Updated by Backport Bot about 1 year ago
- Copied to Backport #59065: quincy: RadosGW Multipart Cleanup Failure added
Updated by Backport Bot about 1 year ago
- Copied to Backport #59066: pacific: RadosGW Multipart Cleanup Failure added
Updated by Backport Bot about 1 year ago
- Tags changed from rgw multipart to rgw multipart backport_processed
Updated by Konstantin Shalygin 4 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 80 to 100