Bug #11749
rgw: rados objects wrongly deleted
% Done: 60%
Description
While testing the RADOS Gateway (Giant, v0.87), I found two bugs dangerous enough to cause data corruption. The first one was fixed by Yehuda in https://github.com/ceph/ceph/pull/4661; under that condition, the first stripe of a part is lost. I then tested with the upstream code and found that the second bug still exists. Here are the details of the second bug.
This bug only occurs when multipart upload is used. When an object is uploaded in multiple parts, a part that has not been completely uploaded is destroyed by calling dispose_processor. This can occasionally cause a race condition: the first upload may delete objects belonging to the second upload, which ultimately causes data corruption.
I set the multipart size to 64 MB and tested with s3cmd; I also reproduced the problem with Cyberduck. You can reproduce it with the following script:
#!/bin/bash
# The bucket BREAKDOWN must already exist (e.g. created with "s3cmd mb s3://BREAKDOWN").
# The multipart chunk size was set to 64 MB (e.g. s3cmd --multipart-chunk-size-mb=64),
# so the 65 MB test file is uploaded as two parts.
FILENAME=BREAKDOWN
dd if=/dev/zero of=$FILENAME bs=65M count=1
originalMD5=`md5sum ./$FILENAME | awk '{print $1}'`
# Start a multipart upload in the background and kill it before it finishes.
s3cmd put $FILENAME s3://BREAKDOWN/$FILENAME &
sleep 2
kill -9 `ps aux | grep "s3cmd put" | grep -v grep | awk '{print $2}'`
# Upload the same object again, then download it and compare checksums.
s3cmd put $FILENAME s3://BREAKDOWN/$FILENAME
s3cmd get s3://BREAKDOWN/$FILENAME downloadedfile --force
downloadMD5=`md5sum ./downloadedfile | awk '{print $1}'`
if [[ "$originalMD5" != "$downloadMD5" ]]; then
    echo "bad MD5"
    exit 1
fi
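The race is only hit occasionally, so one way to drive the reproducer is to run it in a loop until the checksum mismatch appears. A minimal sketch, assuming the script above is saved as reproduce.sh (the file name is only illustrative):
# Run the reproducer repeatedly and stop at the first corrupted download.
for i in $(seq 1 100); do
    echo "attempt $i"
    ./reproduce.sh || { echo "corruption reproduced on attempt $i"; break; }
done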
After running the script enough times you will hit the "bad MD5" case. When you then list the rados objects with "rados ls -p .rgw.buckets | grep BREAKDOWN", you will find that some rados objects have already been deleted. Here is a sample:
[root@cephdev141 src]$./rados ls -p .rgw.buckets | grep 20150523164305 | sort
default.54105.4_20150523164305
default.54105.4__multipart_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1
default.54105.4__multipart_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.2
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_1
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_10
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_11
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_12
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_13
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_14
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_15
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_2
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_5
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_6
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_7
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_8
default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_9
In this sample, default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_3 and default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1_4 were wrongly deleted.
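To pinpoint which stripes of a part were lost, one can compare the shadow stripe numbers still present in the pool against a contiguous numbering, following the naming convention shown above (bucket marker, object name, upload id, part number, stripe number). A rough sketch, assuming stripe numbers within a part start at 1 and are normally contiguous; the script name and the PREFIX argument are only illustrative:
#!/bin/bash
# Usage: ./find_missing_stripes.sh "default.54105.4__shadow_20150523164305.2~JeapILhRaTUmiiLjBFOdYRiaBgOIUVo.1"
PREFIX=$1
# Stripe numbers that still exist in the data pool (the number follows the last underscore).
stripes=$(rados ls -p .rgw.buckets | grep "${PREFIX}_" | sed 's/.*_//' | sort -n)
max=$(echo "$stripes" | tail -n 1)
# Report every stripe number up to the highest one seen that is no longer present.
# Note: stripes deleted from the tail end of the part cannot be detected this way.
for n in $(seq 1 "$max"); do
    echo "$stripes" | grep -qx "$n" || echo "missing stripe: ${PREFIX}_${n}"
done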
Associated revisions
rgw: fix data corruption caused by a race condition
We should delete the object in the multipart namespace last, to prevent a previous upload
from wrongly deleting objects that belong to the following upload.
Fixes: #11749
Signed-off-by: wuxingyi <wuxingyi@letv.com>
rgw: fix data corruption caused by a race condition
We should delete the object in the multipart namespace last, to prevent a previous upload
from wrongly deleting objects that belong to the following upload.
Fixes: #11749
Signed-off-by: wuxingyi <wuxingyi@letv.com>
(cherry picked from commit ac1e729a75b5d995028bbc223bcf5ecce0d112cc)
History
#1 Updated by Kefu Chai over 8 years ago
- Status changed from New to Fix Under Review
- Source changed from Development to Community (dev)
#2 Updated by Yehuda Sadeh over 8 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to hammer
Merged in commit 69989ffa3cabe209404504edd24b1d2a53e33e15.
Backporting to firefly will also require picking up the fix for #10311.
#3 Updated by Abhishek Lekshmanan over 8 years ago
Hammer backport: https://github.com/ceph/ceph/pull/5117
#4 Updated by Gleb Borisov over 8 years ago
Is there any way to find objects affected by this issue in a bucket?
#5 Updated by Loïc Dachary over 8 years ago
- Status changed from Pending Backport to Resolved