Bug #15745

RGW :: Subset of uploaded objects via radosgw are unretrievable when using erasure coded pool

Added by Mike Beyer over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Target version: -
Start date: 05/05/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: jewel, infernalis, hammer
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:

Description

We run a number of Ceph clusters in production to provide object storage services. We have stumbled upon an issue where objects of certain sizes are irretrievable. The symptoms are very similar to those addressed by the fix referenced here: https://www.redhat.com/archives/rhsa-announce/2015-November/msg00060.html. We can put objects into the cluster via s3/radosgw, but we cannot retrieve them (the cluster closes the connection without delivering all bytes). Unfortunately, that fix does not apply to us, as we are running Hammer and always have been. We appear to have stumbled on a brand-new edge case.

We have reproduced this issue on the 0.94.3, 0.94.4, and 0.94.6 releases of Hammer.

We have reproduced this issue using three different storage hardware configurations: 5 clusters running 648 6TB OSDs across nine physical nodes, 1 cluster running 30 10GB OSDs across ten VM nodes, and 1 cluster running 288 6TB OSDs across four physical nodes.

We have determined that this issue only occurs when using erasure coding (we've only tested plugin=jerasure technique=reed_sol_van ruleset-failure-domain=host).

Objects of exactly 4.5MiB (4718592 bytes) can be placed into the cluster but not retrieved. At every interval of `rgw object stripe size` thereafter (in our case, 4 MiB), objects are similarly irretrievable. We have tested this from 4.5 to 24.5 MiB, then spot-checked much larger values to confirm that the pattern holds.

Objects whose size falls within a small range of bytes just below each boundary are irretrievable as well. After much testing, we have found the width of this range to be strongly correlated with the k value of our erasure coded pool. The m value has no effect on the window size. We have tested erasure coded pools with k from 2 to 9, and we've observed the following ranges:

k = 2, m = 1 -> No error
k = 3, m = 1 -> 32 bytes (i.e. errors when objects are inclusively between 4718561 - 4718592 bytes)
k = 3, m = 2 -> 32 bytes
k = 4, m = 2 -> No error
k = 4, m = 1 -> No error
k = 5, m = 4 -> 128 bytes
k = 6, m = 3 -> 512 bytes
k = 6, m = 2 -> 512 bytes
k = 7, m = 1 -> 800 bytes
k = 7, m = 2 -> 800 bytes
k = 8, m = 1 -> No error
k = 9, m = 1 -> 800 bytes
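
The pattern reported so far can be written down as a small predicate, which may help anyone building a test sweep. This is only a sketch of the observations in this report, not of RGW's behavior in general: the first boundary (4718592 bytes), the 4 MiB stripe size, and the per-k window widths are taken from the text above, and the names `OBSERVED_WINDOW` and `in_dead_zone` are hypothetical.

```python
MIB = 1024 * 1024
FIRST_BOUNDARY = 4718592   # 4.5 MiB, the first failing size in this report
STRIPE_SIZE = 4 * MIB      # `rgw object stripe size` on the reporter's clusters

# Dead-zone width in bytes observed for each k (m made no difference).
# 0 means no error was observed for that k.
OBSERVED_WINDOW = {2: 0, 3: 32, 4: 0, 5: 128, 6: 512, 7: 800, 8: 0, 9: 800}


def in_dead_zone(size, k, first_boundary=FIRST_BOUNDARY, stripe_size=STRIPE_SIZE):
    """Return True if an object of `size` bytes falls in a reported dead zone."""
    window = OBSERVED_WINDOW[k]
    # Nothing was reported for sizes below the first dead zone.
    if window == 0 or size <= first_boundary - window:
        return False
    # Number of bytes from `size` up to the next boundary at or above it.
    gap = (first_boundary - size) % stripe_size
    return gap < window
```

For k = 3 this marks 4718561 through 4718592 as affected, and the same 32 sizes ending at 8912896 (8.5 MiB), and so on at every further 4 MiB step.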

The "bytes" figure is the width of a 'dead zone': a range of object sizes in which objects can be put into the cluster but not retrieved. The range runs inclusively from (4.5MiB - (width - 1)) bytes to 4.5MiB. Up until k = 9, the error occurs for values of k that are not powers of two, and the "dead zone" width is (k-2)^2 * 32 bytes. My team has not been able to determine why we plateau at 800 bytes at k = 9 (we expected a range of 1568 bytes here).
This issue cannot be reproduced using rados to place objects directly into EC pools. It has only been observed when using RadosGW's S3 interface.
The issue can be reproduced with any S3 client (s3cmd, s3curl, CyberDuck, CloudBerry Backup, and many others have been tested).
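
As a quick arithmetic check, the (k-2)^2 * 32 expression can be evaluated against the window widths listed in the table. The sketch below only restates numbers from this report; the names `OBSERVED`, `predicted_window`, and `compare` are hypothetical.

```python
# Window widths in bytes reported above for the k values that showed errors.
OBSERVED = {3: 32, 5: 128, 6: 512, 7: 800, 9: 800}


def predicted_window(k):
    """Dead-zone width in bytes according to the (k-2)^2 * 32 expression."""
    return (k - 2) ** 2 * 32


def compare(observed=OBSERVED):
    """Map each k to a (predicted, observed) pair of window widths."""
    return {k: (predicted_window(k), seen) for k, seen in observed.items()}


if __name__ == "__main__":
    for k, (predicted, seen) in sorted(compare().items()):
        status = "match" if predicted == seen else "differs"
        print(f"k={k}: predicted {predicted}, observed {seen} ({status})")
```

With the reported values, the expression agrees for k = 3, 6, and 7. It gives 1568 for k = 9, where 800 was observed, and 288 for k = 5, where 128 was observed, so the expression may not be the whole story.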

At this point, we are evaluating the Ceph codebase in an attempt to patch the issue. As this issue affects data retrievability (and possibly integrity), we wanted to bring it to the attention of the community as soon as we could reproduce it. We are hoping both that others can independently verify it and that someone with a more intimate understanding of the codebase could investigate and propose a fix. We have observed this issue in our production clusters, so it is a very high priority for my team.

Furthermore, we believe the objects are corrupted at the point they are placed into the cluster. We have tested copying the .rgw.buckets pool to a non-erasure coded pool, then swapping names, and we have found that objects copied from the EC pool to the non-EC pool are irretrievable once RGW is pointed to the non-EC pool. If we overwrite the object in the non-EC pool with the original, it becomes retrievable again. Upon copying the data back into the EC pool, the data uploaded from the non-EC pool is retrievable but new uploads suffer the same issue.

radosgw.log (119 KB) Mike Beyer, 05/05/2016 05:22 PM


Related issues

Copied to rgw - Backport #15831: jewel: RGW :: Subset of uploaded objects via radosgw are unretrievable when using erasure coded pool Resolved
Copied to rgw - Backport #15832: infernalis: RGW :: Subset of uploaded objects via radosgw are unretrievable when using erasure coded pool Rejected
Copied to rgw - Backport #15833: hammer: RGW :: Subset of uploaded objects via radosgw are unretrievable when using erasure coded pool Resolved

History

#1 Updated by Yehuda Sadeh over 2 years ago

Can you provide rgw log with 'debug rgw = 20' and 'debug ms = 1' of such an object creation? Also useful would be a log of trying to read the object.

#2 Updated by Mike Beyer over 2 years ago

Logs for a put/get cycle here. I tried to pare it down some.

http://pastebin.com/YP5ZLMZV

#3 Updated by Mike Beyer over 2 years ago

Pastebin truncated the file; I've attached the complete version, which shows the error from the GET.

#4 Updated by Yehuda Sadeh over 2 years ago

The log doesn't have 'debug ms = 1', can you add that and reproduce?

#5 Updated by Mike Beyer over 2 years ago

Will upload again.

New information: When I upload an object via rgw to an erasure coded pool and pull the pieces out of rados directly, they can be put back together to match the initial object's md5sum.

Repro steps below.

dd if=/dev/urandom of=~/Downloads/test/4718592 bs=1 count=4718592
md5 4718592
MD5 (4718592) = 22a8c42dab2940ced1ea57f15673be87

put into bucket via s3

rados -p .rgw.buckets get default.24161.19_4718592 /tmp/default.24161.19_4718592
rados -p .rgw.buckets get default.24161.19__shadow_.P1mbQc2M_4XREUkxJmnDxJrYRGyZUbb_1 /tmp/default.24161.19__shadow_.P1mbQc2M_4XREUkxJmnDxJrYRGyZUbb_1

cat /tmp/default.24161.19_4718592 >> file.out
cat /tmp/default.24161.19__shadow_.P1mbQc2M_4XREUkxJmnDxJrYRGyZUbb_1 >> file.out

md5sum file.out
22a8c42dab2940ced1ea57f15673be87 file.out
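
The reassembly check above can also be scripted, which makes it easier to repeat for many object sizes. The following is a minimal sketch that assumes the head object and its shadow object(s) have already been fetched to local files with `rados get`; the function name `md5_of_parts` is hypothetical, and the parts must be passed in the order they are concatenated above (head object first).

```python
import hashlib


def md5_of_parts(paths, chunk_size=1 << 20):
    """Return the hex MD5 of the given files read back to back, in order."""
    digest = hashlib.md5()
    for path in paths:
        with open(path, "rb") as part:
            # Read in chunks so large objects do not have to fit in memory.
            for chunk in iter(lambda: part.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Calling it with the two paths under /tmp from the steps above and comparing the result with the md5 of the original file gives the same check as the `cat` and `md5sum` commands, without creating file.out.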

#6 Updated by Mike Beyer over 2 years ago

Logs with debug ms = 1 & debug rgw = 20

#7 Updated by Mike Beyer over 2 years ago

Mike Beyer wrote:

Logs with debug ms = 1 & debug rgw = 20

#8 Updated by Mike Beyer over 2 years ago

Logs with debug ms = 1 & debug rgw = 20

#9 Updated by Yehuda Sadeh over 2 years ago

There's still no 'debug ms = 1' output in these logs. Did you restart the radosgw process after changing ceph.conf? Or did you maybe change a different ceph.conf?

#10 Updated by Mike Beyer over 2 years ago

Sorry about the comment spam; the logs were failing to upload.

Dropbox link below:

https://www.dropbox.com/s/80k93ezg3mwt6h8/radosgw2.log?dl=0

#11 Updated by Yehuda Sadeh over 2 years ago

  • Backport set to jewel, infernalis, hammer

#12 Updated by Kefu Chai over 2 years ago

  • Project changed from Ceph to rgw

#13 Updated by Yehuda Sadeh over 2 years ago

  • Status changed from New to Pending Backport

#14 Updated by Nathan Cutler over 2 years ago

  • Copied to Backport #15831: jewel: RGW :: Subset of uploaded objects via radosgw are unretrievable when using erasure coded pool added

#15 Updated by Nathan Cutler over 2 years ago

  • Copied to Backport #15832: infernalis: RGW :: Subset of uploaded objects via radosgw are unretrievable when using erasure coded pool added

#16 Updated by Nathan Cutler over 2 years ago

  • Copied to Backport #15833: hammer: RGW :: Subset of uploaded objects via radosgw are unretrievable when using erasure coded pool added

#18 Updated by Loic Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved
