Project

General

Profile

Bug #15886

Multipart Object Corruption

Added by Brian Felton over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Target version:
-
Start date:
05/13/2016
Due date:
% Done:

0%

Source:
other
Tags:
radosgw multipart rgw
Backport:
hammer,jewel
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
rgw
Pull request ID:

Description

This is similar to the issue reported in http://tracker.ceph.com/issues/13764.

We run several large Ceph clusters. Possibly relevant details:

- 648 6TB OSDs across nine nodes (all spinners)
- 7,2 erasure coded .rgw.buckets pool, 4 MiB stripe size
- Cluster is running Hammer 0.94.3

We have recently encountered several occurrences of corrupted multipart objects. In all cases, the corrupted files are partially retrievable. In all cases, the corruption takes the form of a zero-byte _multipart rados object, and in all cases, the tail _shadow files are complete. In other words, we can retrieve all parts up until the 0-byte multipart object, and the objects are all missing exactly 4 MiB of data. We have observed this corruption in objects as small as 18MB and as large as 19 GB. In all cases, we have only found a single multipart object that has been zeroed out out of the entire object. In all cases, it has been one of the first few parts of the object that are affected. If we stat all of the parts, we'll see something akin to the following:

.rgw.buckets/default.184914.176__multipart_object.prefix.1 mtime 2016-05-05 08:20:27.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.1_1 mtime 2016-05-05 08:20:25.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.1_2 mtime 2016-05-05 08:20:27.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.1_3 mtime 2016-05-05 08:20:27.000000, size 3145728
.rgw.buckets/default.184914.176__multipart_object.prefix.2 mtime 2016-05-05 08:20:28.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.2_1 mtime 2016-05-05 08:20:28.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.2_2 mtime 2016-05-05 08:20:28.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.2_3 mtime 2016-05-05 08:20:28.000000, size 3145728
.rgw.buckets/default.184914.176__multipart_object.prefix.3 mtime 2016-05-05 08:26:41.000000, size 0
.rgw.buckets/default.184914.176__shadow_object.prefix.3_1 mtime 2016-05-05 08:20:29.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.3_2 mtime 2016-05-05 08:20:29.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.3_3 mtime 2016-05-05 08:26:40.000000, size 3145728
.rgw.buckets/default.184914.176__multipart_object.prefix.4 mtime 2016-05-05 08:25:35.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.4_1 mtime 2016-05-05 08:25:35.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.4_2 mtime 2016-05-05 08:25:35.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.4_3 mtime 2016-05-05 08:25:35.000000, size 3145728
.rgw.buckets/default.184914.176__multipart_object.prefix.5 mtime 2016-05-05 08:25:37.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.5_1 mtime 2016-05-05 08:25:36.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.5_2 mtime 2016-05-05 08:25:37.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.5_3 mtime 2016-05-05 08:25:37.000000, size 3145728
.rgw.buckets/default.184914.176__multipart_object.prefix.6 mtime 2016-05-05 08:25:39.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.6_1 mtime 2016-05-05 08:25:38.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.6_2 mtime 2016-05-05 08:25:38.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.6_3 mtime 2016-05-05 08:25:38.000000, size 3145728
etc.

The corruption has occurred across multiple S3 clients and chunk sizes.

All instances of corruption have occurred long ago enough that we do not have the corresponding rgw logs. We do feed logs from our system into an ELK stack, so I have the ability to revisit some data, but the rgw logs are not part of the ELK feed. Using these data, I can correlate the times of corruption to instances when the cluster was reporting slow requests (longest slow requests were 120s). None of the occurrences correspond to any network or software stack failures. Typically, we are processing multiple object PUTs per second, and we've found no corruption in objects written to the cluster at the same time (where 'same time' means within seconds/milliseconds).

We haven't been able to identify any other patterns or clues to reproduce the issue, although we are actively working to do so. If and when we can reproduce, we will attach details on reproduction and include rgw logs with debugging enabled. If there are any additional details we can provide or reproduction steps we can attempt, please ask.


Related issues

Related to rgw - Bug #13764: Radosgw incomplete files New 11/10/2015
Copied to rgw - Backport #15979: hammer: Multipart Object Corruption Resolved
Copied to rgw - Backport #15980: jewel: Multipart Object Corruption Resolved

History

#1 Updated by Yehuda Sadeh over 2 years ago

One question that I have, what was the original prefix in the above listing? Was it the same for all objects?

Note the mtime difference, which is interesting:

.rgw.buckets/default.184914.176__shadow_object.prefix.2_3 mtime 2016-05-05 08:20:28.000000, size 3145728

.rgw.buckets/default.184914.176__multipart_object.prefix.3 mtime 2016-05-05 08:26:41.000000, size 0
.rgw.buckets/default.184914.176__shadow_object.prefix.3_1 mtime 2016-05-05 08:20:29.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.3_2 mtime 2016-05-05 08:20:29.000000, size 4194304
.rgw.buckets/default.184914.176__shadow_object.prefix.3_3 mtime 2016-05-05 08:26:40.000000, size 3145728

.rgw.buckets/default.184914.176__multipart_object.prefix.4 mtime 2016-05-05 08:25:35.000000, size 4194304

it would point at a retry scenario, where an upload took too long so the client library retried sending the part. One thing also to note is that the first object in the part (the one that is prefixed by multipart) has mtime that is later than the last object in that part. This would make sense for regular objects, but shouldn't happen for multipart objects, so it might be the key to understanding the issue.

#2 Updated by Brian Felton over 2 years ago

Yehuda,

There was no timeout or retry on the client's end -- I have the complete set of haproxy logs to verify that. We spotted the same gap between the completion of the third part and beginning of the PUT on the fourth, but there were no errors returned and no retry attempts from the client. But yes, we also noticed and found the mtime difference very interesting.

As for the prefix, it was the same for this object, but it contains some data that I needed to scrub. Here's another that requires less sanitation:

.rgw.buckets/default.574155.95__multipart_backup-2016.04.18/4/__2c.2~PvbS5J4E9kRkCID7kBbofdwYyD0yMol.1 mtime 2016-05-06 08:34:05.000000, size 4194304
.rgw.buckets/default.574155.95__shadow_backup-2016.04.18/4/__2c.2~PvbS5J4E9kRkCID7kBbofdwYyD0yMol.1_1 mtime 2016-05-06 08:34:04.000000, size 1048576
.rgw.buckets/default.574155.95__multipart_backup-2016.04.18/4/__2c.2~PvbS5J4E9kRkCID7kBbofdwYyD0yMol.2 mtime 2016-05-06 08:35:24.000000, size 0
.rgw.buckets/default.574155.95__shadow_backup-2016.04.18/4/__2c.2~PvbS5J4E9kRkCID7kBbofdwYyD0yMol.2_1 mtime 2016-05-06 08:34:13.000000, size 1048576
.rgw.buckets/default.574155.95__multipart_backup-2016.04.18/4/__2c.2~PvbS5J4E9kRkCID7kBbofdwYyD0yMol.3 mtime 2016-05-06 08:35:22.000000, size 4194304
.rgw.buckets/default.574155.95__shadow_backup-2016.04.18/4/__2c.2~PvbS5J4E9kRkCID7kBbofdwYyD0yMol.3_1 mtime 2016-05-06 08:35:22.000000, size 1048576
.rgw.buckets/default.574155.95__multipart_backup-2016.04.18/4/__2c.2~PvbS5J4E9kRkCID7kBbofdwYyD0yMol.4 mtime 2016-05-06 08:35:30.000000, size 2279885

This showed the same pattern of behavior in the haproxy logs (read: no errors, no retries), albeit with a far smaller gap between parts 2 and 3.

As for the mtime being more recent on the multipart object, this is common across most all multipart objects in our system. I haven't dug through the code; I assumed that the mtime adjustment was due to rgw touching/updating an xattr on the object.

#3 Updated by Yehuda Sadeh over 2 years ago

The mtime of the first object in the part is later because we finish by writing its metadata, so the first object of every part is expected to be touched twice, once when we write its data, and once when we finish writing the part. My guess is that it was written correctly, something removed it, then we finished by writing the metadata.

#4 Updated by Yehuda Sadeh over 2 years ago

btw, could it be that haproxy itself is retrying to send to the backend due to timeout?

#5 Updated by Yehuda Sadeh over 2 years ago

I did manage to spot a bug in the code that might cause the issue, but it requires that radosgw receives multiple requests with the same uploadId. If it's possible that haproxy replays the requests somehow, then it might be it.

The issue I'm seeing is as follows:
There's a code that guards that we don't have multiple writers to the same rados object, by using the multipart object (for each part) as a guard. However, there's a bug there that when we discover that there's already such an object written, we end up removing it. Usually it doesn't matter because we then continue to write the entire part from the beginning (with a different object names for the part). However, there's the case where the original write actually ends up finishing, and if it finishes after the second writer then it's really going to miss its first part (as we can see here).

#6 Updated by Nathan Cutler over 2 years ago

  • Related to Bug #13764: Radosgw incomplete files added

#7 Updated by Brian Felton over 2 years ago

Our haproxy installation is not set up with any retry logic, so retries would have to be initiated by the client. According to the logs for the 'backup-2016.04.18/4/__2c' object, the client did resend part 2. Logs follow:

May  6 08:34:01 haprox05 haproxy[9600]: 10.100.166.83:39578 [06/May/2016:08:33:59.136] s3proxy~ s3proxy_backend/s3proxy117 127/0/0/2015/2142 200 398 - - ---- 237/179/150/38/0 0/0 {bucket.ceph.io} "POST /backup-2016.04.18/4/__2c?uploads HTTP/1.1" 
May  6 08:34:10 haprox05 haproxy[9600]: 10.100.78.38:41507 [06/May/2016:08:34:01.401] os_cluster os_cluster_backend/rgw04 0/0/0/9036/9036 200 189 - - ---- 235/53/51/12/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=1 HTTP/1.1" 
May  6 08:34:10 haprox05 haproxy[9600]: 10.100.166.83:39578 [06/May/2016:08:34:01.277] s3proxy~ s3proxy_backend/s3proxy120 2/0/0/9159/9161 200 189 - - ---- 235/182/145/37/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=1 HTTP/1.1" 
May  6 08:35:19 haprox05 haproxy[9600]: 10.100.78.38:58354 [06/May/2016:08:35:04.272] os_cluster os_cluster_backend/rgw05 0/0/0/14909/14909 200 189 - - ---- 143/34/33/7/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=2 HTTP/1.1" 
May  6 08:35:19 haprox05 haproxy[9600]: 10.100.166.83:40309 [06/May/2016:08:35:04.103] s3proxy~ s3proxy_backend/s3proxy120 51/0/0/15028/15079 200 189 - - ---- 143/109/95/24/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=2 HTTP/1.1" 
May  6 08:35:24 haprox05 haproxy[9600]: 10.100.78.35:45692 [06/May/2016:08:35:19.425] os_cluster os_cluster_backend/rgw04 0/0/0/5075/5075 200 189 - - ---- 143/33/33/9/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=3 HTTP/1.1" 
May  6 08:35:24 haprox05 haproxy[9600]: 10.100.166.83:40309 [06/May/2016:08:35:19.182] s3proxy~ s3proxy_backend/s3proxy118 123/0/0/5196/5319 200 189 - - ---- 143/110/92/23/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=3 HTTP/1.1" 
May  6 08:35:25 haprox05 haproxy[9600]: 10.100.78.34:57456 [06/May/2016:08:34:10.687] os_cluster os_cluster_backend/rgw03 0/0/0/74657/74657 200 189 - - ---- 151/38/37/8/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=2 HTTP/1.1" 
May  6 08:35:25 haprox05 haproxy[9600]: 10.100.166.83:39578 [06/May/2016:08:34:10.438] s3proxy~ s3proxy_backend/s3proxy117 130/0/0/74776/74906 200 189 - - ---- 151/113/90/22/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=2 HTTP/1.1" 
May  6 08:35:31 haprox05 haproxy[9600]: 10.100.78.35:36423 [06/May/2016:08:35:24.675] os_cluster os_cluster_backend/rgw06 0/0/0/6411/6411 200 189 - - ---- 145/38/32/7/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=4 HTTP/1.1" 
May  6 08:35:31 haprox05 haproxy[9600]: 10.100.166.83:40309 [06/May/2016:08:35:24.501] s3proxy~ s3proxy_backend/s3proxy118 54/0/0/6532/6586 200 189 - - ---- 145/107/87/21/0 0/0 {bucket.ceph.io} "PUT /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol&partNumber=4 HTTP/1.1" 
May  6 08:35:32 haprox05 haproxy[9600]: 10.100.166.83:40309 [06/May/2016:08:35:31.087] s3proxy~ s3proxy_backend/s3proxy118 2/0/0/1768/1770 200 436 - - ---- 143/105/85/21/0 0/0 {bucket.ceph.io} "POST /backup-2016.04.18/4/__2c?uploadId=2%7EPvbS5J4E9kRkCID7kBbofdwYyD0yMol HTTP/1.1" 

I am reviewing our logs for the other failed objects to see if I can find a similar pattern with re-attempts.

#8 Updated by Brian Felton over 2 years ago

I've another observation to share. In trying to recreate this issue, I've been trying various permutations of uploading and re-uploading parts of a multipart object. In none of the cases have I been able to reproduce the issue. In observing the object in the cluster, though, I've noticed something interesting that doesn't appear in any of the broken objects.

This is the output of radosgw-admin object stat for my test object (part 3 was reuploaded):

radosgw-admin object stat --bucket=hailmary --object=breaknow
{
    "name": "breaknow",
    "size": 16252928,
    "etag": "5c1dae2acd8558c28c05767b9faf7b62-4\u0000",
    "tag": "default.337946.2089572\u0000",
    "manifest": {
        "objs": [],  
        "obj_size": 16252928,
        "explicit_objs": "false",
        "head_obj": {
            "bucket": {
                "name": "hailmary",
                "pool": ".rgw.buckets",
                "data_extra_pool": ".rgw.buckets.extra",
                "index_pool": ".rgw.buckets.index",
                "marker": "default.25591.4",
                "bucket_id": "default.25591.4" 
            },
            "key": "",
            "ns": "",
            "object": "breaknow",
            "instance": "" 
        },
        "head_size": 0,
        "max_head_size": 0,
        "prefix": "breaknow.sD5QEatY1rSlc3IwzWVfLhrp64i7ZW8",
        "tail_bucket": {
            "name": "hailmary",
            "pool": ".rgw.buckets",
            "data_extra_pool": ".rgw.buckets.extra",
            "index_pool": ".rgw.buckets.index",
            "marker": "default.25591.4",
            "bucket_id": "default.25591.4" 
        },
        "rules": [   
            {
                "key": 0,
                "val": {
                    "start_part_num": 1,
                    "start_ofs": 0,
                    "part_size": 5242880,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            },
            {
                "key": 5242880,
                "val": {
                    "start_part_num": 2,
                    "start_ofs": 5242880,
                    "part_size": 5242880,
                    "stripe_max_size": 4194304,
                    "override_prefix": "breaknow.1000" 
                }
            },
            {
                "key": 10485760,
                "val": {
                    "start_part_num": 3,
                    "start_ofs": 10485760,
                    "part_size": 5242880,
                    "stripe_max_size": 4194304,
                    "override_prefix": "breaknow.BJ3cPMn4fx2Cp4EMUq_Fp-wRRB11fKB" 
                }
            },
            {
                "key": 15728640,
                "val": {
                    "start_part_num": 4,
                    "start_ofs": 15728640,
                    "part_size": 524288,
                    "stripe_max_size": 4194304,
                    "override_prefix": "breaknow.1000" 
                }
            }
        ]
    },
    "attrs": {}
}

Here is the output for a broken object:

{   
    "name": "main.log",
    "size": 47690292,
    "etag": "4e8d79c801c83c8a583440dfcad8dbba-10\u0000",
    "tag": "default.819034.1975051\u0000",
    "manifest": {
        "objs": [],
        "obj_size": 47690292,
        "explicit_objs": "false",
        "head_obj": {
            "bucket": {
                "name": "troubled-bucket",
                "pool": ".rgw.buckets",
                "data_extra_pool": ".rgw.buckets.extra",
                "index_pool": ".rgw.buckets.index",
                "marker": "default.574155.61",
                "bucket_id": "default.574155.61" 
            },
            "key": "",
            "ns": "",
            "object": "main.log",
            "instance": "" 
        },
        "head_size": 0,
        "max_head_size": 0,
        "prefix": "main.log.2~w3kwvXnX1eVvRNKZqkSFHJBF_bps_Nx",
        "tail_bucket": {
            "name": "troubled-bucket",
            "pool": ".rgw.buckets",
            "data_extra_pool": ".rgw.buckets.extra",
            "index_pool": ".rgw.buckets.index",
            "marker": "default.574155.61",
            "bucket_id": "default.574155.61" 
        },
        "rules": [
            {
                "key": 0,
                "val": {
                    "start_part_num": 1,
                    "start_ofs": 0,
                    "part_size": 5242880,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            },
            {   
                "key": 47185920,
                "val": {
                    "start_part_num": 10,
                    "start_ofs": 47185920,
                    "part_size": 504372,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            }
        ]
    },
    "attrs": {
        "user.rgw.content_type": "application\/octet-stream\u0000" 
    }
}

So what's interesting to me is that reuploading a part in my tests causes two things -- 1) the prefix for the object is recalculated; 2) a different override prefix is added to the reuploaded part, while the non-re-uploaded parts retain the object's original prefix. The effect this has on the object in the cluster is that the objects are "scattered" around the cluster -- i.e. one has to search for the pieces by following the various override_prefix values. In all of the broken objects in our cluster, no override prefix has ever been calculated; all parts (including those of zero size) are stored under the prefix (i.e. <bucket_id>__multipart_<object_name>.<multipart_prefix>.<multipart_id> and <bucket_id>__shadow_<object_name>.<multipart_prefix>.<multipart_id>_<part_id>.

#9 Updated by Yehuda Sadeh over 2 years ago

Right. This reaffirms my understanding of the bug. The problem happens if there are two uploads of the same part, and that the first upload completes successfully (as observed by the client) after the second upload. So basically it's like that:

upload 1 starts
upload 2 starts
upload 2 finishes
upload 1 finishes

#10 Updated by Yehuda Sadeh over 2 years ago

  • Priority changed from High to Urgent

#12 Updated by Yehuda Sadeh over 2 years ago

Also, a test that reproduces the issue:
https://github.com/ceph/s3-tests/pull/110

#13 Updated by Abhishek Varshney over 2 years ago

Is this issue seen only on EC pools, or does it happen on replicated pools too?

#14 Updated by Brian Felton over 2 years ago

Abhishek Varshney,

We have only seen the issue on EC pools, but it does not appear that the changes submitted by Yehuda are EC-specific.

Yehuda,

We will be working on building and testing the patch today. We still don't have a reliable Real World(tm) test case, unfortunately, but we continue hunting.

#15 Updated by Yehuda Sadeh over 2 years ago

  • Status changed from New to Need Review

#16 Updated by Nathan Cutler over 2 years ago

  • Backport set to hammer,jewel

#17 Updated by Brian Felton over 2 years ago

Yehuda,

We've attempted to cherry-pick and build against your patch, but it doesn't appear that it is compatible with Hammer in its current state. Is there an ETA on a backport to Hammer?

Your assistance is greatly appreciated. Many thanks for the work so far.

#19 Updated by Yehuda Sadeh over 2 years ago

  • Status changed from Need Review to Pending Backport

#22 Updated by Nathan Cutler over 2 years ago

#23 Updated by Nathan Cutler over 2 years ago

#24 Updated by Loic Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved

#25 Updated by George Mihaiescu over 2 years ago

Hi Brian,

Can you please confirm the patch worked for you on Hammer?

Loic, which version of Hammer contains this patch?

Thank you,
George

#26 Updated by Nathan Cutler over 2 years ago

@George: The hammer patch was only merged 11 days ago. It will appear in the next hammer point release 0.94.8.

Tickets are marked resolved when the PR is merged, i.e. always before actual release.

#27 Updated by Brian Felton over 2 years ago

Apologies for the lack of responses here, but I've been out on PTO.

@Nathan Cutlet, @Yehuda Sadeh:

This patch is NOT working for me. I have been able to successfully deploy and install in some test environments, and while the patch allows for the provided test case to pass, it fails when actually trying to retrieve multipart files. To reproduce (and I assume you can create a valid s3cfg and point things to a particular rgw):

<patch an rgw>
<point your test host to the patched rgw>
dd if=/dev/unrandom of=/tmp/8MiB bs=1M count=8
s3cmd mb s3://mptest-patched
s3cmd put /tmp/8MiB s3://mptest-patched/8MiB --multipart-chunk-size-mb=5
s3cmd get s3://mptest-patched/8MiB /tmp/EightMiB

In my tests (across multiple clusters now), the object retrieved will be 4MiB only.

# ll /tmp/*MiB
-rw-r--r-- 1 root root 8388608 Jun  7 13:30 /tmp/8MiB
-rw-r--r-- 1 root root 4194304 Jun  7 13:32 /tmp/EightMiB

The metadata from the object shows the correct size:

s3cmd ls s3://mptest-patched
2016-06-07 13:32   8388608   s3://mptest-patched/8MiB

If I look in the cluster, all of the data appear to be present:

# rados -p .rgw.buckets stat "default.25941.2__multipart_8MiB.2~7HMRML_6La66Dn7AqEa9WV3wLHCny5Z.1" 
.rgw.buckets/default.25941.2__multipart_8MiB.2~7HMRML_6La66Dn7AqEa9WV3wLHCny5Z.1 mtime 2016-06-07 13:32:22.000000, size 4194304
# rados -p .rgw.buckets stat "default.25941.2__shadow_8MiB.2~7HMRML_6La66Dn7AqEa9WV3wLHCny5Z.1_1" 
.rgw.buckets/default.25941.2__shadow_8MiB.2~7HMRML_6La66Dn7AqEa9WV3wLHCny5Z.1_1 mtime 2016-06-07 13:32:22.000000, size 1048576
# rados -p .rgw.buckets stat "default.25941.2__multipart_8MiB.2~7HMRML_6La66Dn7AqEa9WV3wLHCny5Z.2" 
.rgw.buckets/default.25941.2__multipart_8MiB.2~7HMRML_6La66Dn7AqEa9WV3wLHCny5Z.2 mtime 2016-06-07 13:32:23.000000, size 3145728

Can someone else perform this testing and let me know if this is a bug or if I've done something very, very wrong here? My RGW is built from 0.94.6 with the addition of the patches for this issue and http://tracker.ceph.com/issues/15745. I suspect building with 0.94.6 + this patch will be sufficient. I had missed this earlier because I wasn't verifying the file size and because I could see the parts were in the cluster.

#28 Updated by Brian Felton over 2 years ago

To add some clarity to the above, the issue seems to be on insertion. If I use an unpatched rgw to retrieve from the bucket populated by the patched rgw, the resulting files will also be truncated to 4MiB. If I use a patched rgw to retrieve data from a bucket populated by an unpatched rgw, the retrieved object has the appropriate size.

Furthermore, if I attempt to put files that are not multipart, the put fails completely. Similar tests against an unpatched rgw work as expected.

#29 Updated by George Mihaiescu over 2 years ago

Thanks for the update Brian and I hope this path IS NOT ADDED to the next Hammer release without more testing, as it seems to break things more than it fixes them.

#30 Updated by Brian Felton over 2 years ago

One final update. Playing around with radosgw-admin object stat, I think I see the problem.

Here is the output from an 8MiB file put with the patched rgw:

        "rules": [
            {
                "key": 0,
                "val": {
                    "start_part_num": 1,
                    "start_ofs": 0,
                    "part_size": 4194304,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            },
            {
                "key": 4194304,
                "val": {
                    "start_part_num": 2,
                    "start_ofs": 4194304,
                    "part_size": 0,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            }
        ]

Here is the same data from an unpatched rgw:

        "rules": [
            {
                "key": 0,
                "val": {
                    "start_part_num": 1,
                    "start_ofs": 0,
                    "part_size": 5242880,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            },
            {
                "key": 5242880,
                "val": {
                    "start_part_num": 2,
                    "start_ofs": 5242880,
                    "part_size": 3145728,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            }
        ]

So now it makes perfect sense as to why it's only retrieving 4MiB of data.

If I do the same test with 16MiB of data with a 5MiB chunk size, I can retrieve the first 12MiB. Basically, the patch is forcing the part size to 4MiB, which is going to do some bad things.

Here is the object stat output for the 16MiB file when put with the patched rgw:

        "rules": [
            {   
                "key": 0,
                "val": {
                    "start_part_num": 1,
                    "start_ofs": 0,
                    "part_size": 4194304,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            },
            {   
                "key": 12582912,
                "val": {
                    "start_part_num": 4,
                    "start_ofs": 12582912,
                    "part_size": 0,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            }
        ]

And here's the same output from an unpatched rgw:

        "rules": [
            {   
                "key": 0,
                "val": {
                    "start_part_num": 1,
                    "start_ofs": 0,
                    "part_size": 5242880,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            },
            {   
                "key": 15728640,
                "val": {
                    "start_part_num": 4,
                    "start_ofs": 15728640,
                    "part_size": 1048576,
                    "stripe_max_size": 4194304,
                    "override_prefix": "" 
                }
            }
        ]

#31 Updated by Yehuda Sadeh over 2 years ago

Brian, I was trying to reproduce the issue you described, and couldn't. Just to confirm, you're running 0.94.6 + the following commits:
fec62882726beed8e7455bba2be6f72cdbba0542
54910f843a0c78ab7ae1e2c225e8ce39624e23c5

and have ec (7 + 2) backed .rgw.buckets pool. You also mentioned 4MB stripe size, did you mean the 4MB chunk size? How did you configure that?

#32 Updated by Brian Felton over 2 years ago

Yehuda,

I grabbed commits b7a7d41839558b1a8786a66e27d9d80e606f61c7 (issue 15745) and 8c469997e4f03e316487770b8e22d27f7e4930ec (this issue); the commits were from the hammer PRs listed in the corresponding issues. Both of my test clusters are using EC-backed .rgw.buckets pools; one is 7+2, the other is 3+1. Both clusters are running 0.94.6-1 with those patches (except the "unpatched" RGW, which is 0.94.6-1 with the 15745 patch).

RGW is using a 4MiB stripe size (the default setting, I believe). The chunk size is the multipart chunk size, which is configurable via your s3 client. In s3cmd, this is via the --multipart-chunk-size-mb flag. In my tests, I am setting the s3 chunk size to 5MiB. When I write a multipart object with a patched rgw, though, 'radosgw-admin object stat' shows that the "part_size" is 4MiB instead of the expected 5 MiB. It also sets the "part_size" of the final part to 0 instead of whatever the expected remainder should be.

#33 Updated by Yehuda Sadeh over 2 years ago

Brian, when you specify 0.94.6-1, do you mean vanilla 0.94.6, or some other distro's version? Commit b7a7d41839558b1a8786a66e27d9d80e606f61c7 cannot be applied cleanly in hammer, I assume you took another commit (maybe the one that I specified)?
Compare your branch to this branch https://github.com/yehudasa/ceph/tree/wip-15886-hammer-test, so that we're sure we're talking about the same code base.

#34 Updated by Brian Felton over 2 years ago

Yehuda,

I appreciate the sanity check here. I had pulled the correct patch for 15745, but I definitely hosed up what I merged in for this issue. I'm going to fix this, kick off a build, and test again in the morning. My apologies for any false alarms.

#35 Updated by Brian Felton over 2 years ago

I've recompiled with the correct code, and things are looking good.

Also available in: Atom PDF