
Bug #13764

Radosgw incomplete files

Added by George Mihaiescu over 8 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
other
Tags:
radosgw
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We have a Ceph cluster running on Ubuntu 14.04 with Hammer 0.94.5-1trusty that is used primarily to store large genomics files in S3. We have a custom upload client that uploads in 1 GB parts, and the radosgw servers (three of them behind haproxy) use a stripe size of 64 MB.

Our custom upload client uploads the data under "bucket_name/data" but also creates zero-byte files with the same name under "bucket_name/upload" to track the state of the upload; we delete these from the "upload" pseudo-folder once the upload completes.
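
For reference, the upload pattern is roughly the following (a minimal boto3 sketch, not our actual client code; the endpoint, bucket and key names are placeholders):

import boto3

PART_SIZE = 1024 * 1024 * 1024  # the client uploads in 1 GiB parts

s3 = boto3.client('s3', endpoint_url='https://xxx')  # radosgw behind haproxy

def upload(bucket, name, path):
    data_key = 'data/' + name
    state_key = 'upload/' + name

    # zero-byte state marker kept for the duration of the upload
    s3.put_object(Bucket=bucket, Key=state_key, Body=b'')

    mpu = s3.create_multipart_upload(Bucket=bucket, Key=data_key)
    parts = []
    with open(path, 'rb') as f:
        num = 1
        while True:
            chunk = f.read(PART_SIZE)          # one 1 GiB part at a time
            if not chunk:
                break
            r = s3.upload_part(Bucket=bucket, Key=data_key,
                               UploadId=mpu['UploadId'],
                               PartNumber=num, Body=chunk)
            parts.append({'PartNumber': num, 'ETag': r['ETag']})
            num += 1
    s3.complete_multipart_upload(Bucket=bucket, Key=data_key,
                                 UploadId=mpu['UploadId'],
                                 MultipartUpload={'Parts': parts})

    # once the upload is done the state marker is removed
    s3.delete_object(Bucket=bucket, Key=state_key)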

After we uploaded 112 TB of files (replica 3), we initiated a QC process where we download each file, check its md5sum, and slice it (read parts of it).
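
The QC step itself is nothing fancy; roughly this (again a boto3 sketch with placeholder names, assuming we know the original md5 of each file):

import hashlib

import boto3

s3 = boto3.client('s3', endpoint_url='https://xxx')

def qc(bucket, key, expected_md5):
    # full download, hashing as we stream
    md5 = hashlib.md5()
    body = s3.get_object(Bucket=bucket, Key=key)['Body']
    for chunk in body.iter_chunks(chunk_size=64 * 1024 * 1024):
        md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        raise RuntimeError('md5 mismatch for %s' % key)

    # "slicing": a ranged GET somewhere in the middle of the object
    sliced = s3.get_object(Bucket=bucket, Key=key,
                           Range='bytes=1073741824-1077936127')
    sliced['Body'].read()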

Most files were fine, except for six of them, which have missing parts. The object size reported by radosgw-admin for the broken S3 files is the correct size the objects should have:

root@controller1:~# radosgw-admin object stat --bucket=bucket_name --object=data/16a029f6-5b18-58da-be08-3fccbc64946c| grep obj_size
"obj_size": 86027596254,

If I check the object size with the AWS CLI client, the size is the same:
$ aws --profile coll --endpoint-url https://xxx s3 ls s3://bucket_name/data/16a029f6-5b18-58da-be08-3fccbc64946c
2015-10-22 04:21:48 86027596254 16a029f6-5b18-58da-be08-3fccbc64946c

If I check rados for objects matching the prefix, I get a large number of shadow files and fewer files containing the string "multipart".

root@controller1:~# grep -c shadow 16a029f6-5b18-58da-be08-3fccbc64946c_rados_obj
1201

root@controller1:~# grep multipart -c 16a029f6-5b18-58da-be08-3fccbc64946c_rados_obj
81
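
For what it's worth, those counts match what I would expect if each 1 GiB part is stored as one "multipart" stripe plus additional 64 MiB "shadow" stripes (that layout is my assumption, but the arithmetic works out exactly):

obj_size = 86027596254          # from radosgw-admin object stat
part_size = 1024 ** 3           # 1 GiB upload parts
stripe_size = 64 * 1024 ** 2    # rgw_obj_stripe_size

multipart = shadow = 0
ofs = 0
while ofs < obj_size:
    part_len = min(part_size, obj_size - ofs)
    stripes = -(-part_len // stripe_size)   # ceil(part_len / stripe_size)
    multipart += 1                          # first stripe of each part
    shadow += stripes - 1                   # remaining stripes of the part
    ofs += part_len

print(multipart, shadow)                    # 81 1201, same as the grep counts above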

I checked the OSD for one of the rados objects that Ceph does not seem to find, and the file is there, but it's hard to tell from the logs which parts Ceph is complaining are missing.

This is a snippet of the radosgw log. There are errors in it, but I don't understand what they mean:

2015-11-09 12:03:51.258857 7f7cfd7fa700 1 -- 172.25.12.17:0/2518825 --> 172.25.12.12:6852/5663 -- osd_op(client.42408795.0:250 default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1 [read 41943040~4194304] 25.9abfe22a ack+read+known_if_redirected e19660) v5 -- ?+0 0x7f7ce0014a00 con 0x7f7cdc042530
2015-11-09 12:03:51.258874 7f7cfd7fa700 20 rados->aio_operate r=0 bl.length=0
2015-11-09 12:03:51.258883 7f7cfd7fa700 20 rados->get_obj_iterate_cb oid=default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1 obj-ofs=46137344 read_ofs=46137344 len=4194304
2015-11-09 12:03:51.258912 7f7cfd7fa700 1 -- 172.25.12.17:0/2518825 --> 172.25.12.12:6852/5663 -- osd_op(client.42408795.0:251 default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1 [read 46137344~4194304] 25.9abfe22a ack+read+known_if_redirected e19660) v5 -- ?+0 0x7f7ce0015680 con 0x7f7cdc042530
2015-11-09 12:03:51.258927 7f7cfd7fa700 20 rados->aio_operate r=0 bl.length=0
2015-11-09 12:03:51.258931 7f7cfd7fa700 20 RGWObjManifest::operator++(): rule->part_size=1073741824 rules.size()=2
2015-11-09 12:03:51.258933 7f7cfd7fa700 20 RGWObjManifest::operator++(): stripe_ofs=67108864 part_ofs=0 rule->part_size=1073741824
2015-11-09 12:03:51.258937 7f7cfd7fa700 0 RGWObjManifest::operator++(): result: ofs=67108864 stripe_ofs=67108864 part_ofs=0 rule->part_size=1073741824
2015-11-09 12:03:51.259221 7f7bdc2f3700 1 -- 172.25.12.17:0/2518825 <== osd.56 172.25.12.12:6852/5663 1 ==== osd_op_reply(244 default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1 [read 50331648~0] v0'0 uv28946 ondisk = 0) v6 ==== 274+0+0 (1420548869 0 0) 0x7f7a68000940 con 0x7f7cdc042530
2015-11-09 12:03:51.259317 7f7d63fff700 20 get_obj_aio_completion_cb: io completion ofs=50331648 len=4194304
2015-11-09 12:03:51.259525 7f7cfdffb700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -5
2015-11-09 12:03:51.259535 7f7cfdffb700 10 get_obj_iterate() r=-5, canceling all io
2015-11-09 12:03:51.259537 7f7cfdffb700 20 get_obj_data::cancel_all_io()
2015-11-09 12:03:51.259541 7f7cfdffb700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2015-11-09 12:03:51.259645 7f7cfdffb700 2 req 1:0.045790:s3:GET /oicr.icgc/data/16a029f6-5b18-58da-be08-3fccbc64946c:get_obj:http status=500
2015-11-09 12:03:51.259656 7f7cfdffb700 1 ====== req done req=0x7f7cdc004e00 http_status=500 ======
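
The only thing I can decode are the error numbers themselves; they are negated errno values (the -2 that shows up in later logs is ENOENT):

import errno, os

for code in (5, 2):
    print(-code, errno.errorcode[code], os.strerror(code))
# -5 EIO Input/output error
# -2 ENOENT No such file or directory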

root@storage6-r2:/var/lib/ceph/osd/ceph-443/current/25.1f47_head# ls -l | grep odmj73MQjOY4tAqaLdk7HKIpPfo14i5.109
-rw-r--r-- 1 root root 67108864 Oct 8 09:31 default.34461213.1\u\ushadow\udata\s9f65bfd1-0846-55ef-9043-4bfa0bc3fdef.2~odmj73MQjOY4tAqaLdk7HKIpPfo14i5.109\u8__head_85419F47__19

While researching this issue I found an older bug report (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001408.html), but we are running a version of Ceph where that patch is already applied, so our issue must be a different one.
Is there a way we can troubleshoot this better?

If needed, I can provide more logs.

Thank you,
George

radosgw-controller1.log (58.2 KB) George Mihaiescu, 11/13/2015 03:07 AM


Related issues

Related to rgw - Bug #15886: Multipart Object Corruption Resolved 05/13/2016

History

#1 Updated by Yehuda Sadeh over 8 years ago

Looking at this log snippet, I can't really tell what's happening. It seems that we try sending back data to the client but get -EIO (-5). Maybe there was another error previously that got the client to disconnect?

#2 Updated by George Mihaiescu over 8 years ago

Thank you for looking at this Yehuda.

I uploaded the last 500 log lines, but I can reproduce the error and provide new logs if needed.

Also, as part of the same QC process we encountered some other files today that cannot be downloaded although they show up in the bucket index with the correct size.

The files in this case are smaller, so I managed to find the rados object for one small file that cannot be downloaded, but when I looked on the three OSDs the rados object has size 0.

What is weird is that while looking on these OSDs I noticed other rados objects that have size 0, so I tried to download the S3 objects they belonged to and that worked. The only difference I can see is that a working S3 object with a zero-size rados object also has a rados object containing "multipart" in its name.

For example, this is a bad S3 object that cannot be downloaded:

ubuntu@os-client-6:/local$ aws --profile coll --endpoint-url https://xxx s3 cp s3://oicr.icgc/data/25e2943e-e5e5-5c17-99f3-b489836dd348 .
download failed: s3://oicr.icgc/data/25e2943e-e5e5-5c17-99f3-b489836dd348 to ./25e2943e-e5e5-5c17-99f3-b489836dd348 A client error (NoSuchKey) occurred when calling the GetObject operation: Unknown

The bucket index has the right size for the file, so the file must have been seen entirely by radosgw:
root@controller1:/tmp# radosgw-admin object stat --bucket=oicr.icgc --object=data/25e2943e-e5e5-5c17-99f3-b489836dd348
{
    "name": "data\/25e2943e-e5e5-5c17-99f3-b489836dd348",
    "size": 1377812,
    "policy": {
        "acl": {
            "acl_user_map": [
                {
                    "user": "YYY",
                    "acl": 15
                }
            ],
            "acl_group_map": [],
            "grant_map": [
                {
                    "id": "YYY",
                    "grant": {
                        "type": {
                            "type": 0
                        },
                        "id": "YYY",
                        "email": "",
                        "permission": {
                            "flags": 15
                        },
                        "name": "XXX",
                        "group": 0
                    }
                }
            ]
        },
        "owner": {
            "id": "YYY",
            "display_name": "XXX"
        }
    },
    "etag": "de7afa4229643e344068ae9e3e47d153-1\u0000",
    "tag": "default.40302537.1886793\u0000",
    "manifest": {
        "objs": [],
        "obj_size": 1377812,
        "explicit_objs": "false",
        "head_obj": {
            "bucket": {
                "name": "oicr.icgc",
                "pool": ".rgw.buckets",
                "data_extra_pool": ".rgw.buckets.extra",
                "index_pool": ".rgw.buckets.index",
                "marker": "default.34461213.1",
                "bucket_id": "default.34461213.1"
            },
            "key": "",
            "ns": "",
            "object": "data\/25e2943e-e5e5-5c17-99f3-b489836dd348",
            "instance": ""
        },
        "head_size": 0,
        "max_head_size": 0,
        "prefix": "data\/25e2943e-e5e5-5c17-99f3-b489836dd348.2~Os5X4vbqfr79BdwAzoCJ8S-oq9mEP3K",
        "tail_bucket": {
            "name": "oicr.icgc",
            "pool": ".rgw.buckets",
            "data_extra_pool": ".rgw.buckets.extra",
            "index_pool": ".rgw.buckets.index",
            "marker": "default.34461213.1",
            "bucket_id": "default.34461213.1"
        },
        "rules": [
            {
                "key": 0,
                "val": {
                    "start_part_num": 1,
                    "start_ofs": 0,
                    "part_size": 0,
                    "stripe_max_size": 67108864,
                    "override_prefix": ""
                }
            }
        ]
    },
    "attrs": {
        "user.rgw.content_type": "application\/x-www-form-urlencoded; charset=utf-8\u0000"
    }
}

I exported the names of all rados objects in the pool to a file to make searching easier (there are 1.9 million objects):
root@controller1:~# grep 25e2943e-e5e5-5c17-99f3-b489836dd348 bucket_contents.txt
default.34461213.1_data/25e2943e-e5e5-5c17-99f3-b489836dd348 -> zero size file on all three OSDs

The other file that also has a zero-size rados object has an additional rados object with "_multipart" in its name, which actually contains the data:
root@controller1:~# grep bbcbf57d-bdca-5ed2-b4fe-f1230bd7e5d7 bucket_contents.txt
default.34461213.1_data/bbcbf57d-bdca-5ed2-b4fe-f1230bd7e5d7
default.34461213.1__multipart_data/bbcbf57d-bdca-5ed2-b4fe-f1230bd7e5d7.2~mYnsvagOkIj59F4SvPSbBXqo-5z8tPh.1
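
By analogy with the healthy object, I would have expected the broken one to have a tail object named after its manifest prefix with a ".1" suffix (that tail name is my guess). A quick way to check both the head and the guessed tail is rados stat:

import subprocess

pool = '.rgw.buckets'
marker = 'default.34461213.1'
prefix = 'data/25e2943e-e5e5-5c17-99f3-b489836dd348.2~Os5X4vbqfr79BdwAzoCJ8S-oq9mEP3K'

head = marker + '_data/25e2943e-e5e5-5c17-99f3-b489836dd348'
tail = marker + '__multipart_' + prefix + '.1'   # guessed, by analogy with bbcbf57d above

for oid in (head, tail):
    # prints the object's mtime and size, or fails with ENOENT if the object is missing
    subprocess.run(['rados', '-p', pool, 'stat', oid])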

We need to find out what condition causes these sporadic errors before we upload more data, because the QC process is not possible for files we do not upload directly, such as files that are downloaded by other users who expect to be able to retrieve them when needed.

Thank you very much for your support,
George

#3 Updated by George Mihaiescu over 8 years ago

Hi,

I can provide more info about this issue if it helps troubleshooting.

We have rgw_obj_stripe_size = 67108864 in ceph.conf and the client uploads in 1 GB parts.

Regarding the six large BAM files we cannot download: they exhibit different behaviours.

These four BAMs download partial objects which the AWS S3 client cleans up when the download fails:
16a029f6-5b18-58da-be08-3fccbc64946c, "obj_size" reported by radosgw-admin 86027596254, shadow files 1216, multipart files: 82
9f65bfd1-0846-55ef-9043-4bfa0bc3fdef, "obj_size" reported by radosgw-admin 117918990350, shadow files 3266, multipart files: 218
58139353-5c8b-52e7-967a-4fe02a59159d "obj_size" reported by radosgw-admin 151239003780, shadow files 2128, multipart files: 142
ea320e35-11be-5231-881b-b4e283789d30, "obj_size" reported by radosgw-admin 143640709050, shadow files 2139, multipart files: 134

These two uploaded objects have no Rados objects with "shadow" or "multipart" in their names, and the download fails immediately:
1645434a-b6c6-5d96-a252-cdb5ae1c5d20, "obj_size" reported by radosgw-admin 114675769103, shadow files 0, multipart files: 0
534acdce-ffda-5aef-ad77-3fabfb9317d5, "obj_size" reported by radosgw-admin 90328301465, shadow files 0, multipart files: 0

ubuntu@os-client-6:/local$ aws --profile coll --endpoint-url https://XXX s3api get-object --bucket oicr.icgc --key data/1645434a-b6c6-5d96-a252-cdb5ae1c5d20 1645434a-b6c6-5d96-a252-cdb5ae1c5d20
A client error (NoSuchKey) occurred when calling the GetObject operation: Unknown

ubuntu@os-client-6:/local$ aws --profile coll --endpoint-url https://XXX s3api get-object --bucket oicr.icgc --key data/534acdce-ffda-5aef-ad77-3fabfb9317d5 534acdce-ffda-5aef-ad77-3fabfb9317d5
A client error (NoSuchKey) occurred when calling the GetObject operation: Unknown

From the radosgw log it looks like it maps the missing parts to OSDs but the data doesn't exist there:

2015-11-17 12:10:03.771421 7f79f8ff9700 20 get_obj_state: rctx=0x7f79f8ff4160 obj=oicr.icgc:data/534acdce-ffda-5aef-ad77-3fabfb9317d5 state=0x7f79dc0296e0 s->prefetch_data=1
2015-11-17 12:10:03.771468 7f79f8ff9700 20 rados->get_obj_iterate_cb oid=default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 obj-ofs=50952404992 read_ofs=16777216 len=4194304
2015-11-17 12:10:03.771615 7f79f8ff9700 1 -- 172.25.12.17:0/4923329 --> 172.25.12.64:6860/6690 -- osd_op(client.37645678.0:40 default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 [read 16777216~4194304] 25.3470418f ack+read+known_if_redirected e20032) v5 -- ?+0 0x7f79dc043910 con 0x7f79dc0424d0
2015-11-17 12:10:03.771669 7f79f8ff9700 20 rados->aio_operate r=0 bl.length=0
2015-11-17 12:10:03.771689 7f79f8ff9700 20 rados->get_obj_iterate_cb oid=default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 obj-ofs=50956599296 read_ofs=20971520 len=4194304
2015-11-17 12:10:03.771738 7f79f8ff9700 1 -- 172.25.12.17:0/4923329 --> 172.25.12.64:6860/6690 -- osd_op(client.37645678.0:41 default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 [read 20971520~4194304] 25.3470418f ack+read+known_if_redirected e20032) v5 -- ?+0 0x7f79dc044970 con 0x7f79dc0424d0
2015-11-17 12:10:03.771768 7f79f8ff9700 20 rados->aio_operate r=0 bl.length=0
2015-11-17 12:10:03.771779 7f79f8ff9700 20 rados->get_obj_iterate_cb oid=default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 obj-ofs=50960793600 read_ofs=25165824 len=4194304
2015-11-17 12:10:03.771807 7f79f8ff9700 1 -- 172.25.12.17:0/4923329 --> 172.25.12.64:6860/6690 -- osd_op(client.37645678.0:42 default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 [read 25165824~4194304] 25.3470418f ack+read+known_if_redirected e20032) v5 -- ?+0 0x7f79dc045920 con 0x7f79dc0424d0
2015-11-17 12:10:03.771828 7f79f8ff9700 20 rados->aio_operate r=0 bl.length=0
2015-11-17 12:10:03.771835 7f79f8ff9700 20 rados->get_obj_iterate_cb oid=default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 obj-ofs=50964987904 read_ofs=29360128 len=4194304
2015-11-17 12:10:03.771861 7f79f8ff9700 1 -- 172.25.12.17:0/4923329 --> 172.25.12.64:6860/6690 -- osd_op(client.37645678.0:43 default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 [read 29360128~4194304] 25.3470418f ack+read+known_if_redirected e20032) v5 -- ?+0 0x7f79dc0468d0 con 0x7f79dc0424d0
2015-11-17 12:10:03.771881 7f79f8ff9700 20 rados->aio_operate r=0 bl.length=0
2015-11-17 12:10:03.771884 7f79f8ff9700 20 RGWObjManifest::operator++(): rule->part_size=1073741824 rules.size()=2
2015-11-17 12:10:03.771886 7f79f8ff9700 20 RGWObjManifest::operator++(): stripe_ofs=51002736640 part_ofs=50465865728 rule->part_size=1073741824
2015-11-17 12:10:03.771888 7f79f8ff9700 0 RGWObjManifest::operator++(): result: ofs=51002736640 stripe_ofs=51002736640 part_ofs=50465865728 rule->part_size=1073741824
2015-11-17 12:10:03.774162 7f79bc5f6700 1 -- 172.25.12.17:0/4923329 <== osd.267 172.25.12.64:6860/6690 1 ==== osd_op_reply(40 default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 [read 16777216~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 274+0+0 (3469143611 0 0) 0x7f77cc000b40 con 0x7f79dc0424d0
2015-11-17 12:10:03.774274 7f7a62ffd700 20 get_obj_aio_completion_cb: io completion ofs=50952404992 len=4194304
2015-11-17 12:10:03.774320 7f7a62ffd700 0 ERROR: got unexpected error when trying to read object: -2
2015-11-17 12:10:03.774337 7f79bc5f6700 1 -- 172.25.12.17:0/4923329 <== osd.267 172.25.12.64:6860/6690 2 ==== osd_op_reply(41 default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 [read 20971520~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 274+0+0 (1014961918 0 0) 0x7f77cc000b40 con 0x7f79dc0424d0
2015-11-17 12:10:03.774356 7f79f8ff9700 10 get_obj_iterate() r=-2, canceling all io
2015-11-17 12:10:03.774367 7f79f8ff9700 20 get_obj_data::cancel_all_io()
2015-11-17 12:10:03.774381 7f7a62ffd700 20 get_obj_aio_completion_cb: io completion ofs=50956599296 len=4194304
2015-11-17 12:10:03.774384 7f7a62ffd700 0 ERROR: got unexpected error when trying to read object: -2
2015-11-17 12:10:03.774426 7f79bc5f6700 1 -- 172.25.12.17:0/4923329 <== osd.267 172.25.12.64:6860/6690 3 ==== osd_op_reply(42 default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7 [read 25165824~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 274+0+0 (777611584 0 0) 0x7f77cc000b40 con 0x7f79dc0424d0
2015-11-17 12:10:03.774462 7f7a62ffd700 20 get_obj_aio_completion_cb: io completion ofs=50960793600 len=4194304
2015-11-17 12:10:03.774473 7f7a62ffd700 0 ERROR: got unexpected error when trying to read object: -2

root@controller1:~# ceph osd map .rgw.buckets default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7
osdmap e20032 pool '.rgw.buckets' (25) object 'default.34461213.1__shadow_data/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7' -> pg 25.3470418f (25.18f) -> up ([267,382,13], p267) acting ([267,382,13], p267)
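
The object name and offsets radosgw asks for do line up with the manifest, assuming 1 GiB parts split into 64 MiB stripes, with the first stripe of each part in the "multipart" object and the rest in "shadow" objects suffixed _1, _2, and so on. A quick cross-check:

part_size = 1024 ** 3           # part_size from the manifest rule
stripe_size = 64 * 1024 ** 2    # stripe_max_size from the manifest rule

ofs = 50952404992               # obj-ofs of the first failed read in the log above
part = ofs // part_size + 1                 # -> 48
in_part = ofs - (part - 1) * part_size
stripe = in_part // stripe_size             # -> 7
read_ofs = in_part - stripe * stripe_size   # -> 16777216

print(part, stripe, read_ofs)   # matches ...eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns.48_7, read_ofs=16777216

So radosgw is asking for exactly the stripe it should be asking for; that rados object simply is not there.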

I checked on osd.267 and there is no matching file in the PG 25.18f directory:
root@storage3-r2:/var/lib/ceph/osd/ceph-267/current/25.18f_head# ls -l | grep 34acdce-ffda-5aef-ad77-3fabfb9317d5

I checked the logs and all scrubs and deep-scrubs for that PG have been clean too.

Thank you,
George

#4 Updated by Yehuda Sadeh over 8 years ago

For 534acdce-ffda-5aef-ad77-3fabfb9317d5, what does the manifest (radosgw-admin object stat) show?

#5 Updated by George Mihaiescu over 8 years ago

It looks normal to me (and the reported size is correct too):

root@controller1:~# radosgw-admin object stat --bucket=oicr.icgc --object=data/534acdce-ffda-5aef-ad77-3fabfb9317d5
{
    "name": "data\/534acdce-ffda-5aef-ad77-3fabfb9317d5",
    "size": 90328301465,
    "policy": {
        "acl": {
            "acl_user_map": [
                {
                    "user": "db45084c7fc2445593dca9ecec97a2f1",
                    "acl": 15
                }
            ],
            "acl_group_map": [],
            "grant_map": [
                {
                    "id": "db45084c7fc2445593dca9ecec97a2f1",
                    "grant": {
                        "type": {
                            "type": 0
                        },
                        "id": "db45084c7fc2445593dca9ecec97a2f1",
                        "email": "",
                        "permission": {
                            "flags": 15
                        },
                        "name": "XXX",
                        "group": 0
                    }
                }
            ]
        },
        "owner": {
            "id": "db45084c7fc2445593dca9ecec97a2f1",
            "display_name": "XXX"
        }
    },
    "etag": "c958f0c6440b059c86c1849711e5b349-85\u0000",
    "tag": "default.37463128.375609\u0000",
    "manifest": {
        "objs": [],
        "obj_size": 90328301465,
        "explicit_objs": "false",
        "head_obj": {
            "bucket": {
                "name": "oicr.icgc",
                "pool": ".rgw.buckets",
                "data_extra_pool": ".rgw.buckets.extra",
                "index_pool": ".rgw.buckets.index",
                "marker": "default.34461213.1",
                "bucket_id": "default.34461213.1"
            },
            "key": "",
            "ns": "",
            "object": "data\/534acdce-ffda-5aef-ad77-3fabfb9317d5",
            "instance": ""
        },
        "head_size": 0,
        "max_head_size": 0,
        "prefix": "data\/534acdce-ffda-5aef-ad77-3fabfb9317d5.2~eHYUCpxpPIQn3HrdqbFYtXm5WkXrHns",
        "tail_bucket": {
            "name": "oicr.icgc",
            "pool": ".rgw.buckets",
            "data_extra_pool": ".rgw.buckets.extra",
            "index_pool": ".rgw.buckets.index",
            "marker": "default.34461213.1",
            "bucket_id": "default.34461213.1"
        },
        "rules": [
            {
                "key": 0,
                "val": {
                    "start_part_num": 1,
                    "start_ofs": 0,
                    "part_size": 1073741824,
                    "stripe_max_size": 67108864,
                    "override_prefix": ""
                }
            },
            {
                "key": 90194313216,
                "val": {
                    "start_part_num": 85,
                    "start_ofs": 90194313216,
                    "part_size": 133988249,
                    "stripe_max_size": 67108864,
                    "override_prefix": ""
                }
            }
        ]
    },
    "attrs": {
        "user.rgw.content_type": "application\/x-www-form-urlencoded; charset=utf-8\u0000"
    }
}
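
The rules also add up exactly to the reported size (84 full 1 GiB parts plus a final part of 133988249 bytes, and the etag suffix "-85" matches the part count):

full_parts = 85 - 1             # start_part_num of the second rule is 85
part_size = 1073741824          # part_size of the first rule
last_part_size = 133988249      # part_size of the second rule

print(full_parts * part_size)                   # 90194313216, the second rule's start_ofs
print(full_parts * part_size + last_part_size)  # 90328301465, equal to obj_size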

#6 Updated by George Mihaiescu over 8 years ago

Hi Yehuda,

Is there any update on this bug report? I know we hit a pretty unusual issue, but I'm hoping we can at least learn what caused it and how to avoid or work around it in the future.

Thank you,
George

#7 Updated by Yehuda Sadeh over 8 years ago

The manifest is normal. The object stripes seem to be missing. There's not much else that I can see here. Do you have the upload log for that specific object (or any other object that has a similar issue)?

#8 Updated by George Mihaiescu almost 8 years ago

We uploaded more than 500 TB to the cluster, then downloaded the data and ran md5sum on it to confirm it matched the original. We found only a few issues, probably caused by overlapping uploads.

We have very large genomics files (~200 GB) that take a long time to upload, and in a few cases two automated upload clients started uploading the same file within a short interval. Both uploads completed successfully, but the resulting file was corrupted.

#9 Updated by Yehuda Sadeh almost 8 years ago

I'd really like to get this one figured out. Do you happen to have any logs for the uploaded objects?

#10 Updated by George Mihaiescu almost 8 years ago

I have logs for a more recent upload corruption case, but I'll send them directly to you because they contain sensitive data.

Thank you,
George

#11 Updated by Nathan Cutler almost 8 years ago

  • Related to Bug #15886: Multipart Object Corruption added

#12 Updated by Casey Bodley about 4 years ago

  • Status changed from New to Closed
