Bug #13764: Radosgw incomplete files

Added by George Mihaiescu over 8 years ago. Updated about 4 years ago.

Status: Closed
Priority: Normal
Target version: -
% Done: 0%
Source: other
Tags: radosgw
Regression: No
Severity: 2 - major

Description

Hi,

We have a Ceph cluster running on Ubuntu 14.04 with Hammer 0.94.5-1trusty that is used primarily to store large genomics files in S3. We have a custom upload client that uploads in 1 GB parts, and the Radosgw servers (three of them behind haproxy) use a stripe size of 64 MB.
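
For reference, the 64 MB stripe corresponds to a radosgw stripe-size setting in ceph.conf along these lines (the section name below is only illustrative, not our actual instance name):

[client.radosgw.gateway]   # illustrative section name
rgw obj stripe size = 67108864   # 64 MB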

Our custom upload client uploads the data to "bucket_name/data" but also creates zero-byte files with the same name in "bucket_name/upload" to keep track of the state of the upload; we later delete these from the "upload" pseudo-folder.
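
For illustration, the state marker is roughly equivalent to putting an empty object with the AWS CLI (the command below is only an approximation of what our client does through the S3 API):

$ aws --profile coll --endpoint-url https://xxx s3api put-object --bucket bucket_name --key upload/16a029f6-5b18-58da-be08-3fccbc64946c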

After we uploaded 112 TB of files (replica 3), we initiated a QC process in which we download each file, check its md5sum, and slice it (read parts of it).
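
Roughly, the per-file QC is equivalent to the following (the byte range in the last command is just an example; our client does the actual slicing):

$ aws --profile coll --endpoint-url https://xxx s3 cp s3://bucket_name/data/16a029f6-5b18-58da-be08-3fccbc64946c ./16a029f6-5b18-58da-be08-3fccbc64946c
$ md5sum ./16a029f6-5b18-58da-be08-3fccbc64946c
$ aws --profile coll --endpoint-url https://xxx s3api get-object --bucket bucket_name --key data/16a029f6-5b18-58da-be08-3fccbc64946c --range bytes=0-1048575 ./slice.part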

Most files were fine, except for 6 of them, which have missing parts. The object size reported by radosgw-admin for the broken S3 files is the correct size the objects should have:

root@controller1:~# radosgw-admin object stat --bucket=bucket_name --object=data/16a029f6-5b18-58da-be08-3fccbc64946c| grep obj_size
"obj_size": 86027596254,

If I check the object size with the AWS CLI client, the size is the same:
$ aws --profile coll --endpoint-url https://xxx s3 ls s3://bucket_name/data/16a029f6-5b18-58da-be08-3fccbc64946c
2015-10-22 04:21:48 86027596254 16a029f6-5b18-58da-be08-3fccbc64946c

If I check rados for objects matching the prefix, I get a large number of shadow files and fewer files containing the string "multipart".

root@controller1:~# grep -c shadow 16a029f6-5b18-58da-be08-3fccbc64946c_rados_obj
1201

root@controller1:~# grep multipart -c 16a029f6-5b18-58da-be08-3fccbc64946c_rados_obj
81
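
The *_rados_obj listing above can be reproduced with something along these lines (assuming the default .rgw.buckets data pool; adjust the pool name if yours differs):

root@controller1:~# rados -p .rgw.buckets ls | grep 16a029f6-5b18-58da-be08-3fccbc64946c > 16a029f6-5b18-58da-be08-3fccbc64946c_rados_obj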

I checked the OSD for one of the rados objects that Ceph seems unable to find, and the file is there, but it's hard to tell from the logs which parts Ceph is complaining are missing.
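
For reference, a rados object can be mapped to the OSDs that hold it with something like the following (the object name is one of the multipart pieces from the log below; the .rgw.buckets pool is an assumption):

root@controller1:~# ceph osd map .rgw.buckets 'default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1'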

This is a snippet of the Radosgw log. There are errors in it, but I don't understand what they mean:

2015-11-09 12:03:51.258857 7f7cfd7fa700 1 -- 172.25.12.17:0/2518825 --> 172.25.12.12:6852/5663 -- osd_op(client.42408795.0:250 default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1 [read 41943040~4194304] 25.9abfe22a ack+read+known_if_redirected e19660) v5 - ?+0 0x7f7ce0014a00 con 0x7f7cdc042530
2015-11-09 12:03:51.258874 7f7cfd7fa700 20 rados->aio_operate r=0 bl.length=0
2015-11-09 12:03:51.258883 7f7cfd7fa700 20 rados->get_obj_iterate_cb oid=default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1 obj-ofs=46137344 read_ofs=46137344 len=4194304
2015-11-09 12:03:51.258912 7f7cfd7fa700 1 -- 172.25.12.17:0/2518825 --> 172.25.12.12:6852/5663 -- osd_op(client.42408795.0:251 default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1 [read 46137344~4194304] 25.9abfe22a ack+read+known_if_redirected e19660) v5 - ?+0 0x7f7ce0015680 con 0x7f7cdc042530
2015-11-09 12:03:51.258927 7f7cfd7fa700 20 rados->aio_operate r=0 bl.length=0
2015-11-09 12:03:51.258931 7f7cfd7fa700 20 RGWObjManifest::operator++(): rule->part_size=1073741824 rules.size()=2
2015-11-09 12:03:51.258933 7f7cfd7fa700 20 RGWObjManifest::operator++(): stripe_ofs=67108864 part_ofs=0 rule->part_size=1073741824
2015-11-09 12:03:51.258937 7f7cfd7fa700 0 RGWObjManifest::operator++(): result: ofs=67108864 stripe_ofs=67108864 part_ofs=0 rule->part_size=1073741824
2015-11-09 12:03:51.259221 7f7bdc2f3700 1 -- 172.25.12.17:0/2518825 <== osd.56 172.25.12.12:6852/5663 1 ==== osd_op_reply(244 default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1 [read 50331648~0] v0'0 uv28946 ondisk = 0) v6 ==== 274+0+0 (1420548869 0 0) 0x7f7a68000940 con 0x7f7cdc042530
2015-11-09 12:03:51.259317 7f7d63fff700 20 get_obj_aio_completion_cb: io completion ofs=50331648 len=4194304
2015-11-09 12:03:51.259525 7f7cfdffb700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -5
2015-11-09 12:03:51.259535 7f7cfdffb700 10 get_obj_iterate() r=-5, canceling all io
2015-11-09 12:03:51.259537 7f7cfdffb700 20 get_obj_data::cancel_all_io()
2015-11-09 12:03:51.259541 7f7cfdffb700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2015-11-09 12:03:51.259645 7f7cfdffb700 2 req 1:0.045790:s3:GET /oicr.icgc/data/16a029f6-5b18-58da-be08-3fccbc64946c:get_obj:http status=500
2015-11-09 12:03:51.259656 7f7cfdffb700 1 ====== req done req=0x7f7cdc004e00 http_status=500 ======

root@storage6-r2:/var/lib/ceph/osd/ceph-443/current/25.1f47_head# ls -l | grep odmj73MQjOY4tAqaLdk7HKIpPfo14i5.109
-rw-r--r-- 1 root root 67108864 Oct 8 09:31 default.34461213.1\u\ushadow\udata\s9f65bfd1-0846-55ef-9043-4bfa0bc3fdef.2~odmj73MQjOY4tAqaLdk7HKIpPfo14i5.109\u8__head_85419F47__19
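
The individual multipart/shadow pieces can also be checked directly with rados, e.g. (again assuming the .rgw.buckets pool):

root@controller1:~# rados -p .rgw.buckets stat 'default.34461213.1__multipart_data/16a029f6-5b18-58da-be08-3fccbc64946c.2~5f2SyGnw9S9oBXdPVkzyll97ZFQ739k.1'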

While researching this issue, I found this older bug report (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001408.html), but we run a version of Ceph where that patch is already applied, so our issue should be a different one.
Is there a way we can troubleshoot this better?

If needed, I can provide more logs.

Thank you,
George


Files

radosgw-controller1.log (58.2 KB) - George Mihaiescu, 11/13/2015 03:07 AM

Related issues 1 (0 open, 1 closed)

Related to rgw - Bug #15886: Multipart Object Corruption (Resolved, 05/13/2016)
