Bug #47383

Multipart uploads fail when rgw_obj_stripe_size is configured to be larger than the default 4MiB

Added by Jared Baker 10 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Submitting this bug report as per request from Matt Benjamin in ceph-users thread https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RUH7Z44GL7TMATASP5BNLSZKVZSJFTZ2/

Since upgrading our cluster to Ceph Nautilus 14.2.11 we noticed that multipart uploads from our custom application, which uses the AWS Java SDK to perform configurable multipart uploads, were failing: radosgw returned status 500 for those uploads. Non-multipart uploads worked fine, and so did multipart uploads with part sizes SMALLER than 16 MiB. We isolated the issue to part sizes of 16 MiB or larger, and further to our ceph config, where we were setting rgw_obj_stripe_size to 67108864 bytes (64 MiB). Commenting out this config parameter and bouncing the radosgw service allowed us to use large part sizes again (we sometimes use 2 GB part sizes for large genomic files). I also tried setting rgw_obj_stripe_size to 20 MiB, but this still caused multipart uploads with larger part sizes to fail. We were able to reproduce both the problem and the workaround in our lab ceph cluster, which runs Nautilus 14.2.10.
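For reference, the triggering setting looked roughly like the ceph.conf fragment below (the rgw section name is illustrative; use whatever your radosgw instance is called):

```ini
; ceph.conf -- rgw section; instance name below is an assumption
[client.rgw.gateway1]
; 64 MiB stripe size (default is 4 MiB). Commenting this out and
; restarting radosgw worked around the multipart upload failures.
rgw_obj_stripe_size = 67108864
```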

If you want to test this yourself, configure the aws cli (in ~/.aws/config) to use a larger part size (8 MiB is the default and will not reproduce the problem) with something like:

s3 =
    multipart_chunksize = 32MB
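As a back-of-the-envelope check on why larger part sizes get used at all (fewer parts per object, and S3 caps a multipart upload at 10,000 parts), here is a small stdlib-only sketch; the numbers mirror the repro above:

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3

def part_count(object_size: int, part_size: int) -> int:
    """Number of parts a multipart upload needs at a given part size."""
    return math.ceil(object_size / part_size)

# The 4 GiB repro file with the 32 MiB chunk size configured above:
print(part_count(4 * GiB, 32 * MiB))         # 128
# Same file with the awscli default 8 MiB chunks (does not reproduce):
print(part_count(4 * GiB, 8 * MiB))          # 512
# A 10 TiB genomic file uploaded with 2 GiB parts stays well under
# the 10,000-part S3 limit:
print(part_count(10 * 1024 * GiB, 2 * GiB))  # 5120
```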

rgw server logs during a failed multipart upload (32MB chunk/part size):

2020-09-08 15:59:36.054 7f2d32fa6700 1 ====== starting new request req=0x55953dc36930 =====
2020-09-08 15:59:36.082 7f2d32fa6700 -1 res_query() failed
2020-09-08 15:59:36.138 7f2d32fa6700 1 ====== req done req=0x55953dc36930 op status=0 http_status=200 latency=0.0839988s ======
2020-09-08 16:00:07.285 7f2d3dfbc700 1 ====== starting new request req=0x55953dc36930 =====
2020-09-08 16:00:07.285 7f2d3dfbc700 -1 res_query() failed
2020-09-08 16:00:07.353 7f2d00741700 1 ====== starting new request req=0x55954dd5e930 =====
2020-09-08 16:00:07.357 7f2d00741700 -1 res_query() failed
2020-09-08 16:00:07.413 7f2cc56cb700 1 ====== starting new request req=0x55953dc02930 =====
2020-09-08 16:00:07.417 7f2cc56cb700 -1 res_query() failed
2020-09-08 16:00:07.473 7f2cb26a5700 1 ====== starting new request req=0x5595426f6930 =====
2020-09-08 16:00:07.473 7f2cb26a5700 -1 res_query() failed
2020-09-08 16:00:09.465 7f2d3dfbc700 0 WARNING: set_req_state_err err_no=35 resorting to 500
2020-09-08 16:00:09.465 7f2d3dfbc700 1 ====== req done req=0x55953dc36930 op status=-35 http_status=500 latency=2.17997s ======
2020-09-08 16:00:09.549 7f2d00741700 0 WARNING: set_req_state_err err_no=35 resorting to 500
2020-09-08 16:00:09.549 7f2d00741700 1 ====== req done req=0x55954dd5e930 op status=-35 http_status=500 latency=2.19597s ======
2020-09-08 16:00:09.605 7f2cc56cb700 0 WARNING: set_req_state_err err_no=35 resorting to 500
2020-09-08 16:00:09.609 7f2cc56cb700 1 ====== req done req=0x55953dc02930 op status=-35 http_status=500 latency=2.19597s ======
2020-09-08 16:00:09.641 7f2cb26a5700 0 WARNING: set_req_state_err err_no=35 resorting to 500
2020-09-08 16:00:09.641 7f2cb26a5700 1 ====== req done req=0x5595426f6930 op status=-35 http_status=500 latency=2.16797s ======

awscli client-side output during a failed multipart upload:

root@jump:~# aws --no-verify-ssl --endpoint-url http://lab-object.cancercollaboratory.org:7480 s3 cp 4GBfile s3://troubleshooting
upload failed: ./4GBfile to s3://troubleshooting/4GBfile An error occurred (UnknownError) when calling the UploadPart operation (reached max retries: 2): Unknown