Bug #50141
Status: Closed
consecutive complete-multipart-upload requests following the 1st (successful) one on the same upload-id result in 500
% Done: 100%
Description
When the 200 OK response to a complete-multipart request is lost in network transmission to the client (a Quay geo-replicated zone),
the client re-sends the complete-multipart request, which results in a 500 response,
until after 3 attempts it disables the RGW storage backend and stops functioning.
rgw log:
2021-03-06 03:08:35.173315 7f87b915b700 0 RGWCompleteMultipart::execute() failed to acquire lock
2021-03-06 03:08:35.173341 7f87b915b700 2 req 17767:0.001860:s3:POST /xxx-registry/sha256/14/142835ba5a0abf568fdde5490fae18b16ca4eca56a66e45136b7065428564e63:complete_multipart:completing
2021-03-06 03:08:35.173426 7f87b915b700 2 req 17767:0.001945:s3:POST /xxx-registry/sha256/14/142835ba5a0abf568fdde5490fae18b16ca4eca56a66e45136b7065428564e63:complete_multipart:op status=-2200
2021-03-06 03:08:35.173437 7f87b915b700 2 req 17767:0.001956:s3:POST /xxx-registry/sha256/14/142835ba5a0abf568fdde5490fae18b16ca4eca56a66e45136b7065428564e63:complete_multipart:http status=500
2021-03-06 03:08:35.173443 7f87b915b700 1 ====== req done req=0x7f87b9154f70 op status=-2200 http_status=500 ======
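The client-side behaviour described above can be sketched as follows (a minimal Python sketch with hypothetical names; the retry budget of 3 comes from the description, the 500s from the rgw log):

```python
# Sketch of the Quay client behaviour described in this report:
# the 200 OK of the 1st complete-multipart is lost, every retry hits
# the RGW 500, and after 3 failed attempts the client disables the
# RGW storage backend. Function and return-value names are illustrative.

def complete_with_retries(send_complete, max_attempts=3):
    """send_complete() returns an HTTP status code for one attempt."""
    for _ in range(max_attempts):
        if send_complete() == 200:
            return "completed"
    return "backend_disabled"

# Simulate the bug: the client only ever observes 500 responses.
result = complete_with_retries(lambda: 500)
print(result)  # backend_disabled
```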
Updated by Mark Kogan about 3 years ago
Reproducing below a flow similar to a Quay geo-replication zone configuration, using the aws cli:
1st multipart-complete:
dd if=/dev/zero of=container.img bs=1M count=4

aws --endpoint-url http://localhost:8000 s3 mb s3://alln-registry

aws --endpoint-url http://localhost:8000 s3api create-multipart-upload --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' | jq
{
    "Bucket": "alln-registry",
    "Key": "datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f",
    "UploadId": "2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w"
}

export UPID='2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w'

aws --endpoint-url http://localhost:8000 s3api upload-part --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' --part-number 1 --body ./container.img --upload-id "${UPID}"
{
    "ETag": "\"b5cfa9d6c8febd618f91ac2843d50a1c\""
}

aws --endpoint-url http://localhost:8000 s3api list-parts --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' --upload-id "${UPID}"
{
    "Parts": [
        {
            "PartNumber": 1,
            "LastModified": "2021-04-07T11:50:59.516Z",
            "ETag": "\"b5cfa9d6c8febd618f91ac2843d50a1c\"",
            "Size": 4194304
        }
    ],
    "Initiator": null,
    "Owner": {
        "DisplayName": "",
        "ID": ""
    },
    "StorageClass": "STANDARD"
}

cat <<_EOF > fileparts.json
{
    "Parts": [{
        "ETag": "b5cfa9d6c8febd618f91ac2843d50a1c",
        "PartNumber": 1
    }]
}
_EOF

aws --endpoint-url http://localhost:8000 s3api complete-multipart-upload --multipart-upload file://fileparts.json --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' --upload-id "${UPID}"
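The part ETag in the transcript above can be reproduced locally: for an individual uploaded part, RGW (like S3) uses the MD5 of the part body as the ETag, and the part here is 4 MiB of zeros (dd bs=1M count=4). A quick check, assuming that convention:

```python
import hashlib

# The transcript uploads one 4 MiB part of zeros; the part ETag is the
# MD5 of the part body, so it can be recomputed without talking to RGW.
part_body = bytes(4 * 1024 * 1024)  # same as: dd if=/dev/zero bs=1M count=4
etag = hashlib.md5(part_body).hexdigest()
print(etag)  # expected to match the ETag reported by upload-part/list-parts above
```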
in gdb breakpoint at src/rgw/rgw_op.cc : RGWCompleteMultipart::execute()
serializer = meta_obj->get_serializer("RGWCompleteMultipart");
op_ret = serializer->try_lock(dur, y);

Thread 99 "civetweb-worker" hit Breakpoint 2, RGWCompleteMultipart::execute (this=0x7ffed0016640, y=...) at ../src/rgw/rgw_op.cc:5956
5956      if (op_ret < 0) {
(gdb) p op_ret
$1 = 0
On the 1st multipart-complete request the execution flows as previously and does not enter the new flow (i.e. it does not check for the existence of the first part object before completing the multipart upload).
2nd multipart-complete - when the 200 OK of the 1st complete-multipart is lost in transmission (a network problem occurred), Quay re-sends the complete-multipart request:
aws --endpoint-url http://localhost:8000 s3api complete-multipart-upload --multipart-upload file://fileparts.json --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' --upload-id "${UPID}"
in gdb breakpoint at src/rgw/rgw_op.cc : RGWCompleteMultipart::execute()
serializer = meta_obj->get_serializer("RGWCompleteMultipart");
op_ret = serializer->try_lock(dur, y);

Thread 99 "civetweb-worker" hit Breakpoint 2, RGWCompleteMultipart::execute (this=0x7ffed0016640, y=...) at ../src/rgw/rgw_op.cc:5956
5956      if (op_ret < 0) {
(gdb) p op_ret
$2 = -2
In this scenario the new flow is executed.
rgw log:
2021-04-07T16:32:12.592+0300 7ffedeffd700 0 req 47 3.161037683s s3:complete_multipart failed to acquire lock
2021-04-07T16:32:12.592+0300 7ffedeffd700 20 req 47 3.161037683s s3:complete_multipart check already complete multipart: 1c816972-ff50-47f6-9d88-88c87c8b9343.4183.1__multipart_datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f.2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w.1
2021-04-07T16:32:12.593+0300 7ffedeffd700 2 req 47 3.162037611s s3:complete_multipart completing
2021-04-07T16:32:12.594+0300 7ffedeffd700 20 req 47 3.163037539s get_system_obj_state: rctx=0x7ffedeff8470 obj=default.rgw.log:script.postrequest. state=0x7fffe40e10a0 s->prefetch_data=0
2021-04-07T16:32:12.594+0300 7ffedeffd700 20 cache get: touching lru, lru_counter=59 promotion_ts=55
2021-04-07T16:32:12.594+0300 7ffedeffd700 10 moving default.rgw.log++script.postrequest. to cache LRU end
2021-04-07T16:32:12.594+0300 7ffedeffd700 10 cache get: name=default.rgw.log++script.postrequest. : hit (negative entry)
2021-04-07T16:32:12.594+0300 7ffedeffd700 2 req 47 3.163037539s s3:complete_multipart op status=1901
2021-04-07T16:32:12.594+0300 7ffedeffd700 2 req 47 3.163037539s s3:complete_multipart http status=202
2021-04-07T16:32:12.594+0300 7ffedeffd700 1 ====== req done req=0x7ffedeffa5a0 op status=1901 http_status=202 latency=3.163037539s ======
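The two flows above can be modeled roughly as follows (a simplified Python model of the decision logic, not the actual RGW C++ code; names and the status mapping are taken from the gdb output and logs in this report):

```python
# Simplified model of RGWCompleteMultipart::execute() around the
# serializer lock. On the 1st request try_lock succeeds (op_ret = 0)
# and completion proceeds. On a retry try_lock fails (op_ret = -2 in
# the gdb session above), and the new flow stats the first part object
# to decide between "already complete" (202) and NoSuchUpload (404).

ENOENT = -2

def complete_multipart(lock_held, first_part_exists):
    op_ret = ENOENT if lock_held else 0  # result of serializer->try_lock()
    if op_ret < 0:
        # "check already complete multipart": stat the first part object
        if first_part_exists:
            return 202  # completion already done / in progress
        return 404      # wrong key/bucket/UploadId -> NoSuchUpload
    return 200          # normal 1st-request completion path

print(complete_multipart(lock_held=False, first_part_exists=True))   # 200
print(complete_multipart(lock_held=True,  first_part_exists=True))   # 202
print(complete_multipart(lock_held=True,  first_part_exists=False))  # 404
```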
Potential false-positive scenarios that are handled correctly:
1. wrong key
2. wrong bucket
3. wrong UploadId
caught by:
ret = data_ioctx.stat(chk_oid, nullptr, nullptr);
if (ret < 0) {
  ldpp_dout(this, 0) << __func__ << ": ERROR: ioctx.stat(" << chk_oid
                     << ") returned ret = " << ret << dendl;
  op_ret = -ERR_NO_SUCH_UPLOAD;
  return;
}
(ret will be -2, i.e. -ENOENT)
example:
2021-04-07T16:23:48.823+0300 7ffedeffd700 0 req 44 6.753080368s s3:complete_multipart failed to acquire lock
2021-04-07T16:23:48.823+0300 7ffedeffd700 20 req 44 6.753080368s s3:complete_multipart check already complete multipart: 1c816972-ff50-47f6-9d88-88c87c8b9343.4183.1__multipart_datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4fZ.2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w.1
2021-04-07T16:23:48.824+0300 7ffedeffd700 0 req 44 6.754080296s s3:complete_multipart execute: ERROR: ioctx.stat(1c816972-ff50-47f6-9d88-88c87c8b9343.4183.1__multipart_datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4fZ.2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w.1) returned ret = -2
2021-04-07T16:23:48.824+0300 7ffedeffd700 2 req 44 6.754080296s s3:complete_multipart op status=-2009
2021-04-07T16:23:48.824+0300 7ffedeffd700 2 req 44 6.754080296s s3:complete_multipart http status=404
A 404 is returned above.
Potential false-positive scenarios that are NOT handled correctly yet (related to ETag verification):
1. specifying a wrong ETag returns 200 OK (as noted in the GitHub comments, ETags are not verified)
2. a complete-multipart request with any previously successful UploadId returns 200 OK.
In tests with the Quay client, reproducing the network-error scenario that prevents the 200 OK transmission,
the false-positive scenarios above related to ETag verification do not occur.
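On scenario 1 above: S3 conventionally computes the ETag of a completed multipart upload as the MD5 over the concatenated binary part MD5s, suffixed with the part count, and comparing the client-supplied part ETags against those recorded at upload-part time would catch a wrong ETag. A minimal sketch of such a check (hypothetical helper names; this is not what RGW currently does):

```python
import hashlib

def multipart_etag(part_md5_hexes):
    """S3-style multipart ETag: MD5 over the concatenated binary part
    MD5s, suffixed with '-<number of parts>'."""
    binary = b"".join(bytes.fromhex(h) for h in part_md5_hexes)
    return hashlib.md5(binary).hexdigest() + "-%d" % len(part_md5_hexes)

def parts_match(stored_etags, supplied_etags):
    """Hypothetical verification: the ETags the client supplies in its
    CompleteMultipartUpload body must equal the stored part ETags."""
    return stored_etags == supplied_etags

stored = ["b5cfa9d6c8febd618f91ac2843d50a1c"]  # from the transcript above
print(multipart_etag(stored))                  # 32-hex-digit MD5 + "-1"
print(parts_match(stored, ["00" * 16]))        # False -> wrong ETag rejected
```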
Updated by Mark Kogan over 2 years ago
- Status changed from In Progress to Resolved
Updated by J. Eric Ivancich over 2 years ago
- Related to Bug #53808: Invalid response for CompleteMultipartUpload 500 'InternalError': This multipart completion is already in progress added
Updated by Casey Bodley over 2 years ago
- Status changed from Resolved to Pending Backport
- Tags set to multipart
- Backport set to octopus pacific
Updated by Backport Bot over 2 years ago
- Copied to Backport #53867: octopus: consecutive complete-multipart-upload requests following the 1st (successful) one on the same upload-id result in 500 added
Updated by Backport Bot over 2 years ago
- Copied to Backport #53868: pacific: consecutive complete-multipart-upload requests following the 1st (successful) one on the same upload-id result in 500 added
Updated by Backport Bot over 1 year ago
- Tags changed from multipart to multipart backport_processed
Updated by Konstantin Shalygin over 1 year ago
- Tracker changed from Fix to Bug
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100
- Tags changed from multipart backport_processed to multipart
- Regression set to No
- Severity set to 3 - minor