Project

General

Profile

Fix #50141

consecutive complete-multipart-upload requests following the 1st (successful) one on the same upload-id result in 500

Added by Mark Kogan about 1 year ago. Updated 4 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Support
Tags:
multipart
Backport:
octopus pacific
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

when the 200 OK of a complete_multipart request gets lost in network transmission to the client (quay geo-replicated zone),
the client will attempt re-sending complete_multipart requests which will result in 500 response,
until after 3 attempts it will disable the RGW storage backend and stop functioning.

rgw log:

2021-03-06 03:08:35.173315 7f87b915b700  0 RGWCompleteMultipart::execute() failed to acquire lock 
2021-03-06 03:08:35.173341 7f87b915b700  2 req 17767:0.001860:s3:POST /xxx-registry/sha256/14/142835ba5a0abf568fdde5490fae18b16ca4eca56a66e45136b7065428564e63:complete_multipart:completing
2021-03-06 03:08:35.173426 7f87b915b700  2 req 17767:0.001945:s3:POST /xxx-registry/sha256/14/142835ba5a0abf568fdde5490fae18b16ca4eca56a66e45136b7065428564e63:complete_multipart:op status=-2200
2021-03-06 03:08:35.173437 7f87b915b700  2 req 17767:0.001956:s3:POST /xxx-registry/sha256/14/142835ba5a0abf568fdde5490fae18b16ca4eca56a66e45136b7065428564e63:complete_multipart:http status=500
2021-03-06 03:08:35.173443 7f87b915b700  1 ====== req done req=0x7f87b9154f70 op status=-2200 http_status=500 ======


Related issues

Related to rgw - Bug #53808: Invalid response for CompleteMultipartUpload 500 'InternalError': This multipart completion is already in progress Resolved
Copied to rgw - Backport #53867: octopus: consecutive complete-multipart-upload requests following the 1st (successful) one on the same upload-id result in 500 In Progress
Copied to rgw - Backport #53868: pacific: consecutive complete-multipart-upload requests following the 1st (successful) one on the same upload-id result in 500 In Progress

History

#1 Updated by Matt Benjamin about 1 year ago

  • Pull request ID set to 40594

#2 Updated by Mark Kogan about 1 year ago

reproducing below a flow that is similar to a Quay with geo-replication zone configuration using aws cli:

1st multipart-complete:

dd if=/dev/zero of=container.img bs=1M count=4

aws --endpoint-url http://localhost:8000 s3 mb s3://alln-registry

aws --endpoint-url http://localhost:8000 s3api create-multipart-upload --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' | jq
{
    "Bucket": "alln-registry",
    "Key": "datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f",
    "UploadId": "2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w" 
}

export UPID='2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w'

aws --endpoint-url http://localhost:8000 s3api upload-part --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' --part-number 1 --body ./container.img --upload-id "${UPID}" 
{
    "ETag": "\"b5cfa9d6c8febd618f91ac2843d50a1c\"" 
}

aws --endpoint-url http://localhost:8000 s3api list-parts --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' --upload-id "${UPID}" 
{
  "Parts": [
    {
      "PartNumber": 1,
      "LastModified": "2021-04-07T11:50:59.516Z",
      "ETag": "\"b5cfa9d6c8febd618f91ac2843d50a1c\"",
      "Size": 4194304
    }
  ],
  "Initiator": null,
  "Owner": {
    "DisplayName": "",
    "ID": "" 
  },
  "StorageClass": "STANDARD" 
}

cat <<_EOF > fileparts.json
{
    "Parts": [{
        "ETag": "b5cfa9d6c8febd618f91ac2843d50a1c",
        "PartNumber":1
    }]
}
_EOF

aws --endpoint-url http://localhost:8000 s3api complete-multipart-upload --multipart-upload file://fileparts.json --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' --upload-id "${UPID}" 

in gdb breakpoint at src/rgw/rgw_op.cc : RGWCompleteMultipart::execute()

serializer = meta_obj->get_serializer("RGWCompleteMultipart");
  op_ret = serializer->try_lock(dur, y);
Thread 99 "civetweb-worker" hit Breakpoint 2, RGWCompleteMultipart::execute (this=0x7ffed0016640, y=...) at ../src/rgw/rgw_op.cc:5956
5956      if (op_ret < 0) {
(gdb) p op_ret
$1 = 0

on the 1st multipart-complete request the execution flows as previously and will not enter new flow (does not check for the existence of the first part object before multipart completes)

2nd multipart-complete - when the 1st complete-multipart 200 OK gets lost in transmission (network problem occured), Quay will re-send a complete-multipart request:

aws --endpoint-url http://localhost:8000 s3api complete-multipart-upload --multipart-upload file://fileparts.json --bucket alln-registry --key 'datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f' --upload-id "${UPID}" 

in gdb breakpoint at src/rgw/rgw_op.cc : RGWCompleteMultipart::execute()
serializer = meta_obj->get_serializer("RGWCompleteMultipart");
  op_ret = serializer->try_lock(dur, y);
Thread 99 "civetweb-worker" hit Breakpoint 2, RGWCompleteMultipart::execute (this=0x7ffed0016640, y=...) at ../src/rgw/rgw_op.cc:5956      
5956      if (op_ret < 0) {      
(gdb) p op_ret      
$2 = -2

on this scenario the new flow will be executed.

rgw log:

2021-04-07T16:32:12.592+0300 7ffedeffd700  0 req 47 3.161037683s s3:complete_multipart failed to acquire lock
2021-04-07T16:32:12.592+0300 7ffedeffd700 20 req 47 3.161037683s s3:complete_multipart check already complete multipart: 1c816972-ff50-47f6-9d88-88c87c8b9343.4183.1__multipart_datastorage/registry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4f.2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w.1
2021-04-07T16:32:12.593+0300 7ffedeffd700  2 req 47 3.162037611s s3:complete_multipart completing
2021-04-07T16:32:12.594+0300 7ffedeffd700 20 req 47 3.163037539s get_system_obj_state: rctx=0x7ffedeff8470 obj=default.rgw.log:script.postrequest. state=0x7fffe40e10a0 s->prefetch_data=0
2021-04-07T16:32:12.594+0300 7ffedeffd700 20 cache get: touching lru, lru_counter=59 promotion_ts=55
2021-04-07T16:32:12.594+0300 7ffedeffd700 10 moving default.rgw.log++script.postrequest. to cache LRU end
2021-04-07T16:32:12.594+0300 7ffedeffd700 10 cache get: name=default.rgw.log++script.postrequest. : hit (negative entry)
2021-04-07T16:32:12.594+0300 7ffedeffd700  2 req 47 3.163037539s s3:complete_multipart op status=1901
2021-04-07T16:32:12.594+0300 7ffedeffd700  2 req 47 3.163037539s s3:complete_multipart http status=202
2021-04-07T16:32:12.594+0300 7ffedeffd700  1 ====== req done req=0x7ffedeffa5a0 op status=1901 http_status=202 latency=3.163037539s ======

Potential false-positive scenarios that are handled correctly:
1. wrong key
2. wring bucket
3. wrong UploadId

caught by:

ret = data_ioctx.stat(chk_oid, nullptr, nullptr);
ioctx.stat() ret = " << ret << std::endl;
      if (ret < 0) {
        ldpp_dout(this, 0) << __func__ << ": ERROR: ioctx.stat(" << chk_oid << ") returned ret = " << ret << dendl;
        op_ret = -ERR_NO_SUCH_UPLOAD;
        return;

(ret will be = -2)

example:

2021-04-07T16:23:48.823+0300 7ffedeffd700  0 req 44 6.753080368s s3:complete_multipart failed to acquire lock
2021-04-07T16:23:48.823+0300 7ffedeffd700 20 req 44 6.753080368s s3:complete_multipart check already complete multipart: 1c816972-ff50-47f6-9d88-88c87c8b9343.4183.1__multipart_datastorage/re
gistry/sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4fZ.2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w.1
2021-04-07T16:23:48.824+0300 7ffedeffd700  0 req 44 6.754080296s s3:complete_multipart execute: ERROR: ioctx.stat(1c816972-ff50-47f6-9d88-88c87c8b9343.4183.1__multipart_datastorage/registry/
sha256/07/07680f8abcbce9cff4b73de2d6ce0aa9d65f35554da0f856fbd4cb4a4492dd4fZ.2~zVVHC7y1pqXH0zbp2_bCcZa61teKt8w.1) returned ret = -2

2021-04-07T16:23:48.824+0300 7ffedeffd700  2 req 44 6.754080296s s3:complete_multipart op status=-2009
2021-04-07T16:23:48.824+0300 7ffedeffd700  2 req 44 6.754080296s s3:complete_multipart http status=404

an 404 is returned above.

Potential false-positive scenarios that are NOT handled correctly yet (related to Etag verification):
1. specifying a wrong ETag returns 200 OK (as noted in the github comments ETags are not verified)
2. a complete-multipart request with any previously successful UploadId will returns 200 OK.

In tests with the Quay client, reproducing the network error scenario preventing the 200 OK transmition,
the false-positive scenarios above related to Etag verificationdo do not occur.

#3 Updated by Mark Kogan 8 months ago

  • Status changed from In Progress to Resolved

#4 Updated by J. Eric Ivancich 4 months ago

  • Related to Bug #53808: Invalid response for CompleteMultipartUpload 500 'InternalError': This multipart completion is already in progress added

#5 Updated by Casey Bodley 4 months ago

  • Status changed from Resolved to Pending Backport
  • Tags set to multipart
  • Backport set to octopus pacific

#6 Updated by Backport Bot 4 months ago

  • Copied to Backport #53867: octopus: consecutive complete-multipart-upload requests following the 1st (successful) one on the same upload-id result in 500 added

#7 Updated by Backport Bot 4 months ago

  • Copied to Backport #53868: pacific: consecutive complete-multipart-upload requests following the 1st (successful) one on the same upload-id result in 500 added

Also available in: Atom PDF