Bug #64090

closed

RGW S3 signing regression

Added by Anthony D'Atri 3 months ago. Updated about 1 month ago.

Status: Duplicate
Priority: High
Assignee: -
Target version: -
% Done: 0%
Source:
Tags: sigv4
Backport: quincy reef
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have a cluster whose clients include Spark / Hadoop and Mimir.

Initial deployment was with 17.2.5, all was well.

Updated to 17.2.6 without a hitch.

Updated to 17.2.7 and Spark / Hadoop broke with 400 errors.

https://tracker.ceph.com/issues/17520 and https://tracker.ceph.com/issues/18965 are old tickets that would seem to be related.

https://github.com/ceph/ceph/pull/53771 may be causal.

We had to work around this by setting spark.hadoop.fs.s3a.signing-algorithm=S3SignerType


Related issues 1 (1 open, 0 closed)

Is duplicate of rgw - Bug #63153: Uploads by AWS Go SDK v2 fail with XAmzContentSHA256Mismatch when Checksum is requested - Pending Backport - Matt Benjamin

Actions #1

Updated by Anthony D'Atri 3 months ago

  • Description updated (diff)
Actions #2

Updated by Casey Bodley 3 months ago

  • Status changed from New to Need More Info
  • Tags set to sigv4

can you share the exact error code/message for these failures?

can you capture rgw debug logs of the requests? seeing all of the request header names, at least, would help to confirm or rule out https://github.com/ceph/ceph/pull/53771 as the cause

what version of hadoop is this? we have some test coverage of hadoop/s3a using the default signer, but the hadoop versions we test are pretty old (2.9.2 and 3.2.0)

Actions #3

Updated by Anthony D'Atri 3 months ago

The affected Hadoop instance is 3.3 I believe, though 3.1.5 is in use elsewhere.

HTTP error code is 400.

Additional context: https://community.dremio.com/t/use-24-3-and-17-2-7-ceph-clusters-to-configure-distributed-storage-service-amazon-s3-status-code-400-error-code-xamzcontentsha256mismatch/11312

and this looks a lot like https://tracker.ceph.com/issues/18965

```
Exception in thread "main" org.apache.hadoop.fs.s3a.AWSBadRequestException: PUT 0-byte object on MLO-BTO/unified/day=20240117/hour=05/current-auctions/: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 400; Error Code: XAmzContentSHA256Mismatch; Request ID: tx00000742c2973adfbd54a-0065a79878-21bf961-ceph-objectstore; S3 Extended Request ID: 21bf961-ceph-objectstore-ceph-objectstore; Proxy: null), S3 Extended Request ID: 21bf961-ceph-objectstore-ceph-objectstore:XAmzContentSHA256Mismatch: null (Service: Amazon S3; Status Code: 400; Error Code: XAmzContentSHA256Mismatch; Request ID: tx00000742c2973adfbd54a-0065a79878-21bf961-ceph-objectstore; S3 Extended Request ID: 21bf961-ceph-objectstore-ceph-objectstore; Proxy: null)
```

```
[root@k8sintcp1 ~]# k logs rook-ceph-rgw-ceph-objectstore-a-5ddf6786fd-v86tj | grep -a2 -i http_status=400 | head
Defaulted container "rgw" out of: rgw, log-collector, chown-container-data-dir (init)
debug 2024-01-17T14:09:39.023+0000 7f7c836e1700 1 beast: 0x7f7c13969710: 10.108.0.220 - mlo-datalake-prod [17/Jan/2024:14:09:39.020 +0000] "GET /mlo-datalake-prod/?list-type=2&delimiter=%2F&max-keys=2&prefix=MLO-BTO%2Funified%2F&fetch-owner=false HTTP/1.1" 200 529 - "Hadoop 3.3.1, aws-sdk-java/1.11.901 Linux/5.18.15-1.el8.elrepo.x86_64 OpenJDK_64-Bit_Server_VM/11.0.15+10 java/11.0.15 scala/2.12.15 vendor/Oracle_Corporation" - latency=0.003000014s
debug 2024-01-17T14:09:39.038+0000 7f7c726bf700 1 ====== starting new request req=0x7f7c13969710 =====
debug 2024-01-17T14:09:39.038+0000 7f7c726bf700 1 ====== req done req=0x7f7c13969710 op status=-2040 http_status=400 latency=0.000000000s ======
debug 2024-01-17T14:09:39.038+0000 7f7c726bf700 1 beast: 0x7f7c13969710: 10.108.0.220 - mlo-datalake-prod [17/Jan/2024:14:09:39.038 +0000] "PUT /mlo-datalake-prod/MLO-BTO/unified/day%3D20240117/hour%3D10/current-auctions/ HTTP/1.1" 400 273 - "Hadoop 3.3.1, aws-sdk-java/1.11.901 Linux/5.18.15-1.el8.elrepo.x86_64 OpenJDK_64-Bit_Server_VM/11.0.15+10 java/11.0.15 scala/2.12.15 vendor/Oracle_Corporation" - latency=0.000000000s
debug 2024-01-17T14:09:39.642+0000 7f7c95705700 1 ====== starting new request req=0x7f7c13969710 =====
[root@k8sintcp1 ~]# k logs rook-ceph-rgw-ceph-objectstore-a-5ddf6786fd-v86tj | grep -a2 -i 'PUT /mlo-datalake-prod' | tail
Defaulted container "rgw" out of: rgw, log-collector, chown-container-data-dir (init)
-

debug 2024-01-17T14:56:54.735+0000 7f7cf9fce700 1 ====== starting new request req=0x7f7c14581710 =====
debug 2024-01-17T14:56:55.835+0000 7f7cb7749700 1 ====== req done req=0x7f7c14581710 op status=0 http_status=200 latency=1.100005150s ======
debug 2024-01-17T14:56:55.835+0000 7f7cb7749700 1 beast: 0x7f7c14581710: 10.109.0.251 - mlo-datalake-prod [17/Jan/2024:14:56:54.735 +0000] "PUT /mlo-datalake-prod/MLO-Parsed-Auditlog-Ceph-Uploader/processing_id%3D20240117133107/event_timestamp%3D202401171200/.distcp.tmp.attempt_1704430296069_39176_m_000010_0?uploadId=2%7EjaDzHTz9IlnVqWe9FrQqaTyGo8Dbz0X&partNumber=1 HTTP/1.1" 200 67108864 - "User-Agent: APN/1.0 Hortonworks/1.0 HDP/3.1.5.0-152, Hadoop 3.1.1.3.1.5.0-152, aws-sdk-java/1.11.375 Linux/3.10.0-1160.90.1.el7.x86_64 OpenJDK_64-Bit_Server_VM/25.372-b07 java/1.8.0_372" - latency=1.100005150s
debug 2024-01-17T14:56:56.378+0000 7f7c7feda700 1 ====== starting new request req=0x7f7c142fc710 =====
debug 2024-01-17T14:56:56.379+0000 7f7c9c713700 1 ====== req done req=0x7f7c142fc710 op status=0 http_status=200 latency=0.001000004s ======
--
debug 2024-01-17T14:56:56.835+0000 7f7cc976d700 1 ====== starting new request req=0x7f7c14581710 =====
debug 2024-01-17T14:56:58.051+0000 7f7ca6727700 1 ====== req done req=0x7f7c14581710 op status=0 http_status=200 latency=1.216005683s ======
debug 2024-01-17T14:56:58.051+0000 7f7ca6727700 1 beast: 0x7f7c14581710: 10.109.0.251 - mlo-datalake-prod [17/Jan/2024:14:56:56.835 +0000] "PUT /mlo-datalake-prod/MLO-Parsed-Auditlog-Ceph-Uploader/processing_id%3D20240117133107/event_timestamp%3D202401171200/.distcp.tmp.attempt_1704430296069_39176_m_000045_0?uploadId=2%7E3B77n--JrF2pJvm9mcQ9QazXG3U-cxj&partNumber=1 HTTP/1.1" 200 67108864 - "User-Agent: APN/1.0 Hortonworks/1.0 HDP/3.1.5.0-152, Hadoop 3.1.1.3.1.5.0-152, aws-sdk-java/1.11.375 Linux/3.10.0-1160.90.1.el7.x86_64 OpenJDK_64-Bit_Server_VM/25.372-b07 java/1.8.0_372" - latency=1.216005683s
```
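For anyone triaging a similar report, the grep pipeline above can be approximated with a small Python filter that pulls the method, target, and client user agent out of the beast access lines for a given HTTP status. This is only a sketch: the regular expression is inferred from the log lines pasted in this comment, not from an RGW log format specification, and the function name is made up for illustration.

```python
import re

# Pattern inferred from the beast access lines pasted above (an assumption,
# not an official format definition).
BEAST_LINE = re.compile(
    r'beast: \S+ (?P<client>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+) - "(?P<agent>[^"]*)"'
)


def requests_with_status(lines, status=400):
    """Yield the parsed fields of every beast access line with the given status."""
    for line in lines:
        match = BEAST_LINE.search(line)
        if match and int(match.group("status")) == status:
            yield match.groupdict()


# One of the failing lines from the log above, used as sample input.
sample = (
    'debug 2024-01-17T14:09:39.038+0000 7f7c726bf700 1 beast: 0x7f7c13969710: '
    '10.108.0.220 - mlo-datalake-prod [17/Jan/2024:14:09:39.038 +0000] '
    '"PUT /mlo-datalake-prod/MLO-BTO/unified/day%3D20240117/hour%3D10/current-auctions/ HTTP/1.1" '
    '400 273 - "Hadoop 3.3.1, aws-sdk-java/1.11.901 Linux/5.18.15-1.el8.elrepo.x86_64" '
    '- latency=0.000000000s'
)

for req in requests_with_status([sample]):
    print(req["method"], req["target"], req["agent"].split(",")[0])
```

Grouping the results by user agent makes it easier to see which client versions hit the 400s, which is the distinction between the Hadoop 3.3.1 and Hadoop 3.1.1 requests in the excerpt above.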

Actions #4

Updated by Casey Bodley 3 months ago

  • Status changed from Need More Info to New
Actions #5

Updated by Casey Bodley 3 months ago

thanks Anthony,

Status Code: 400; Error Code: XAmzContentSHA256Mismatch

this error code narrows it down, at least. there's only one place in rgw that returns this: https://github.com/ceph/ceph/blob/f4758e5/src/rgw/rgw_op.cc#L1411

that should rule out https://github.com/ceph/ceph/pull/53771, which relates to SignatureDoesNotMatch errors

Matt is working on https://tracker.ceph.com/issues/63153 in https://github.com/ceph/ceph/pull/54856 which includes various fixes related to XAmzContentSHA256Mismatch. maybe this is addressed there? it's hard to be sure without seeing all of the request headers for one of these failing requests

Actions #6

Updated by Casey Bodley 3 months ago

Anthony D'Atri wrote:

Updated to 17.2.6 without a hitch.

Updated to 17.2.7 and Spark / Hadoop broke with 400 errors.

https://github.com/ceph/ceph/pull/53266 was backported to quincy for 17.2.7. that was an optimization for AWSv4ComplMulti which is responsible for calculating these checksums, but i can't spot any issues with that change

Actions #7

Updated by Casey Bodley 3 months ago

  • Priority changed from Normal to High
  • Backport set to quincy reef
  • Regression changed from No to Yes

Casey Bodley wrote:

https://github.com/ceph/ceph/pull/53266 was backported to quincy for 17.2.7. that was an optimization for AWSv4ComplMulti which is responsible for calculating these checksums, but i can't spot any issues with that change

i received a similar report via email which claims that https://github.com/ceph/ceph/pull/53266 introduced a regression for zero-length uploads with STREAMING-AWS4-HMAC-SHA256-PAYLOAD. they're working on a fix
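To make the zero-length case concrete, here is a minimal Python sketch of what a client sends for an empty object under STREAMING-AWS4-HMAC-SHA256-PAYLOAD, following AWS's published SigV4 chunked-upload layout: the body consists of nothing but the terminating zero-size chunk. The secret key and seed signature are placeholders, the helper names are hypothetical, and this does not reproduce RGW's verification code.

```python
import hashlib
import hmac

# SHA-256 of an empty byte string; used in every chunk's string-to-sign.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret: str, date: str, region: str, service: str = "s3") -> bytes:
    """Derive the SigV4 signing key for one date/region/service scope."""
    key = _hmac(("AWS4" + secret).encode("utf-8"), date)
    key = _hmac(key, region)
    key = _hmac(key, service)
    return _hmac(key, "aws4_request")


def chunk_signature(key: bytes, timestamp: str, scope: str,
                    previous_signature: str, chunk: bytes) -> str:
    """Sign one chunk; each signature chains on the previous one."""
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256-PAYLOAD",
        timestamp,
        scope,
        previous_signature,
        EMPTY_SHA256,
        hashlib.sha256(chunk).hexdigest(),
    ])
    return hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


def encode_chunk(chunk: bytes, signature: str) -> bytes:
    """Frame a chunk as hex-size;chunk-signature=<sig> CRLF data CRLF."""
    header = f"{len(chunk):x};chunk-signature={signature}\r\n".encode("ascii")
    return header + chunk + b"\r\n"


# Placeholder inputs: a real client uses its own secret and the signature
# from the request's Authorization header as the seed.
key = signing_key("EXAMPLE-SECRET", "20240326", "default")
seed = "0" * 64
final_sig = chunk_signature(key, "20240326T091425Z",
                            "20240326/default/s3/aws4_request", seed, b"")
body = encode_chunk(b"", final_sig)
print(body)
```

For a 0-byte object the first chunk on the wire is already the final one, so a server that only validates payload hashes while processing data-bearing chunks has nothing to process before the stream ends.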

Actions #8

Updated by Casey Bodley 3 months ago

Anthony D'Atri wrote:

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSBadRequestException: PUT 0-byte object on ...

i see that hadoop was writing an empty object there too

Matt confirmed in https://github.com/ceph/ceph/pull/54856#issuecomment-1911069077 that

The current version of this change does handle the 0-byte upload cases I created.

my hope is that we can backport all of https://github.com/ceph/ceph/pull/54856 to fix this regression

Actions #9

Updated by Casey Bodley 3 months ago

  • Pull request ID set to 54856
Actions #10

Updated by Casey Bodley 3 months ago

  • Is duplicate of Bug #63153: Uploads by AWS Go SDK v2 fail with XAmzContentSHA256Mismatch when Checksum is requested added
Actions #11

Updated by Casey Bodley 3 months ago

  • Status changed from New to Duplicate
  • Pull request ID deleted (54856)
Actions #12

Updated by Anthony D'Atri 3 months ago

Since this did not happen with 17.2.5 and 17.2.6 I consider this a regression, not a duplicate.

Actions #13

Updated by Andrei Neagoe about 1 month ago

Observed similar issues after upgrading from 16.2.14 to 16.2.15.
The MinIO client throws an error when trying to upload an empty file:

minioc: <ERROR> Failed to copy `/home/andrei/testzerofile`. The provided 'x-amz-content-sha256' header does not match what was computed.

Likewise, with Hadoop 3.3.4 and Delta Lake 2.4.0, we get the following:

org.apache.hadoop.fs.s3a.AWSBadRequestException: PUT 0-byte object  on test/_delta_log: com.amazonaws.services.s3.model.AmazonS3Exception: XAmzContentSHA256Mismatch (Service: Amazon S3; Status Code: 400; Error Code: XAmzContentSHA256Mismatch; Request ID: tx00000c6dffb399a50679a-006602331a-cf9e5b4-default; S3 Extended Request ID: cf9e5b4-default-default; Proxy: null), S3 Extended Request ID: cf9e5b4-default-default:XAmzContentSHA256Mismatch: XAmzContentSHA256Mismatch (Service: Amazon S3; Status Code: 400; Error Code: XAmzContentSHA256Mismatch; Request ID: tx00000c6dffb399a50679a-006602331a-cf9e5b4-default; S3 Extended Request ID: cf9e5b4-default-default; Proxy: null)

Actions #14

Updated by Andrei Neagoe about 1 month ago

I was able to run minio client with debug output to show the headers:

[andrei@andrei-nb:~]$ minioc --debug cp testzerofile ceph_adm/debug-bucket/
minioc: <DEBUG> GET /debug-bucket/?location= HTTP/1.1
Host: s3.example.com
User-Agent: MinIO (linux; amd64) minio-go/v7.0.67 minioc/RELEASE.2024-03-20T21-07-29Z
Accept-Encoding: zstd,gzip
Authorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240326/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20240326T091424Z

minioc: <DEBUG> HTTP/1.1 200 OK
Content-Length: 134
Date: Tue, 26 Mar 2024 09:14:24 GMT
X-Amz-Request-Id: tx000006f2af6881e949595-00660291f0-d01af68-default

minioc: <DEBUG> Response Time: 91.234073ms

minioc: <DEBUG> GET /debug-bucket/?object-lock= HTTP/1.1
Host: s3.example.com
User-Agent: MinIO (linux; amd64) minio-go/v7.0.67 minioc/RELEASE.2024-03-20T21-07-29Z
Accept-Encoding: zstd,gzip
Authorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240326/default/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20240326T091424Z

minioc: <DEBUG> HTTP/1.1 404 Not Found
Content-Length: 271
Accept-Ranges: bytes
Content-Type: application/xml
Date: Tue, 26 Mar 2024 09:14:24 GMT
X-Amz-Request-Id: tx00000f1ec9ced82b6af70-00660291f0-d01af7a-default

<?xml version="1.0" encoding="UTF-8"?><Error><Code>ObjectLockConfigurationNotFoundError</Code><Message></Message><BucketName>debug-bucket</BucketName><RequestId>tx00000f1ec9ced82b6af70-00660291f0-d01af7a-default</RequestId><HostId>d01af7a-default-default</HostId></Error>

minioc: <DEBUG> Response Time: 19.155641ms

minioc: <DEBUG> HEAD /debug-bucket/ HTTP/1.1
Host: s3.example.com
User-Agent: MinIO (linux; amd64) minio-go/v7.0.67 minioc/RELEASE.2024-03-20T21-07-29Z
Authorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240326/default/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20240326T091425Z

minioc: <DEBUG> HTTP/1.1 200 OK
Date: Tue, 26 Mar 2024 09:14:25 GMT
X-Amz-Request-Id: tx00000083b068995688a56-00660291f1-cfae528-default
X-Rgw-Bytes-Used: 7950
X-Rgw-Object-Count: 1
X-Rgw-Quota-Bucket-Objects: -1
X-Rgw-Quota-Bucket-Size: -1
X-Rgw-Quota-Max-Buckets: 1000
X-Rgw-Quota-User-Objects: -1
X-Rgw-Quota-User-Size: -1
Content-Length: 0

minioc: <DEBUG> Response Time: 19.914462ms

minioc: <DEBUG> PUT /debug-bucket/testzerofile HTTP/1.1
Host: s3.example.com
User-Agent: MinIO (linux; amd64) minio-go/v7.0.67 minioc/RELEASE.2024-03-20T21-07-29Z
Transfer-Encoding: chunked
Accept-Encoding: zstd,gzip
Authorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240326/default/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-decoded-content-length,Signature=**REDACTED**
Content-Type: application/octet-stream
X-Amz-Content-Sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD
X-Amz-Date: 20240326T091425Z
X-Amz-Decoded-Content-Length: 0

minioc: <DEBUG> HTTP/1.1 400 Bad Request
Content-Length: 260
Accept-Ranges: bytes
Content-Type: application/xml
Date: Tue, 26 Mar 2024 09:14:25 GMT
X-Amz-Request-Id: tx00000b0bd45c1f16b328e-00660291f1-cfae516-default

<?xml version="1.0" encoding="UTF-8"?><Error><Code>XAmzContentSHA256Mismatch</Code><Message></Message><BucketName>debug-bucket</BucketName><RequestId>tx00000b0bd45c1f16b328e-00660291f1-cfae516-default</RequestId><HostId>cfae516-default-default</HostId></Error>

minioc: <DEBUG> Response Time: 20.747667ms

minioc: <ERROR> Failed to copy `/home/andrei/testzerofile`. The provided 'x-amz-content-sha256' header does not match what was computed.
 (3) cp-main.go:626 cmd.doCopySession(..) Tags: [/home/andrei/testzerofile]
 (2) common-methods.go:570 cmd.uploadSourceToTargetURL(..) Tags: [/home/andrei/testzerofile]
 (1) common-methods.go:274 cmd.putTargetStream(..) Tags: [ceph_adm, http://s3.example.com/debug-bucket/testzerofile]
 (0) client-s3.go:1232 cmd.(*S3Client).Put(..)
 Release-Tag:RELEASE.2024-03-20T21-07-29Z | Commit:9043bbf545d2 | Host:andrei-nb.example.com | OS:linux | Arch:amd64 | Lang:go1.21.8 | Mem:5.9 MiB/19 MiB | Heap:5.9 MiB/11 MiB
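
The difference between the requests that succeed and the one that fails in the trace above is how the payload is signed. As a rough diagnostic sketch (the function names and mode labels are invented here, and the classification reflects only what this trace shows), the captured headers can be sorted like this:

```python
# SHA-256 of an empty body, as sent on the GET and HEAD requests above.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
STREAMING = "STREAMING-AWS4-HMAC-SHA256-PAYLOAD"


def parse_headers(raw: str) -> dict:
    """Parse 'Name: value' lines into a dict with lower-cased names."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # the request line has no colon and is skipped
            headers[name.strip().lower()] = value.strip()
    return headers


def payload_mode(headers: dict) -> str:
    """Label how the request body is signed (labels are illustrative only)."""
    sha = headers.get("x-amz-content-sha256", "")
    if sha == STREAMING:
        if headers.get("x-amz-decoded-content-length") == "0":
            return "streaming-empty"  # the request that got the 400 above
        return "streaming"
    if sha == EMPTY_SHA256:
        return "signed-empty"
    return "signed"


failing_put = parse_headers(
    "PUT /debug-bucket/testzerofile HTTP/1.1\n"
    "Host: s3.example.com\n"
    "Transfer-Encoding: chunked\n"
    "X-Amz-Content-Sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD\n"
    "X-Amz-Decoded-Content-Length: 0\n"
)
print(payload_mode(failing_put))
```

Only the PUT combines the streaming payload marker with a decoded content length of 0; the GET and HEAD requests carry the literal empty-body hash instead.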
