Project

General

Profile

Actions

Bug #45013

closed

Nautilus: Intermittent dropped connections with iterate_obj() failed

Added by Alex Walender about 4 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Or Friedmann
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

Our cluster heavily uses S3 and we notice intermittent dropped connections in HAProxy error logs:

Mar 31 13:36:21 xxx haproxy[30442]: xxx.xx.xx.14:43836 [31/Mar/2020:13:36:08.606] rgw_8080_public~ rgw_8080_public/xxx 0/0/0/165/13022 200 2832351577 - - CD-- 1577/2/0/0/0 0/0 "GET /s3debug/file.txt HTTP/1.1

Corresponding to this, we find errors in rgw logs:

2020-04-01 14:25:54.043 7f7892a51700 0 iterate_obj() failed with -5

2020-04-01 14:25:54.043 7f7892a51700 1 ====== req done req=0x7f7892a4a8d0 op status=-5 http_status=200 latency=9.82881s ======

Sometimes the latency even reaches 100+ seconds. I'm not sure if this is really a rgw or a librados issue.
When this error occurs, the client (minio-client or s3cmd) reports that the peer has dropped the connection.
It seems that this randomly happens when downloading data. I made a script, which downloads the same file over and over again. A few times per hour this error occurs.

The ceph cluster itself is in a healthy state. Additionally, we found dropped packages in our infrastructure. Fixing this however did not resolve this described issue here.

RGW-Config:

[client.rgw.xxx]
rgw_frontends = "beast endpoint=xxx:8080" 

rgw keystone api version = 3
rgw keystone url = http://XXXX:5000
rgw keystone admin user = XXX
rgw keystone admin password = XXX
rgw keystone admin domain = default
rgw keystone admin project = service
rgw keystone accepted roles = admin, _member_, user
rgw keystone token cache size = 1000
rgw keystone revocation interval = 300
rgw s3 auth use keystone = true
rgw keystone verify ssl = false

Actions #1

Updated by Or Friedmann about 4 years ago

  • Assignee set to Or Friedmann
Actions #2

Updated by Or Friedmann about 4 years ago

Hi,

Can you please increase rgw debug level:

debug_rgw 20
debug_ms 1

and publish the logs?

Thank you

Actions #3

Updated by Or Friedmann about 4 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Janek Bevendorff about 3 years ago

I am having the same or a similar issue and I don't quite know what's going on here.

The issue is reproducible only if I am accessing multiple files at the same time in parallel. And it seems like it really has to be MULTIPLE files, not several times the same file. Downloading one individual file is (almost) never an issue.

Here is what works:

- downloading an individual file using s3cmd or Python Boto3
- downloading the same file via Boto3 in several threads

And here is what doesn't:

- reading many files in a bucket via Boto3 at the same time (I can reproduce it with a Spark job with as few as 16)

The strangest thing is that it always affects the same files at about (but not exactly) the same byte offset (~ 3.5MB). At the moment, I have two files with which I can reproduce it (somewhat) reliably. Both are single-part objects (WARCs ~1GB in size).

Here is the verbose RadosGW log for the first object:

Feb 18 12:55:54 XXX radosgw[3894994]: 2021-02-18T12:55:54.249+0100 7f10bee25700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:55:54 XXX radosgw[3894994]: 2021-02-18T12:55:54.465+0100 7f11aa7fc700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:55:54 XXX radosgw[3894994]: 2021-02-18T12:55:54.793+0100 7f1176794700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:55:55 XXX radosgw[3894994]: 2021-02-18T12:55:55.385+0100 7f1122eed700  1 failed to read header: The socket was closed due to a timeout
Feb 18 12:55:55 XXX radosgw[3894994]: 2021-02-18T12:55:55.385+0100 7f11967d4700  5 failed to write response: Bad file descriptor
Feb 18 12:55:55 XXX radosgw[3894994]: 2021-02-18T12:55:55.385+0100 7f11967d4700  1 ====== req done http_status=400 ======
Feb 18 12:55:55 XXX radosgw[3894994]: 2021-02-18T12:55:55.489+0100 7f113c720700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:55:56 XXX radosgw[3894994]: 2021-02-18T12:55:56.593+0100 7f110e6c4700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.201+0100 7f1141f2b700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700  4 write_data failed: The socket was closed due to a timeout
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 -1 handle_data failed with exit code -1
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 -1 handle_data failed with exit code -1
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700  0 iterate_obj() failed with -1
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700  2 req 233064 65.028426619s s3:get_obj completing
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700  2 req 233064 65.028426619s s3:get_obj op status=-1
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700  2 req 233064 65.028426619s s3:get_obj http status=200
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700  1 ====== req done req=0x7f12b0294670 op status=-1 http_status=200 latency=65.028426619s ======
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700  1 beast: 0x7f12b0294670: XXX- - [2021-02-18T12:55:57.229123+0100] "GET /corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00002.warc.gz HTTP/1.1" 200 0 - "Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource" -
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 20 failed to read header: Bad file descriptor
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700  4 write_data failed: The socket was closed due to a timeout
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 -1 handle_data failed with exit code -1
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 -1 handle_data failed with exit code -1
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700  0 iterate_obj() failed with -1
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700  2 req 233074 65.020426566s s3:get_obj completing
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700  2 req 233074 65.020426566s s3:get_obj op status=-1
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700  2 req 233074 65.020426566s s3:get_obj http status=200
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700  1 ====== req done req=0x7f12a0572670 op status=-1 http_status=200 latency=65.020426566s ======
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700  1 beast: 0x7f12a0572670: XXX - - [2021-02-18T12:55:57.257123+0100] "GET /corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00009.warc.gz HTTP/1.1" 200 0 - "Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource" -
Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 20 failed to read header: Bad file descriptor
Feb 18 12:55:59 XXX radosgw[3894994]: 2021-02-18T12:55:59.497+0100 7f1169f7b700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:56:00 XXX radosgw[3894994]: 2021-02-18T12:56:00.345+0100 7f10d7e57700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:56:00 XXX radosgw[3894994]: 2021-02-18T12:56:00.885+0100 7f1180fa9700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:56:02 XXX radosgw[3894994]: 2021-02-18T12:56:02.009+0100 7f118e7c4700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:56:03 XXX radosgw[3894994]: 2021-02-18T12:56:03.353+0100 7f10c6634700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:56:08 XXX radosgw[3894994]: 2021-02-18T12:56:08.201+0100 7f10e8e79700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:56:08 XXX radosgw[3894994]: 2021-02-18T12:56:08.853+0100 7f1283fe7700  2 RGWDataChangesLog::ChangesRenewThread: start
Feb 18 12:56:10 XXX radosgw[3894994]: 2021-02-18T12:56:10.421+0100 7f111f6e6700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:56:11 XXX radosgw[3894994]: 2021-02-18T12:56:11.161+0100 7f117378e700 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
Feb 18 12:56:11 XXX radosgw[3894994]: 2021-02-18T12:56:11.161+0100 7f117378e700 20 RGWObjManifest::operator++(): result: ofs=29245440 stripe_ofs=29245440 part_ofs=0 rule->part_size=0
Feb 18 12:56:11 XXX radosgw[3894994]: 2021-02-18T12:56:11.161+0100 7f117378e700 20 rados->get_obj_iterate_cb oid=6112a2c2-2570-4c6a-8e57-709633df12eb.297104279.2__shadow_.sF_tSct0S_Fa22p35db9JlAXRFUrhmK_7 obj-ofs=29245440 read_ofs=0 len=4177920
Feb 18 12:56:11 XXX radosgw[3894994]: 2021-02-18T12:56:11.161+0100 7f117378e700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_ACCEPT_ENCODING=identity
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_AUTHORIZATION=XXX
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_HOST=XXX:7480
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_USER_AGENT=Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_VERSION=1.1
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_X_AMZ_CONTENT_SHA256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_X_AMZ_DATE=20210218T115613Z
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 REMOTE_ADDR=XXX
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 REQUEST_METHOD=GET
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 REQUEST_URI=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 SCRIPT_URI=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz
Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 SERVER_PORT=7480

and here's the error for the second one:

Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.343+0100 7f10c362e700  1 ====== req done req=0x7f12a05f3670 op status=-1 http_status=200 latency=65.060426476s ======
Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.343+0100 7f10c362e700  1 beast: 0x7f12a05f3670: XXX - - [2021-02-18T12:52:58.343950+0100] "GET /corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00002.warc.gz HTTP/1.1" 200 0 - "Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource" -
Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.343+0100 7f10c362e700 20 failed to read header: Bad file descriptor
Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.451+0100 7f10d4650700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.695+0100 7f10c9e3b700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:00 XXX radosgw[3894994]: 2021-02-18T12:53:00.059+0100 7f11116ca700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:01 XXX radosgw[3894994]: 2021-02-18T12:53:01.375+0100 7f10cb63e700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:03 XXX radosgw[3894994]: 2021-02-18T12:53:03.039+0100 7f115a75c700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:03 XXX radosgw[3894994]: 2021-02-18T12:53:03.599+0100 7f10bd622700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:07 XXX radosgw[3894994]: 2021-02-18T12:53:07.304+0100 7f11076b6700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:11 XXX radosgw[3894994]: 2021-02-18T12:53:11.964+0100 7f11a77f6700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:12 XXX radosgw[3894994]: 2021-02-18T12:53:12.852+0100 7f1283fe7700  2 RGWDataChangesLog::ChangesRenewThread: start
Feb 18 12:53:13 XXX radosgw[3894994]: 2021-02-18T12:53:13.520+0100 7f1123eef700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:14 XXX radosgw[3894994]: 2021-02-18T12:53:14.412+0100 7f119b7de700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.436+0100 7f10ca63c700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_ACCEPT_ENCODING=identity
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_AUTHORIZATION=XXXX
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_HOST=XXX:7480
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_USER_AGENT=Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_VERSION=1.1
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_X_AMZ_CONTENT_SHA256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_X_AMZ_DATE=20210218T115315Z
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 REMOTE_ADDR=XXX
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 REQUEST_METHOD=GET
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 REQUEST_URI=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 SCRIPT_URI=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 SERVER_PORT=7480
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700  1 ====== starting new request req=0x7f12b833d670 =====
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700  2 req 233058 0s initializing for trans_id = tx000000000000000038e62-00602e552b-15a727be-XXX
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 rgw api priority: s3=8 s3website=7
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 host=XXX
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 subdomain= domain=XXX in_hosted_domain=1 in_hosted_domain_s3website=0
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 final domain/bucket subdomain= domain=XXX in_hosted_domain=1 in_hosted_domain_s3website=0 s->info.domain=XXX s->info.request_uri=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 meta>> HTTP_X_AMZ_CONTENT_SHA256
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 meta>> HTTP_X_AMZ_DATE
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 x>> x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 x>> x-amz-date:20210218T115315Z
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s get_handler handler=22RGWHandler_REST_Obj_S3
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 handler=22RGWHandler_REST_Obj_S3
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700  2 req 233058 0s getting op 0
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 req 233058 0s s3:get_obj scheduling with dmclock client=2 cost=1
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 op=21RGWGetObj_ObjStore_S3
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700  2 req 233058 0s s3:get_obj verifying requester
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::StrategyRegistry::s3_main_strategy_t: trying rgw::auth::s3::AWSAuthStrategy
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::AWSAuthStrategy: trying rgw::auth::s3::S3AnonymousEngine
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::S3AnonymousEngine denied with reason=-1
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::AWSAuthStrategy: trying rgw::auth::s3::LocalEngine
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 v4 signature format = e2fc99164bbcd9c439c7841ab25f678821ce6bb9cee8938b23e9f74b74af3a68
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 v4 credential format = XXX/20210218/us-east-1/s3/aws4_request
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 access key id = XXX
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 credential scope = 20210218/us-east-1/s3/aws4_request
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 req 233058 0s canonical headers format = host:XXX:7480
                XXXXXXXXXXXXXXXXXXXXXXX
                XXXXXXXXXXXXXXXXXXXXXXX
                XXXXXXXXXXXXXXXXXXXXXXX
                XXXXXXXXXXXXXXXXXXXXXXX
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 date_k    = c5d419e31f0abe26c7595735635be7bc83dfbf5b97f0c2cbfa20875853611e21
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 region_k  = 56d1c1befda3c48b5dc455677b758eb2d17ee845d7def87997a24aff88a7df61
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 service_k = 5874b3496d084572aad5ea9917212d828b234b37e53c3c7d0d60e70956fe301d
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 signing_k = 328889a62e5082ac25d5a2e782697b8ed705faf400668a82384092f659663e45
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 generated signature = e2fc99164bbcd9c439c7841ab25f678821ce6bb9cee8938b23e9f74b74af3a68
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 req 233058 0s s3:get_obj string_to_sign=AWS4-HMAC-SHA256
                                              20210218T115315Z
                                              20210218/us-east-1/s3/aws4_request
                                              5b961df2c7c84779d8b877e6b2c3b81a6df454127677eba711e6acb79fc0e074
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 req 233058 0s s3:get_obj server signature=e2fc99164bbcd9c439c7841ab25f678821ce6bb9cee8938b23e9f74b74af3a68
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 req 233058 0s s3:get_obj client signature=e2fc99164bbcd9c439c7841ab25f678821ce6bb9cee8938b23e9f74b74af3a68
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 req 233058 0s s3:get_obj compare=0
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::LocalEngine granted access
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::AWSAuthStrategy granted access
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700  2 req 233058 0s s3:get_obj normalizing buckets and tenants
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 s->object=crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz s->bucket=corpus-commoncrawl-main-2020-34
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700  2 req 233058 0s s3:get_obj init permissions
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_system_obj_state: rctx=0x7f12b833ab48 obj=XXX.rgw.meta:root:corpus-commoncrawl-main-2020-34 state=0x7f105c05d2d0 s->prefetch_data=0
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 cache get: name=XXX.rgw.meta+root+corpus-commoncrawl-main-2020-34 : hit (requested=0x16, cached=0x17)
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_system_obj_state: s->obj_tag was set empty
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 cache get: name=XXX.rgw.meta+root+corpus-commoncrawl-main-2020-34 : hit (requested=0x11, cached=0x17)
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 decode_policy Read AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>XXX-corpora</ID><DisplayName>XXX Corpora</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>XXX-corpora</ID><DisplayName>XXX Corpora</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_system_obj_state: rctx=0x7f12b833af88 obj=XXX.rgw.meta:users.uid:XXX-corpora state=0x7f105c05d2d0 s->prefetch_data=0
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 cache get: name=XXX.rgw.meta+users.uid+XXX-corpora : hit (requested=0x6, cached=0x17)
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_system_obj_state: s->obj_tag was set empty
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 Read xattr: user.rgw.idtag
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 cache get: name=XXX.rgw.meta+users.uid+XXX-corpora : hit (requested=0x3, cached=0x17)
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700  2 req 233058 0s s3:get_obj recalculating target
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700  2 req 233058 0s s3:get_obj reading permissions
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_obj_state: rctx=0x7f12b833c290 obj=corpus-commoncrawl-main-2020-34:crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz state=0x7f105c094058 s->prefetch_data=1
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 WARNING: blocking librados call
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 10 manifest: total_size = 897189304
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 get_obj_state: setting s->obj_tag to 6112a2c2-2570-4c6a-8e57-709633df12eb.297104261.793585
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 15 decode_policy Read AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>XXX-corpora</ID><DisplayName>XXX Corpora</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>XXX-corpora</ID><DisplayName>XXX Corpora</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  2 req 233058 0.020000132s s3:get_obj init op
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  2 req 233058 0.020000132s s3:get_obj verifying op mask
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 req 233058 0.020000132s s3:get_obj required_mask= 1 user.op_mask=7
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  2 req 233058 0.020000132s s3:get_obj verifying op permissions
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 req 233058 0.020000132s s3:get_obj -- Getting permissions begin with perm_mask=49
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  5 req 233058 0.020000132s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=XXX-corpora, acct_name=XXX Corpora, subuser=, perm_mask=15, is_admin=0) mask=49
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  5 Searching permissions for uid=XXX-corpora
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  5 Found permission: 15
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  5 Searching permissions for group=1 mask=49
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  5 Permissions for group not found
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  5 Searching permissions for group=2 mask=49
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  5 Permissions for group not found
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  5 req 233058 0.020000132s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=XXX-corpora, acct_name=XXX Corpora, subuser=, perm_mask=15, is_admin=0), owner=XXX-corpora, perm=1
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 10 req 233058 0.020000132s s3:get_obj  identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=XXX-corpora, acct_name=XXX Corpora, subuser=, perm_mask=15, is_admin=0) requested perm (type)=1, policy perm=1, user_perm_mask=15, acl perm=1
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  2 req 233058 0.020000132s s3:get_obj verifying op params
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  2 req 233058 0.020000132s s3:get_obj pre-executing
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700  2 req 233058 0.020000132s s3:get_obj executing
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 get_obj_state: rctx=0x7f12b833c290 obj=corpus-commoncrawl-main-2020-34:crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz state=0x7f105c094058 s->prefetch_data=1
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.acl
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.compression
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.content_type
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.etag
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.idtag
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.manifest
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.pg_ver
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.source_zone
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.tail_tag
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.x-amz-content-sha256
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.x-amz-date
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.x-amz-decoded-content-length
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.804+0100 7f118dfc3700 15 Encryption mode:
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.804+0100 7f118dfc3700 20 get_obj_state: rctx=0x7f12b833c290 obj=corpus-commoncrawl-main-2020-34:crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz state=0x7f105c094058 s->prefetch_data=1
Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.804+0100 7f118dfc3700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 12:53:16 XXX radosgw[3894994]: 2021-02-18T12:53:16.660+0100 7f10fa69c700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920

The error messages are different, but in Boto3, they both result in the same IncompleteReadError exception with the following message:

IncompleteReadError: 3674448 read, but total bytes expected is 930282665.
IncompleteReadError: 3322872 read, but total bytes expected is 908263506.

As I said, the byte offsets differ every time, but they are always about 3.5MB.

Downloading said files individually is no problem at all and the log is simply filled with a bunch of these okay-looking status messages:

Feb 18 13:06:45 XXX radosgw[3894994]: 2021-02-18T13:06:45.513+0100 7f1110ec9700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 13:06:45 XXX radosgw[3894994]: 2021-02-18T13:06:45.557+0100 7f10dce61700 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
Feb 18 13:06:45 XXX radosgw[3894994]: 2021-02-18T13:06:45.557+0100 7f10dce61700 20 RGWObjManifest::operator++(): result: ofs=296632320 stripe_ofs=296632320 part_ofs=0 rule->part_size=0
Feb 18 13:06:45 XXX radosgw[3894994]: 2021-02-18T13:06:45.557+0100 7f10dce61700 20 rados->get_obj_iterate_cb oid=6112a2c2-2570-4c6a-8e57-709633df12eb.297104279.2__shadow_.B3JcFV1w2DcBXfGlYXWF6nPQfnXaIVs_71 obj-ofs=296632320 read_ofs=0 len=4177920

Any ideas what might be the reason? From what I can tell, the issue seems to be load- or latency-induced to some extent. At least the first file consistently gets a timeout error somewhere. There may be a knob to turn for this. The message for the second file looks more serious, but I can also download that file individually without issues.

And here's the thing that makes the least sense: I seem to be able to reproduce it only with a Python 3.6 client and not with a Python 3.8 client (same Boto3 version). At least I am not getting an exception. However, I do see a few of these messages scroll by in the RadosGW log:

Feb 18 13:30:45 XXX radosgw[3894994]: 2021-02-18T13:30:45.430+0100 7f1140728700  1 failed to read header: The socket was closed due to a timeout
Feb 18 13:30:45 XXX radosgw[3894994]: 2021-02-18T13:30:45.430+0100 7f1140728700  5 failed to write response: Bad file descriptor
Feb 18 13:30:45 XXX radosgw[3894994]: 2021-02-18T13:30:45.430+0100 7f1140728700  1 ====== req done http_status=400 ======
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.210+0100 7f11beffd700 20 processing logshard = reshard.0000000000
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.214+0100 7f11beffd700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000000 ret=-16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.214+0100 7f11beffd700  5 process_single_logshard(): failed to acquire lock on reshard.0000000000, ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.214+0100 7f11beffd700 20 finish processing logshard = reshard.0000000000 , ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.214+0100 7f11beffd700 20 processing logshard = reshard.0000000001
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.218+0100 7f11beffd700 20 finish processing logshard = reshard.0000000001 , ret = 0
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.218+0100 7f11beffd700 20 processing logshard = reshard.0000000002
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.222+0100 7f11beffd700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000002 ret=-16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.222+0100 7f11beffd700  5 process_single_logshard(): failed to acquire lock on reshard.0000000002, ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.222+0100 7f11beffd700 20 finish processing logshard = reshard.0000000002 , ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.222+0100 7f11beffd700 20 processing logshard = reshard.0000000003
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.226+0100 7f11beffd700 20 finish processing logshard = reshard.0000000003 , ret = 0
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.226+0100 7f11beffd700 20 processing logshard = reshard.0000000004
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.230+0100 7f11beffd700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000004 ret=-16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.230+0100 7f11beffd700  5 process_single_logshard(): failed to acquire lock on reshard.0000000004, ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.230+0100 7f11beffd700 20 finish processing logshard = reshard.0000000004 , ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.230+0100 7f11beffd700 20 processing logshard = reshard.0000000005
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 20 finish processing logshard = reshard.0000000005 , ret = 0
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 20 processing logshard = reshard.0000000006
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000006 ret=-16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700  5 process_single_logshard(): failed to acquire lock on reshard.0000000006, ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 20 finish processing logshard = reshard.0000000006 , ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 20 processing logshard = reshard.0000000007
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.238+0100 7f11beffd700 20 finish processing logshard = reshard.0000000007 , ret = 0
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.238+0100 7f11beffd700 20 processing logshard = reshard.0000000008
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.242+0100 7f11beffd700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000008 ret=-16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.242+0100 7f11beffd700  5 process_single_logshard(): failed to acquire lock on reshard.0000000008, ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.242+0100 7f11beffd700 20 finish processing logshard = reshard.0000000008 , ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.242+0100 7f11beffd700 20 processing logshard = reshard.0000000009
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.246+0100 7f11beffd700 20 finish processing logshard = reshard.0000000009 , ret = 0
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.246+0100 7f11beffd700 20 processing logshard = reshard.0000000010
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.250+0100 7f11beffd700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000010 ret=-16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.250+0100 7f11beffd700  5 process_single_logshard(): failed to acquire lock on reshard.0000000010, ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.250+0100 7f11beffd700 20 finish processing logshard = reshard.0000000010 , ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.250+0100 7f11beffd700 20 processing logshard = reshard.0000000011
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.254+0100 7f11beffd700 20 finish processing logshard = reshard.0000000011 , ret = 0
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.254+0100 7f11beffd700 20 processing logshard = reshard.0000000012
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.258+0100 7f11beffd700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000012 ret=-16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.258+0100 7f11beffd700  5 process_single_logshard(): failed to acquire lock on reshard.0000000012, ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.258+0100 7f11beffd700 20 finish processing logshard = reshard.0000000012 , ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.258+0100 7f11beffd700 20 processing logshard = reshard.0000000013
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.282+0100 7f11beffd700 20 finish processing logshard = reshard.0000000013 , ret = 0
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.282+0100 7f11beffd700 20 processing logshard = reshard.0000000014
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.286+0100 7f11beffd700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000014 ret=-16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.286+0100 7f11beffd700  5 process_single_logshard(): failed to acquire lock on reshard.0000000014, ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.286+0100 7f11beffd700 20 finish processing logshard = reshard.0000000014 , ret = -16
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.286+0100 7f11beffd700 20 processing logshard = reshard.0000000015
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.290+0100 7f11beffd700 20 finish processing logshard = reshard.0000000015 , ret = 0
Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.786+0100 7f1169f7b700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
Feb 18 13:30:48 XXX radosgw[3894994]: 2021-02-18T13:30:48.766+0100 7f112f706700 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
Feb 18 13:30:48 XXX radosgw[3894994]: 2021-02-18T13:30:48.766+0100 7f112f706700 20 RGWObjManifest::operator++(): result: ofs=33423360 stripe_ofs=33423360 part_ofs=0 rule->part_size=0
Feb 18 13:30:48 XXX radosgw[3894994]: 2021-02-18T13:30:48.766+0100 7f112f706700 20 rados->get_obj_iterate_cb oid=6112a2c2-2570-4c6a-8e57-709633df12eb.297104279.2__shadow_.NI8Cv6rnvmenLNaXv9ceAdZSqTpATLa_8 obj-ofs=33423360 read_ofs=0 len=4177920

There may be a totally benign reason for the difference between Python 3.6 and 3.8, but I don't know which one that would be, maybe a more forgiving urllib3.

This issue may be related: https://tracker.ceph.com/issues/47722

Actions #5

Updated by Janek Bevendorff about 3 years ago

Now that I have debugged this enough to write a detailed bug report, I am starting to believe that this is indeed a Python 3.6 client issue and the errors are simply client errors logged on server side. When I prematurely close a client connection, I am seeing the same iterate_obj errors.

In that case it could simply be a general latency error which isn't handled well by the client. Is that a possible reason?

Actions #6

Updated by Casey Bodley over 2 years ago

  • Status changed from Need More Info to Closed

Janek Bevendorff wrote:

Now that I have debugged this enough to write a detailed bug report, I am starting to believe that this is indeed a Python 3.6 client issue and the errors are simply client errors logged on server side. When I prematurely close a client connection, I am seeing the same iterate_obj errors.

In that case it could simply be a general latency error which isn't handled well by the client. Is that a possible reason?

this does sound like a client or network issue outside of rgw, especially because rgw is closing the connection due to timeout:

write_data failed: The socket was closed due to a timeout

Actions

Also available in: Atom PDF