Bug #45013
closedNautilus: Intermittent dropped connections with iterate_obj() failed
0%
Description
Hello,
Our cluster heavily uses S3 and we notice intermittent dropped connections in HAProxy error logs:
Mar 31 13:36:21 xxx haproxy[30442]: xxx.xx.xx.14:43836 [31/Mar/2020:13:36:08.606] rgw_8080_public~ rgw_8080_public/xxx 0/0/0/165/13022 200 2832351577 - - CD-- 1577/2/0/0/0 0/0 "GET /s3debug/file.txt HTTP/1.1
Corresponding to this, we find errors in rgw logs:
2020-04-01 14:25:54.043 7f7892a51700 0 iterate_obj() failed with -5
2020-04-01 14:25:54.043 7f7892a51700 1 ====== req done req=0x7f7892a4a8d0 op status=-5 http_status=200 latency=9.82881s ======
Sometimes the latency even reaches 100+ seconds. I'm not sure if this is really a rgw or a librados issue.
When this error occurs, the client (minio-client or s3cmd) reports that the peer has dropped the connection.
It seems that this randomly happens when downloading data. I made a script, which downloads the same file over and over again. A few times per hour this error occurs.
The ceph cluster itself is in a healthy state. Additionally, we found dropped packages in our infrastructure. Fixing this however did not resolve this described issue here.
RGW-Config:
[client.rgw.xxx] rgw_frontends = "beast endpoint=xxx:8080" rgw keystone api version = 3 rgw keystone url = http://XXXX:5000 rgw keystone admin user = XXX rgw keystone admin password = XXX rgw keystone admin domain = default rgw keystone admin project = service rgw keystone accepted roles = admin, _member_, user rgw keystone token cache size = 1000 rgw keystone revocation interval = 300 rgw s3 auth use keystone = true rgw keystone verify ssl = false
Updated by Or Friedmann about 4 years ago
Hi,
Can you please increase rgw debug level:
debug_rgw 20
debug_ms 1
and publish the logs?
Thank you
Updated by Or Friedmann about 4 years ago
- Status changed from New to Need More Info
Updated by Janek Bevendorff about 3 years ago
I am having the same or a similar issue and I don't quite know what's going on here.
The issue is reproducible only if I am accessing multiple files at the same time in parallel. And it seems like it really has to be MULTIPLE files, not several times the same file. Downloading one individual file is (almost) never an issue.
Here is what works:
- downloading an individual file using s3cmd or Python Boto3
- downloading the same file via Boto3 in several threads
And here is what doesn't:
- reading many files in a bucket via Boto3 at the same time (I can reproduce it with a Spark job with as few as 16)
The strangest thing is that it always affects the same files at about (but not exactly) the same byte offset (~ 3.5MB). At the moment, I have two files with which I can reproduce it (somewhat) reliably. Both are single-part objects (WARCs ~1GB in size).
Here is the verbose RadosGW log for the first object:
Feb 18 12:55:54 XXX radosgw[3894994]: 2021-02-18T12:55:54.249+0100 7f10bee25700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:55:54 XXX radosgw[3894994]: 2021-02-18T12:55:54.465+0100 7f11aa7fc700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:55:54 XXX radosgw[3894994]: 2021-02-18T12:55:54.793+0100 7f1176794700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:55:55 XXX radosgw[3894994]: 2021-02-18T12:55:55.385+0100 7f1122eed700 1 failed to read header: The socket was closed due to a timeout Feb 18 12:55:55 XXX radosgw[3894994]: 2021-02-18T12:55:55.385+0100 7f11967d4700 5 failed to write response: Bad file descriptor Feb 18 12:55:55 XXX radosgw[3894994]: 2021-02-18T12:55:55.385+0100 7f11967d4700 1 ====== req done http_status=400 ====== Feb 18 12:55:55 XXX radosgw[3894994]: 2021-02-18T12:55:55.489+0100 7f113c720700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:55:56 XXX radosgw[3894994]: 2021-02-18T12:55:56.593+0100 7f110e6c4700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.201+0100 7f1141f2b700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 4 write_data failed: The socket was closed due to a timeout Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 -1 handle_data failed with exit code -1 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 -1 handle_data failed with exit code -1 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 0 iterate_obj() failed with -1 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 2 req 233064 65.028426619s s3:get_obj completing Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 2 req 233064 65.028426619s s3:get_obj op status=-1 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 2 req 233064 65.028426619s s3:get_obj http status=200 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 1 ====== req done req=0x7f12b0294670 op status=-1 http_status=200 latency=65.028426619s ====== Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 1 beast: 0x7f12b0294670: XXX- - [2021-02-18T12:55:57.229123+0100] "GET /corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00002.warc.gz HTTP/1.1" 200 0 - "Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource" - Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.229+0100 7f10e166a700 20 failed to read header: Bad file descriptor Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 4 write_data failed: The socket was closed due to a timeout Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 -1 handle_data failed with exit code -1 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 -1 handle_data failed with exit code -1 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 0 iterate_obj() failed with -1 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 2 req 233074 65.020426566s s3:get_obj completing Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 2 req 233074 65.020426566s s3:get_obj op status=-1 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 2 req 233074 65.020426566s s3:get_obj http status=200 Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 1 ====== req done req=0x7f12a0572670 op status=-1 http_status=200 latency=65.020426566s ====== Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 1 beast: 0x7f12a0572670: XXX - - [2021-02-18T12:55:57.257123+0100] "GET /corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00009.warc.gz HTTP/1.1" 200 0 - "Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource" - Feb 18 12:55:57 XXX radosgw[3894994]: 2021-02-18T12:55:57.257+0100 7f112e704700 20 failed to read header: Bad file descriptor Feb 18 12:55:59 XXX radosgw[3894994]: 2021-02-18T12:55:59.497+0100 7f1169f7b700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:56:00 XXX radosgw[3894994]: 2021-02-18T12:56:00.345+0100 7f10d7e57700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:56:00 XXX radosgw[3894994]: 2021-02-18T12:56:00.885+0100 7f1180fa9700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:56:02 XXX radosgw[3894994]: 2021-02-18T12:56:02.009+0100 7f118e7c4700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:56:03 XXX radosgw[3894994]: 2021-02-18T12:56:03.353+0100 7f10c6634700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:56:08 XXX radosgw[3894994]: 2021-02-18T12:56:08.201+0100 7f10e8e79700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:56:08 XXX radosgw[3894994]: 2021-02-18T12:56:08.853+0100 7f1283fe7700 2 RGWDataChangesLog::ChangesRenewThread: start Feb 18 12:56:10 XXX radosgw[3894994]: 2021-02-18T12:56:10.421+0100 7f111f6e6700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:56:11 XXX radosgw[3894994]: 2021-02-18T12:56:11.161+0100 7f117378e700 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1 Feb 18 12:56:11 XXX radosgw[3894994]: 2021-02-18T12:56:11.161+0100 7f117378e700 20 RGWObjManifest::operator++(): result: ofs=29245440 stripe_ofs=29245440 part_ofs=0 rule->part_size=0 Feb 18 12:56:11 XXX radosgw[3894994]: 2021-02-18T12:56:11.161+0100 7f117378e700 20 rados->get_obj_iterate_cb oid=6112a2c2-2570-4c6a-8e57-709633df12eb.297104279.2__shadow_.sF_tSct0S_Fa22p35db9JlAXRFUrhmK_7 obj-ofs=29245440 read_ofs=0 len=4177920 Feb 18 12:56:11 XXX radosgw[3894994]: 2021-02-18T12:56:11.161+0100 7f117378e700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_ACCEPT_ENCODING=identity Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_AUTHORIZATION=XXX Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_HOST=XXX:7480 Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_USER_AGENT=Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_VERSION=1.1 Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_X_AMZ_CONTENT_SHA256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 HTTP_X_AMZ_DATE=20210218T115613Z Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 REMOTE_ADDR=XXX Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 REQUEST_METHOD=GET Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 REQUEST_URI=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 SCRIPT_URI=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz Feb 18 12:56:13 XXX radosgw[3894994]: 2021-02-18T12:56:13.781+0100 7f1147736700 20 SERVER_PORT=7480
and here's the error for the second one:
Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.343+0100 7f10c362e700 1 ====== req done req=0x7f12a05f3670 op status=-1 http_status=200 latency=65.060426476s ====== Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.343+0100 7f10c362e700 1 beast: 0x7f12a05f3670: XXX - - [2021-02-18T12:52:58.343950+0100] "GET /corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00002.warc.gz HTTP/1.1" 200 0 - "Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource" - Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.343+0100 7f10c362e700 20 failed to read header: Bad file descriptor Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.451+0100 7f10d4650700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:52:58 XXX radosgw[3894994]: 2021-02-18T12:52:58.695+0100 7f10c9e3b700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:00 XXX radosgw[3894994]: 2021-02-18T12:53:00.059+0100 7f11116ca700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:01 XXX radosgw[3894994]: 2021-02-18T12:53:01.375+0100 7f10cb63e700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:03 XXX radosgw[3894994]: 2021-02-18T12:53:03.039+0100 7f115a75c700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:03 XXX radosgw[3894994]: 2021-02-18T12:53:03.599+0100 7f10bd622700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:07 XXX radosgw[3894994]: 2021-02-18T12:53:07.304+0100 7f11076b6700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:11 XXX radosgw[3894994]: 2021-02-18T12:53:11.964+0100 7f11a77f6700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:12 XXX radosgw[3894994]: 2021-02-18T12:53:12.852+0100 7f1283fe7700 2 RGWDataChangesLog::ChangesRenewThread: start Feb 18 12:53:13 XXX radosgw[3894994]: 2021-02-18T12:53:13.520+0100 7f1123eef700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:14 XXX radosgw[3894994]: 2021-02-18T12:53:14.412+0100 7f119b7de700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.436+0100 7f10ca63c700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_ACCEPT_ENCODING=identity Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_AUTHORIZATION=XXXX Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_HOST=XXX:7480 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_USER_AGENT=Boto3/1.17.10 Python/3.6.4 Linux/5.8.0-41-generic Botocore/1.20.10 Resource Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_VERSION=1.1 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_X_AMZ_CONTENT_SHA256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 HTTP_X_AMZ_DATE=20210218T115315Z Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 REMOTE_ADDR=XXX Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 REQUEST_METHOD=GET Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 REQUEST_URI=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 SCRIPT_URI=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 SERVER_PORT=7480 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 1 ====== starting new request req=0x7f12b833d670 ===== Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 2 req 233058 0s initializing for trans_id = tx000000000000000038e62-00602e552b-15a727be-XXX Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 rgw api priority: s3=8 s3website=7 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 host=XXX Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 subdomain= domain=XXX in_hosted_domain=1 in_hosted_domain_s3website=0 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 final domain/bucket subdomain= domain=XXX in_hosted_domain=1 in_hosted_domain_s3website=0 s->info.domain=XXX s->info.request_uri=/corpus-commoncrawl-main-2020-34/crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 meta>> HTTP_X_AMZ_CONTENT_SHA256 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 meta>> HTTP_X_AMZ_DATE Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 x>> x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 x>> x-amz-date:20210218T115315Z Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s get_handler handler=22RGWHandler_REST_Obj_S3 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 handler=22RGWHandler_REST_Obj_S3 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 2 req 233058 0s getting op 0 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 req 233058 0s s3:get_obj scheduling with dmclock client=2 cost=1 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 op=21RGWGetObj_ObjStore_S3 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 2 req 233058 0s s3:get_obj verifying requester Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::StrategyRegistry::s3_main_strategy_t: trying rgw::auth::s3::AWSAuthStrategy Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::AWSAuthStrategy: trying rgw::auth::s3::S3AnonymousEngine Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::S3AnonymousEngine denied with reason=-1 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::AWSAuthStrategy: trying rgw::auth::s3::LocalEngine Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 v4 signature format = e2fc99164bbcd9c439c7841ab25f678821ce6bb9cee8938b23e9f74b74af3a68 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 v4 credential format = XXX/20210218/us-east-1/s3/aws4_request Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 access key id = XXX Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 credential scope = 20210218/us-east-1/s3/aws4_request Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 req 233058 0s canonical headers format = host:XXX:7480 XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 date_k = c5d419e31f0abe26c7595735635be7bc83dfbf5b97f0c2cbfa20875853611e21 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 region_k = 56d1c1befda3c48b5dc455677b758eb2d17ee845d7def87997a24aff88a7df61 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 service_k = 5874b3496d084572aad5ea9917212d828b234b37e53c3c7d0d60e70956fe301d Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 signing_k = 328889a62e5082ac25d5a2e782697b8ed705faf400668a82384092f659663e45 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 generated signature = e2fc99164bbcd9c439c7841ab25f678821ce6bb9cee8938b23e9f74b74af3a68 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 req 233058 0s s3:get_obj string_to_sign=AWS4-HMAC-SHA256 20210218T115315Z 20210218/us-east-1/s3/aws4_request 5b961df2c7c84779d8b877e6b2c3b81a6df454127677eba711e6acb79fc0e074 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 req 233058 0s s3:get_obj server signature=e2fc99164bbcd9c439c7841ab25f678821ce6bb9cee8938b23e9f74b74af3a68 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 req 233058 0s s3:get_obj client signature=e2fc99164bbcd9c439c7841ab25f678821ce6bb9cee8938b23e9f74b74af3a68 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 req 233058 0s s3:get_obj compare=0 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::LocalEngine granted access Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 req 233058 0s s3:get_obj rgw::auth::s3::AWSAuthStrategy granted access Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 2 req 233058 0s s3:get_obj normalizing buckets and tenants Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 s->object=crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz s->bucket=corpus-commoncrawl-main-2020-34 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 2 req 233058 0s s3:get_obj init permissions Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_system_obj_state: rctx=0x7f12b833ab48 obj=XXX.rgw.meta:root:corpus-commoncrawl-main-2020-34 state=0x7f105c05d2d0 s->prefetch_data=0 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 cache get: name=XXX.rgw.meta+root+corpus-commoncrawl-main-2020-34 : hit (requested=0x16, cached=0x17) Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_system_obj_state: s->obj_tag was set empty Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 cache get: name=XXX.rgw.meta+root+corpus-commoncrawl-main-2020-34 : hit (requested=0x11, cached=0x17) Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 15 decode_policy Read AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>XXX-corpora</ID><DisplayName>XXX Corpora</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>XXX-corpora</ID><DisplayName>XXX Corpora</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy> Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_system_obj_state: rctx=0x7f12b833af88 obj=XXX.rgw.meta:users.uid:XXX-corpora state=0x7f105c05d2d0 s->prefetch_data=0 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 cache get: name=XXX.rgw.meta+users.uid+XXX-corpora : hit (requested=0x6, cached=0x17) Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_system_obj_state: s->obj_tag was set empty Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 Read xattr: user.rgw.idtag Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 10 cache get: name=XXX.rgw.meta+users.uid+XXX-corpora : hit (requested=0x3, cached=0x17) Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 2 req 233058 0s s3:get_obj recalculating target Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 2 req 233058 0s s3:get_obj reading permissions Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 get_obj_state: rctx=0x7f12b833c290 obj=corpus-commoncrawl-main-2020-34:crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz state=0x7f105c094058 s->prefetch_data=1 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.780+0100 7f118dfc3700 20 WARNING: blocking librados call Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 10 manifest: total_size = 897189304 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 get_obj_state: setting s->obj_tag to 6112a2c2-2570-4c6a-8e57-709633df12eb.297104261.793585 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 15 decode_policy Read AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>XXX-corpora</ID><DisplayName>XXX Corpora</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>XXX-corpora</ID><DisplayName>XXX Corpora</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy> Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 2 req 233058 0.020000132s s3:get_obj init op Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 2 req 233058 0.020000132s s3:get_obj verifying op mask Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 req 233058 0.020000132s s3:get_obj required_mask= 1 user.op_mask=7 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 2 req 233058 0.020000132s s3:get_obj verifying op permissions Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 req 233058 0.020000132s s3:get_obj -- Getting permissions begin with perm_mask=49 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 5 req 233058 0.020000132s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=XXX-corpora, acct_name=XXX Corpora, subuser=, perm_mask=15, is_admin=0) mask=49 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 5 Searching permissions for uid=XXX-corpora Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 5 Found permission: 15 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 5 Searching permissions for group=1 mask=49 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 5 Permissions for group not found Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 5 Searching permissions for group=2 mask=49 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 5 Permissions for group not found Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 5 req 233058 0.020000132s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=XXX-corpora, acct_name=XXX Corpora, subuser=, perm_mask=15, is_admin=0), owner=XXX-corpora, perm=1 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 10 req 233058 0.020000132s s3:get_obj identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=XXX-corpora, acct_name=XXX Corpora, subuser=, perm_mask=15, is_admin=0) requested perm (type)=1, policy perm=1, user_perm_mask=15, acl perm=1 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 2 req 233058 0.020000132s s3:get_obj verifying op params Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 2 req 233058 0.020000132s s3:get_obj pre-executing Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 2 req 233058 0.020000132s s3:get_obj executing Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 get_obj_state: rctx=0x7f12b833c290 obj=corpus-commoncrawl-main-2020-34:crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz state=0x7f105c094058 s->prefetch_data=1 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.acl Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.compression Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.content_type Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.etag Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.idtag Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.manifest Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.pg_ver Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.source_zone Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.tail_tag Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.x-amz-content-sha256 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.x-amz-date Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.800+0100 7f118dfc3700 20 Read xattr: user.rgw.x-amz-decoded-content-length Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.804+0100 7f118dfc3700 15 Encryption mode: Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.804+0100 7f118dfc3700 20 get_obj_state: rctx=0x7f12b833c290 obj=corpus-commoncrawl-main-2020-34:crawl-data/CC-MAIN-2020-34/segments/1596439735792.85/warc/CC-MAIN-20200803083123-20200803113123-00024.warc.gz state=0x7f105c094058 s->prefetch_data=1 Feb 18 12:53:15 XXX radosgw[3894994]: 2021-02-18T12:53:15.804+0100 7f118dfc3700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 12:53:16 XXX radosgw[3894994]: 2021-02-18T12:53:16.660+0100 7f10fa69c700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920
The error messages are different, but in Boto3, they both result in the same IncompleteReadError exception with the following message:
IncompleteReadError: 3674448 read, but total bytes expected is 930282665. IncompleteReadError: 3322872 read, but total bytes expected is 908263506.
As I said, the byte offsets differ every time, but they are always about 3.5MB.
Downloading said files individually is no problem at all and the log is simply filled with a bunch of these okay-looking status messages:
Feb 18 13:06:45 XXX radosgw[3894994]: 2021-02-18T13:06:45.513+0100 7f1110ec9700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 13:06:45 XXX radosgw[3894994]: 2021-02-18T13:06:45.557+0100 7f10dce61700 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1 Feb 18 13:06:45 XXX radosgw[3894994]: 2021-02-18T13:06:45.557+0100 7f10dce61700 20 RGWObjManifest::operator++(): result: ofs=296632320 stripe_ofs=296632320 part_ofs=0 rule->part_size=0 Feb 18 13:06:45 XXX radosgw[3894994]: 2021-02-18T13:06:45.557+0100 7f10dce61700 20 rados->get_obj_iterate_cb oid=6112a2c2-2570-4c6a-8e57-709633df12eb.297104279.2__shadow_.B3JcFV1w2DcBXfGlYXWF6nPQfnXaIVs_71 obj-ofs=296632320 read_ofs=0 len=4177920
Any ideas what might be the reason? From what I can tell, the issue seems to be load- or latency-induced to some extent. At least the first file consistently gets a timeout error somewhere. There may be a knob to turn for this. The message for the second file looks more serious, but I can also download that file individually without issues.
And here's the thing that makes the least sense: I seem to be able to reproduce it only with a Python 3.6 client and not with a Python 3.8 client (same Boto3 version). At least I am not getting an exception. However, I do see a few of these messages scroll by in the RadosGW log:
Feb 18 13:30:45 XXX radosgw[3894994]: 2021-02-18T13:30:45.430+0100 7f1140728700 1 failed to read header: The socket was closed due to a timeout Feb 18 13:30:45 XXX radosgw[3894994]: 2021-02-18T13:30:45.430+0100 7f1140728700 5 failed to write response: Bad file descriptor Feb 18 13:30:45 XXX radosgw[3894994]: 2021-02-18T13:30:45.430+0100 7f1140728700 1 ====== req done http_status=400 ====== Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.210+0100 7f11beffd700 20 processing logshard = reshard.0000000000 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.214+0100 7f11beffd700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000000 ret=-16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.214+0100 7f11beffd700 5 process_single_logshard(): failed to acquire lock on reshard.0000000000, ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.214+0100 7f11beffd700 20 finish processing logshard = reshard.0000000000 , ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.214+0100 7f11beffd700 20 processing logshard = reshard.0000000001 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.218+0100 7f11beffd700 20 finish processing logshard = reshard.0000000001 , ret = 0 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.218+0100 7f11beffd700 20 processing logshard = reshard.0000000002 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.222+0100 7f11beffd700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000002 ret=-16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.222+0100 7f11beffd700 5 process_single_logshard(): failed to acquire lock on reshard.0000000002, ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.222+0100 7f11beffd700 20 finish processing logshard = reshard.0000000002 , ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.222+0100 7f11beffd700 20 processing logshard = reshard.0000000003 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.226+0100 7f11beffd700 20 finish processing logshard = reshard.0000000003 , ret = 0 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.226+0100 7f11beffd700 20 processing logshard = reshard.0000000004 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.230+0100 7f11beffd700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000004 ret=-16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.230+0100 7f11beffd700 5 process_single_logshard(): failed to acquire lock on reshard.0000000004, ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.230+0100 7f11beffd700 20 finish processing logshard = reshard.0000000004 , ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.230+0100 7f11beffd700 20 processing logshard = reshard.0000000005 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 20 finish processing logshard = reshard.0000000005 , ret = 0 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 20 processing logshard = reshard.0000000006 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000006 ret=-16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 5 process_single_logshard(): failed to acquire lock on reshard.0000000006, ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 20 finish processing logshard = reshard.0000000006 , ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.234+0100 7f11beffd700 20 processing logshard = reshard.0000000007 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.238+0100 7f11beffd700 20 finish processing logshard = reshard.0000000007 , ret = 0 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.238+0100 7f11beffd700 20 processing logshard = reshard.0000000008 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.242+0100 7f11beffd700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000008 ret=-16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.242+0100 7f11beffd700 5 process_single_logshard(): failed to acquire lock on reshard.0000000008, ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.242+0100 7f11beffd700 20 finish processing logshard = reshard.0000000008 , ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.242+0100 7f11beffd700 20 processing logshard = reshard.0000000009 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.246+0100 7f11beffd700 20 finish processing logshard = reshard.0000000009 , ret = 0 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.246+0100 7f11beffd700 20 processing logshard = reshard.0000000010 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.250+0100 7f11beffd700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000010 ret=-16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.250+0100 7f11beffd700 5 process_single_logshard(): failed to acquire lock on reshard.0000000010, ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.250+0100 7f11beffd700 20 finish processing logshard = reshard.0000000010 , ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.250+0100 7f11beffd700 20 processing logshard = reshard.0000000011 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.254+0100 7f11beffd700 20 finish processing logshard = reshard.0000000011 , ret = 0 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.254+0100 7f11beffd700 20 processing logshard = reshard.0000000012 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.258+0100 7f11beffd700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000012 ret=-16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.258+0100 7f11beffd700 5 process_single_logshard(): failed to acquire lock on reshard.0000000012, ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.258+0100 7f11beffd700 20 finish processing logshard = reshard.0000000012 , ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.258+0100 7f11beffd700 20 processing logshard = reshard.0000000013 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.282+0100 7f11beffd700 20 finish processing logshard = reshard.0000000013 , ret = 0 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.282+0100 7f11beffd700 20 processing logshard = reshard.0000000014 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.286+0100 7f11beffd700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000014 ret=-16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.286+0100 7f11beffd700 5 process_single_logshard(): failed to acquire lock on reshard.0000000014, ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.286+0100 7f11beffd700 20 finish processing logshard = reshard.0000000014 , ret = -16 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.286+0100 7f11beffd700 20 processing logshard = reshard.0000000015 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.290+0100 7f11beffd700 20 finish processing logshard = reshard.0000000015 , ret = 0 Feb 18 13:30:47 XXX radosgw[3894994]: 2021-02-18T13:30:47.786+0100 7f1169f7b700 10 Compression for rgw is enabled, decompress part bl_ofs=04177920 Feb 18 13:30:48 XXX radosgw[3894994]: 2021-02-18T13:30:48.766+0100 7f112f706700 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1 Feb 18 13:30:48 XXX radosgw[3894994]: 2021-02-18T13:30:48.766+0100 7f112f706700 20 RGWObjManifest::operator++(): result: ofs=33423360 stripe_ofs=33423360 part_ofs=0 rule->part_size=0 Feb 18 13:30:48 XXX radosgw[3894994]: 2021-02-18T13:30:48.766+0100 7f112f706700 20 rados->get_obj_iterate_cb oid=6112a2c2-2570-4c6a-8e57-709633df12eb.297104279.2__shadow_.NI8Cv6rnvmenLNaXv9ceAdZSqTpATLa_8 obj-ofs=33423360 read_ofs=0 len=4177920
There may be a totally benign reason for the difference between Python 3.6 and 3.8, but I don't know which one that would be, maybe a more forgiving urllib3.
This issue may be related: https://tracker.ceph.com/issues/47722
Updated by Janek Bevendorff about 3 years ago
Now that I have debugged this enough to write a detailed bug report, I am starting to believe that this is indeed a Python 3.6 client issue and the errors are simply client errors logged on server side. When I prematurely close a client connection, I am seeing the same iterate_obj errors.
In that case it could simply be a general latency error which isn't handled well by the client. Is that a possible reason?
Updated by Casey Bodley over 2 years ago
- Status changed from Need More Info to Closed
Janek Bevendorff wrote:
Now that I have debugged this enough to write a detailed bug report, I am starting to believe that this is indeed a Python 3.6 client issue and the errors are simply client errors logged on server side. When I prematurely close a client connection, I am seeing the same iterate_obj errors.
In that case it could simply be a general latency error which isn't handled well by the client. Is that a possible reason?
this does sound like a client or network issue outside of rgw, especially because rgw is closing the connection due to timeout:
write_data failed: The socket was closed due to a timeout