Bug #63017

closed

write_data failed: Connection reset by peer error observed against main while uploading a multipart object

Added by Pritha Srivastava 7 months ago. Updated 6 months ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The error `write_data failed: Connection reset by peer` was first observed in a 'get object' call while running the test_multipart_upload() test against the d3n filter driver branch. On investigation, the error goes away when something adds latency after the 'get object' call, such as the 'assert' statements that follow it.

I created a boto3 script based on test_multipart_upload(); its first 'get object' call produces this error in the log file.

I brought up vstart using the following command:
MON=1 OSD=1 RGW=1 MGR=0 MDS=0 ../src/vstart.sh -n -d

I have attached the script here. Some parts have been commented out so that it reproduces the error.

A snippet from the log file is below:
2023-09-28T15:35:10.052+0530 7f4e9b7c86c0 20 req 6030671164491735147 0.005000023s s3:get_obj RGWObjManifest::operator++(): rule->part_size=5242880 rules.size()=1
2023-09-28T15:35:10.052+0530 7f4e9b7c86c0 20 req 6030671164491735147 0.005000023s s3:get_obj RGWObjManifest::operator++(): stripe_ofs=18874368 part_ofs=10485760 rule->part_size=5242880
2023-09-28T15:35:10.052+0530 7f4e9b7c86c0 20 req 6030671164491735147 0.005000023s s3:get_obj RGWObjManifest::operator++(): result: ofs=15728640 stripe_ofs=15728640 part_ofs=15728640 rule->part_size=5242880
2023-09-28T15:35:10.052+0530 7f4e9b7c86c0 20 req 6030671164491735147 0.005000023s s3:get_obj rados->get_obj_iterate_cb oid=38f1b649-d8be-4772-a31a-ed016eba5ea6.4149.1__multipart_mymultipart.2~KCjBpxY4Zb7Z3s_Pw_j9Il1RUe11iXJ.4 obj-ofs=15728640 read_ofs=0 len=4194304
...
2023-09-28T15:35:10.091+0530 7f4e947ba6c0 4 write_data failed: Connection reset by peer
2023-09-28T15:35:10.091+0530 7f4e947ba6c0 0 req 6030671164491735147 0.044000205s s3:get_obj iterate_obj() failed with -104
2023-09-28T15:35:10.091+0530 7f4e947ba6c0 2 req 6030671164491735147 0.044000205s s3:get_obj completing
2023-09-28T15:35:10.091+0530 7f4e947ba6c0 10 req 6030671164491735147 0.044000205s cache get: name=default.rgw.log++script.postrequest. : hit (negative entry)
2023-09-28T15:35:10.091+0530 7f4e947ba6c0 2 req 6030671164491735147 0.044000205s s3:get_obj op status=-104
2023-09-28T15:35:10.091+0530 7f4e947ba6c0 2 req 6030671164491735147 0.044000205s s3:get_obj http status=200
2023-09-28T15:35:10.091+0530 7f4e947ba6c0 1 ====== req done req=0x7f4e69763710 op status=-104 http_status=200 latency=0.044000205s ======


Files

test_multipart.py (4.51 KB) test_multipart.py Pritha Srivastava, 09/28/2023 10:08 AM
Actions #1

Updated by Casey Bodley 7 months ago

  • Status changed from New to Need More Info

hi Pritha, i found that the error goes away when i uncomment the following line:

body = _get_body(response)

without this line, boto doesn't try to read any of the response body from the socket. it just reads the response headers, then closes the socket when the script exits. that's why rgw prints the Connection reset by peer error

i don't think this is a bug in rgw
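To illustrate the mechanism described in this comment, here is a minimal sketch that uses only Python's standard library. It is not RGW or boto3 code: the function name, the 64 MiB body size, and the toy "server" are all assumptions made for illustration. The client reads only the header bytes and then closes the socket, and the sender's write fails partway through the body, much like the `write_data failed` line in the log (errno 104 is ECONNRESET on Linux).

```python
import errno
import socket
import threading


def send_body_to_early_closing_client(body_size=64 * 1024 * 1024):
    """Return the errno name the sender sees when the client closes early.

    Returns None if the whole body was written without an error.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    result = {"error": None}

    def serve():
        conn, _ = listener.accept()
        try:
            # Headers first, then a body assumed to be too large to fit
            # in the kernel socket buffers, so the write must block.
            conn.sendall(b"HTTP/1.1 200 OK\r\n\r\n")
            conn.sendall(b"x" * body_size)
        except OSError as exc:
            result["error"] = errno.errorcode.get(exc.errno, str(exc.errno))
        finally:
            conn.close()

    server = threading.Thread(target=serve)
    server.start()

    client = socket.create_connection(("127.0.0.1", port))
    client.recv(19)  # read at most the 19 header bytes, never the body
    client.close()   # close while the sender is still writing the body

    server.join(timeout=30)
    listener.close()
    return result["error"]


if __name__ == "__main__":
    print(send_body_to_early_closing_client())
```

Depending on timing and platform, the sender may report ECONNRESET or EPIPE; either way the write fails only because the peer went away, not because of a fault on the sending side.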

Actions #2

Updated by Pritha Srivastava 7 months ago

Ok, isn't get_object() itself supposed to return the whole body - I mean I thought boto waits till get_object() call returns the whole body.

In the case of the d3n filter driver, I observed that the second get_object() hit the `write_data failed: Connection reset by peer` error. There is no _get_body() call after it in the test, so rgw crashes in the destructor of aio because the `completed` list was non-empty. I fixed the crash by draining all outstanding I/Os when the filter driver receives an error during an ongoing read call. But then I saw an error in the subsequent ranged request calls; that is what I will debug next.

Actions #3

Updated by Casey Bodley 7 months ago

Pritha Srivastava wrote:

Ok, isn't get_object() itself supposed to return the whole body - I mean I thought boto waits till get_object() call returns the whole body.

in boto3, get_object() only waits for the response headers. it then returns a StreamingBody interface that the caller can use to read the rest of its response body from the socket at its own pace. see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/get_object.html

that StreamingBody approach is important for really large objects, so the client can limit how much data it buffers in memory. in contrast, boto2 had a get_contents_as_string() that returned the entire response body as a string: see https://boto.cloudhackers.com/en/latest/ref/s3.html#boto.s3.key.Key.get_contents_as_string
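As a rough sketch of what reading "at its own pace" can look like on the client side, the helper below drains any file-like body in fixed-size chunks so that only one chunk is held in memory at a time. The helper name `drain_body` and the 1 MiB chunk size are assumptions for illustration; an `io.BytesIO` stands in for the response body so the snippet runs without an S3 endpoint.

```python
import io


def drain_body(body, chunk_size=1024 * 1024):
    """Read a file-like response body to the end; return the byte count."""
    total = 0
    while True:
        chunk = body.read(chunk_size)  # at most chunk_size bytes per call
        if not chunk:                  # empty bytes means end of stream
            break
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Stand-in for response["Body"]; a real StreamingBody also exposes read(amt).
    fake_body = io.BytesIO(b"a" * (3 * 1024 * 1024))
    print(drain_body(fake_body))
```

With boto3, the same helper would be called as `drain_body(response["Body"])` on the dict returned by get_object(), so the whole response is consumed before the connection is closed.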

Actions #4

Updated by Pritha Srivastava 7 months ago

Ok, but for objects as small as 3MB I did not see the 'write_data: connection reset' error without the '_get_body(response)' call. Is 'get_object' returning some amount of data in addition to the response headers?

Actions #5

Updated by Casey Bodley 6 months ago

  • Status changed from Need More Info to Won't Fix