Bug #22225

open

rgw:socket leak in s3 multi part upload

Added by Ross Martyn over 6 years ago. Updated almost 6 years ago.

Status: In Progress
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Kraken : 11.2.1
radosgw: S3 API.

When using rclone and the Ceph S3 API, we saw XML errors not long after
first use.

Failed to copy: SerializationError: failed to decode REST XML response
caused by: XML syntax error on line 1: illegal character code U+0017

Using 'rclone --dump-headers --dump-bodies' we were able to identify that
the XML returned included non-UTF-8 data in the UploadId field.

{{{
<InitiateMultipartUploadResult
xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Tenant>tennantID</Tenant><Bucket>Photos_dump4</Bucket><Key>Photos/IMG_8288.CR2</Key><UploadId>2~<E0>9
<EC><94> V</UploadId></InitiateMultipartUploadResult>
2017/11/15 09:52:34 DEBUG :
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2017/11/15 09:52:34 ERROR : PhotosIMG_8288.CR2: Failed to copy:
SerializationError: failed to decode REST XML response
caused by: XML syntax error on line 1: invalid UTF-8
}}}
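As a rough illustration of the two client-side errors above, the following Python sketch classifies a response body the way an XML decoder would: first as invalid UTF-8, then as containing a control character that XML 1.0 forbids. The function name check_xml_body is hypothetical. The first sample reuses the bytes shown in the dump above; the second sample is an invented UploadId containing U+0017.

```python
import re

# C0 control characters that XML 1.0 forbids:
# everything below U+0020 except tab, LF and CR.
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def check_xml_body(body: bytes) -> str:
    """Return 'ok' or a short description of why the body is not valid XML text."""
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid UTF-8"
    match = _ILLEGAL_XML.search(text)
    if match:
        return "illegal character code U+%04X" % ord(match.group())
    return "ok"


# Bytes taken from the dumped UploadId above: 2~<E0>9, newline, <EC><94>, space, V
print(check_xml_body(b"<UploadId>2~\xe09\n\xec\x94 V</UploadId>"))
# Hypothetical UploadId containing U+0017
print(check_xml_body(b"<UploadId>2~\x17abc</UploadId>"))
```

A check like this only detects the symptom on the client; it does not address why the gateway produced a non-printable UploadId.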

We have noticed that this happens during multipart uploads, where
random IDs are generated.

We were able to trace this to the API being unable to obtain random bytes.

{{{
Nov 15 09:41:39 rgw1 radosgw: 2017-11-15 09:52:34 7fca2b776700 -1 cannot
get random bytes: (24) Too many open files
}}}
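The mechanism behind this log line can be reproduced in isolation. The sketch below assumes that the random bytes come from opening a device such as /dev/urandom (an assumption about radosgw, not confirmed in this report). It lowers the process file descriptor limit, deliberately leaks descriptors until the limit is hit, and then shows that opening /dev/urandom fails with errno 24 (EMFILE). The function name and the limit of 64 are illustrative only.

```python
import errno
import os
import resource


def urandom_errno_when_fds_exhausted(limit=64):
    """Exhaust file descriptors, then try to open /dev/urandom.

    Returns 0 if the open succeeded, otherwise the errno of the failure.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Only ever lower the soft limit; never try to raise it.
    new_soft = limit if soft == resource.RLIM_INFINITY else min(limit, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    held = []
    try:
        try:
            # Stand-in for the leak: keep opening descriptors and never close them.
            while True:
                held.append(os.open(os.devnull, os.O_RDONLY))
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                raise
        try:
            fd = os.open("/dev/urandom", os.O_RDONLY)
        except OSError as exc:
            return exc.errno
        os.close(fd)
        return 0
    finally:
        # Release the leaked descriptors and restore the original limit.
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


code = urandom_errno_when_fds_exhausted()
print(code, os.strerror(code))
```

If the assumption holds, any code path that needs a new descriptor, not only the random source, would fail the same way once the limit is reached.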

After investigating at the OS level, we noticed that we had run out of file
descriptors, seemingly used up by open sockets to the OSDs.

lsof -p 1427 (radosgw) |wc -l
~1100

As an example, a specific OSD host with 11 disks has 115 sockets open.

lsof -p 456|awk '{print $9}'|grep 10.10.10.32 |wc -l
115
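To get the same count for every remote host at once rather than one grep per address, the lsof output can be tallied by remote endpoint. This is a minimal sketch, assuming the default lsof column layout in which the ninth field is NAME and TCP connections appear as local->remote. The function name and the sample lines are hypothetical, not taken from the affected system.

```python
from collections import Counter


def count_remote_hosts(lsof_lines):
    """Count open connections per remote host from lsof output lines."""
    counts = Counter()
    for line in lsof_lines:
        fields = line.split()
        # Field 9 (index 8) is NAME; only connected sockets contain "->".
        if len(fields) < 9 or "->" not in fields[8]:
            continue
        remote = fields[8].split("->", 1)[1]
        host = remote.rsplit(":", 1)[0]  # drop the port
        counts[host] += 1
    return counts


# Hypothetical sample in the default lsof layout.
sample = [
    "COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME",
    "radosgw 1427 ceph   10u  IPv4  90001      0t0  TCP 10.10.10.5:40001->10.10.10.32:6800 (ESTABLISHED)",
    "radosgw 1427 ceph   11u  IPv4  90002      0t0  TCP 10.10.10.5:40002->10.10.10.32:6804 (ESTABLISHED)",
    "radosgw 1427 ceph   12u  IPv4  90003      0t0  TCP 10.10.10.5:40003->10.10.10.33:6800 (ESTABLISHED)",
    "radosgw 1427 ceph    3r   CHR    1,9      0t0 1033 /dev/urandom",
]
for host, n in count_remote_hosts(sample).most_common():
    print(host, n)
```

On a live system the input lines could come from 'lsof -n -P -p PID', where -n and -P keep addresses and ports numeric. Sampling the counts before and after the S3 load stops would show whether the sockets are ever released.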

A restart of the API cleared up these leaked sockets, and we are able to
reproduce this repeatedly using the S3 API. The sockets were not
automatically cleaned up when the API load was stopped.

We believe this to be a socket leak.


Related issues 1 (0 open, 1 closed)

Related to rgw - Bug #21401: rgw: Missing error handling when gen_rand_alphanumeric is failing (Resolved, Casey Bodley, 09/15/2017)
