Bug #22225


rgw: socket leak in S3 multipart upload

Added by Ross Martyn over 6 years ago. Updated almost 6 years ago.

Status: In Progress
Priority: Normal
Assignee: -
Target version: -
% Done: 0%

Regression: No
Severity: 3 - minor

Description

Kraken: 11.2.1
radosgw: S3 API.

When using rclone and the Ceph S3 API, we saw XML errors not long after
first use.

Failed to copy: SerializationError: failed to decode REST XML response
caused by: XML syntax error on line 1: illegal character code U+0017

Using 'rclone --dump-headers --dump-bodies' we were able to identify that
the XML returned included non-UTF-8 encoded data in the UploadId field.

{{{
<InitiateMultipartUploadResult
xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Tenant>tennantID</Tenant><Bucket>Photos_dump4</Bucket><Key>Photos/IMG_8288.CR2</Key><UploadId>2~<E0>9
<EC><94> V</UploadId></InitiateMultipartUploadResult>
2017/11/15 09:52:34 DEBUG :
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2017/11/15 09:52:34 ERROR : PhotosIMG_8288.CR2: Failed to copy:
SerializationError: failed to decode REST XML response
caused by: XML syntax error on line 1: invalid UTF-8
}}}

We have noticed that this happens during multipart uploads, where random
IDs are generated.

We were able to trace this to the API not being able to generate randomness.

{{{
Nov 15 09:41:39 rgw1 radosgw: 2017-11-15 09:52:34 7fca2b776700 -1 cannot
get random bytes: (24) Too many open files
}}}
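
For illustration only, here is a minimal C++ sketch of the failure mode this suggests. It is not the actual RGW code, and the names read_urandom, make_upload_id_unchecked and make_upload_id_checked are hypothetical stand-ins. The point is that if the random-byte source fails (for example with EMFILE, error 24, when the process has no free file descriptors) and the caller ignores that error, the UploadId buffer is left holding whatever bytes it already contained, and those bytes are serialized verbatim into the <UploadId> XML element.

{{{
// Hypothetical sketch only -- not the Ceph/RGW implementation.
#include <cerrno>   // errno, EIO
#include <cstddef>  // std::size_t
#include <cstdio>   // std::FILE, std::fopen, std::fread, std::fclose
#include <string>

// Stand-in for a low-level random source that can fail under fd exhaustion.
static int read_urandom(char *buf, std::size_t len) {
  std::FILE *f = std::fopen("/dev/urandom", "rb");
  if (!f)
    return -errno;                        // e.g. -EMFILE ("Too many open files")
  std::size_t got = std::fread(buf, 1, len, f);
  std::fclose(f);
  return got == len ? 0 : -EIO;
}

// Buggy pattern: the return value is ignored, so on failure the id is built
// from an uninitialized buffer and may contain non-UTF-8 control bytes.
static std::string make_upload_id_unchecked() {
  char buf[16];                           // intentionally left uninitialized
  (void)read_urandom(buf, sizeof(buf));   // BUG: failure silently ignored
  return std::string(buf, sizeof(buf));   // garbage bytes if the read failed
}

// Fixed pattern: propagate the error so the caller can return an error
// response instead of emitting a malformed UploadId.
static int make_upload_id_checked(std::string *out) {
  char buf[16];
  int r = read_urandom(buf, sizeof(buf));
  if (r < 0)
    return r;
  out->assign(buf, sizeof(buf));
  return 0;
}
}}}

The related ticket #21401 ("rgw: Missing error handling when gen_rand_alphanumeric is failing") tracks adding this kind of error check on the RGW side; it does not by itself explain why the sockets to the OSDs are leaked in the first place.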

After investigating on the OS, we noticed that we had run out of file
descriptors, which were seemingly being used up by open sockets to the OSDs.

{{{
lsof -p 1427 | wc -l    # pid 1427 is radosgw
~1100
}}}

As an example, a specific OSD host with 11 disks in it has 115 sockets open.

{{{
lsof -p 456 | awk '{print $9}' | grep 10.10.10.32 | wc -l
115
}}}

A restart of the API cleared up these leaked sockets and we are able to
replicate this repeatedly using the S3 API. These sockets were not
automatically cleaned up when the API load was stopped.

We believe this to be a socket leak.


Related issues 1 (0 open, 1 closed)

Related to rgw - Bug #21401: rgw: Missing error handling when gen_rand_alphanumeric is failing (Resolved, Casey Bodley, 09/15/2017)

Actions #1

Updated by Nathan Cutler over 6 years ago

Kraken is EOL. Is the issue reproducible on Luminous or master?

Actions #2

Updated by John Spray over 6 years ago

  • Project changed from Ceph to rgw
  • Category deleted (22)
Actions #3

Updated by Matt Benjamin over 6 years ago

We think this is a duplicate of:
http://tracker.ceph.com/issues/21401
(will backport to Luminous)

Actions #4

Updated by Matt Benjamin over 6 years ago

  • Status changed from New to Duplicate
Actions #5

Updated by Matt Benjamin over 6 years ago

  • Status changed from Duplicate to In Progress
Actions #6

Updated by Matt Benjamin over 6 years ago

Sorry, related but not a duplicate. After triage, we're interested in:
1. behavior on L (Luminous) or master
2. Ceph configuration, including values for threads and num_rados_handles

Matt

Actions #7

Updated by Nick Janus almost 6 years ago

Matt,

I believe we've run into the same issue:

$ sudo radosgw-admin bucket list --bucket xxx | grep 'docker/registry/v2/repositories/xxx/_uploads/uuid/data'
        "name": "_multipart_docker/registry/v2/repositories/xxx/_uploads/uuid/data.2~��������������������������������.meta",

This also results in XML parsing errors with S3 CLI clients. We also had an issue with rgw running up against its file descriptor limit a couple of weeks ago.

1. We're hitting this running 12.2.2
2. rgw_thread_pool_size is 8000 and num_rados_handles is default (1)

This is the config we're using for rgw:

{{{
[client.rgw.prod-rgw]
rgw_realm = us
rgw_gc_max_objs = 512
rgw_obj_stripe_size = 20971520
rgw_frontends = "civetweb port=7480 access_log_file=/var/log/ceph/radosgw.access.log error_log_file=/var/log/ceph/radosgw.error.log"
objecter_inflight_ops = 10240
rgw_enable_usage_log = true # we also have nodes without usage logging turned on
objecter_inflight_op_bytes = 2147483648
rgw_enable_ops_log = false
rgw_thread_pool_size = 8000
rgw_dynamic_resharding = true
rgw_enable_apis = s3, admin
rgw_override_bucket_index_max_shards = 32
rgw_resolve_cname = false
rgw_enable_static_website = false
}}}

Actions #8

Updated by Casey Bodley almost 6 years ago

  • Related to Bug #21401: rgw: Missing error handling when gen_rand_alphanumeric is failing added
