Bug #22225
rgw: socket leak in S3 multipart upload
Description
Kraken: 11.2.1
radosgw: S3 API.
When using rclone against the Ceph S3 API, we saw XML errors shortly after
first use:
Failed to copy: SerializationError: failed to decode REST XML response
caused by: XML syntax error on line 1: illegal character code U+0017
Using 'rclone --dump-headers --dump-bodies' we were able to identify that
the XML returned included non-UTF-8 data in the UploadId field.
{{{
<InitiateMultipartUploadResult
xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Tenant>tennantID</Tenant><Bucket>Photos_dump4</Bucket><Key>Photos/IMG_8288.CR2</Key><UploadId>2~<E0>9
<EC><94> V</UploadId></InitiateMultipartUploadResult>
2017/11/15 09:52:34 DEBUG :
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2017/11/15 09:52:34 ERROR : Photos/IMG_8288.CR2: Failed to copy:
SerializationError: failed to decode REST XML response
caused by: XML syntax error on line 1: invalid UTF-8
}}}
We have noticed that this happens during multipart uploads, where random
IDs are generated. We were able to trace it to radosgw being unable to
obtain random bytes:
{{{
Nov 15 09:41:39 rgw1 radosgw: 2017-11-15 09:52:34 7fca2b776700 -1 cannot
get random bytes: (24) Too many open files
}}}
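The failure above can be reproduced outside radosgw: once a process has exhausted its file descriptor limit, opening the kernel random source fails with errno 24 (EMFILE), exactly as in the log. A minimal shell sketch (not radosgw code; the limit of 10 is arbitrary):

```shell
# Run in a subshell so the lowered limit does not affect your login shell.
(
  ulimit -n 10                  # allow only fds 0-9 in this subshell
  for fd in 3 4 5 6 7 8 9; do
    eval "exec $fd</dev/null"   # occupy every spare descriptor,
  done                          # standing in for the leaked OSD sockets
  # With no descriptors left, reading random bytes now fails, just as
  # radosgw's "cannot get random bytes: (24) Too many open files" did.
  head -c 16 /dev/urandom >/dev/null 2>&1
  echo "reading /dev/urandom exited with status $?"
)
```

The related ticket (#21401) is about this error path not being handled, which is how a failed random read ends up as garbage bytes in the UploadId.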
After investigating on the OS, we noticed that we had run out of file
descriptors, which were seemingly being used up by open sockets to the OSDs.
lsof -p 1427 | wc -l    (PID 1427 = radosgw)
~1100
As an example, for one OSD host with 11 disks, radosgw has 115 sockets open:
lsof -p 456|awk '{print $9}'|grep 10.10.10.32 |wc -l
115
A restart of the API cleared up these leaked sockets and we are able to
replicate this repeatedly using the S3 API. These sockets were not
automatically cleaned up when the API load was stopped.
We believe this to be a socket leak.
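For anyone reproducing this, the same descriptor counts can be gathered without lsof, straight from /proc and ss (PID 1427 and the radosgw process name are taken from the report above; adjust as needed):

```shell
# Total open descriptors for the radosgw process:
ls /proc/1427/fd | wc -l
# How many of them are sockets (socket fds show as "socket:[inode]"):
ls -l /proc/1427/fd | grep -c socket
# Established TCP sockets grouped by peer address, to spot which
# OSD hosts are holding them:
ss -tnp | awk '/radosgw/ {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn
```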
Updated by Nathan Cutler over 6 years ago
Kraken is EOL. Is the issue reproducible on Luminous or master?
Updated by John Spray over 6 years ago
- Project changed from Ceph to rgw
- Category deleted (22)
Updated by Matt Benjamin over 6 years ago
We think this is a duplicate of:
http://tracker.ceph.com/issues/21401
(will backport to Luminous)
Updated by Matt Benjamin over 6 years ago
- Status changed from Duplicate to In Progress
Updated by Matt Benjamin over 6 years ago
Sorry, related but not a duplicate. After triage, we're interested in:
1. behavior on L or master
2. Ceph configuration, including values for threads and num_rados_handles
Matt
Updated by Nick Janus almost 6 years ago
Matt,
I believe we've run into the same issue:
$ sudo radosgw-admin bucket list --bucket xxx | grep 'docker/registry/v2/repositories/xxx/_uploads/uuid/data'
"name": "_multipart_docker/registry/v2/repositories/xxx/_uploads/uuid/data.2~��������������������������������.meta",
This also results in XML parsing errors with S3 CLI clients. We also had an issue with rgw running up against its file descriptor limit a couple of weeks ago.
1. We're hitting this running 12.2.2
2. rgw_thread_pool_size is 8000 and num_rados_handles is default (1)
This is the config we're using for rgw:
[client.rgw.prod-rgw]
rgw_realm = us
rgw_gc_max_objs = 512
rgw_obj_stripe_size = 20971520
rgw_frontends = "civetweb port=7480 access_log_file=/var/log/ceph/radosgw.access.log error_log_file=/var/log/ceph/radosgw.error.log"
objecter_inflight_ops = 10240
rgw_enable_usage_log = true # we also have nodes without usage logging turned on
objecter_inflight_op_bytes = 2147483648
rgw_enable_ops_log = false
rgw_thread_pool_size = 8000
rgw_dynamic_resharding = true
rgw_enable_apis = s3, admin
rgw_override_bucket_index_max_shards = 32
rgw_resolve_cname = false
rgw_enable_static_website = false
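Not from this thread, but until the leak itself is fixed, a common stopgap is to raise the daemon's descriptor limit via a systemd drop-in (the drop-in path and the 65536 value here are illustrative):

```ini
# /etc/systemd/system/ceph-radosgw@.service.d/limits.conf  (illustrative)
[Service]
LimitNOFILE=65536
```

followed by 'systemctl daemon-reload' and a restart of the radosgw unit. This only delays exhaustion; the leaked sockets still accumulate.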
Updated by Casey Bodley almost 6 years ago
- Related to Bug #21401: rgw: Missing error handling when gen_rand_alphanumeric is failing added