RGW allows users to create buckets and objects with invalid names
From the "Amazon Simple Storage Service Developer Guide", API Version 2006-03-01: ("Object Key and Metadata")
"The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long."
We need to make sure all key names and bucket names are valid UTF-8, with an encoding no longer than 1024 bytes. Two good libraries for this are Glib::ustring and ICU (libicu). Unfortunately, RGW currently has no knowledge of Unicode at all!
This is a serious problem for several reasons:
1. S3 clients will choke on a bucket listing that has invalid key names in it.
For example, try creating an object with a control character in the name. Once one of these baddies has been created in an RGW bucket, it's effectively impossible to list the objects in that bucket: RGW sends back a list of object names, but your S3 client will choke on it. It's also impossible to destroy such an object, because you cannot send a properly encoded XML message that refers to it. The only solution is to destroy the bucket and start over.
I haven't tried this with a bucket name, but I suspect that it would be even worse, since there's no "destroy all buckets" command to save you.
2. Echoing filenames that contain control characters can cause major security holes.
(see http://seclists.org/fulldisclosure/2003/Feb/att-341/Termulation.txt for an example)
None of these security issues can bite you if you validate names as UTF-8 like you're supposed to and also reject control characters (which are, on their own, valid UTF-8).
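The checks discussed so far (valid UTF-8, at most 1024 bytes, no control characters) could be sketched roughly as follows. This is only an illustration in Python, not RGW code: `check_key_name` and `MAX_KEY_BYTES` are hypothetical names, and rejecting every character in Unicode category "Cc" is an assumption about policy rather than something the S3 guide spells out.

```python
import unicodedata
from typing import Optional

MAX_KEY_BYTES = 1024  # limit quoted from the S3 developer guide


def check_key_name(raw: bytes) -> Optional[str]:
    """Return None if the name looks acceptable, else a short reason."""
    # The limit applies to the encoded form, so measure bytes, not characters.
    if len(raw) > MAX_KEY_BYTES:
        return "UTF-8 encoding longer than 1024 bytes"
    try:
        # Strict decoding rejects malformed and overlong sequences.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    for ch in text:
        # Assumption: refuse all control characters (category "Cc"),
        # which includes ESC (0x1b), NUL and the C1 range.
        if unicodedata.category(ch) == "Cc":
            return "contains control character U+%04X" % ord(ch)
    return None
```

A caller would run this on the raw bytes taken from the request before creating the object, and return an error to the client if the result is not None.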
Under certain conditions, RGW will need to sanitize the returned headers so that they are ASCII:
> When metadata is retrieved through the REST API, Amazon S3 combines
> headers that have the same name (ignoring case) into a comma-delimited
> list. If some metadata contains unprintable characters, it is not
> returned. Instead, the "x-amz-missing-meta" header is returned with a
> value of the number of the unprintable metadata entries. Each name, value
> pair must conform to US-ASCII when using REST and UTF-8 when using SOAP or
> browser-based uploads via POST.
That's US-ASCII, not UTF-8. Sorry.
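To make the quoted behaviour concrete, here is a rough Python sketch of how metadata headers might be sanitized on the way out. It is a guess at the shape of the logic, not RGW or Amazon code: `build_meta_headers` and `is_printable_ascii` are made-up names, "printable" is assumed to mean bytes 0x20 through 0x7e, and the exact comma-joining format is an assumption.

```python
def is_printable_ascii(s):
    # Assumption: printable US-ASCII is 0x20 (space) through 0x7e (~).
    return all(0x20 <= ord(c) <= 0x7E for c in s)


def build_meta_headers(metadata):
    """Build response headers from (name, value) metadata pairs.

    Entries containing anything outside printable US-ASCII are dropped
    and counted in x-amz-missing-meta instead of being returned.
    """
    combined = {}
    missing = 0
    for name, value in metadata:
        if not is_printable_ascii(name) or not is_printable_ascii(value):
            missing += 1
            continue
        key = name.lower()  # same name, ignoring case
        if key in combined:
            # Combine repeated headers into a comma-delimited list.
            combined[key] = combined[key] + "," + value
        else:
            combined[key] = value
    if missing:
        combined["x-amz-missing-meta"] = str(missing)
    return combined
```

For example, two entries named `x-amz-meta-Color` and `x-amz-meta-color` would come back as a single `x-amz-meta-color` header, while an entry whose value contains a bell character would only show up in the `x-amz-missing-meta` count.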
And yet another thing! We need to do XML escaping on characters such as "<", ">" and "&" (and possibly others; the full list should be in the W3C XML specification, I guess).
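As a small illustration of the escaping step, the Python standard library's `xml.sax.saxutils.escape` handles "&", "<" and ">" and can be given extra entities for quotes. The `key_element` helper is hypothetical and only shows where the escaping would sit when a listing entry is written out. Note that escaping alone does not help with most control characters, which XML 1.0 does not allow in a document at all.

```python
from xml.sax.saxutils import escape

# Extra entities beyond the three that escape() always handles.
_QUOTES = {'"': "&quot;", "'": "&apos;"}


def key_element(name):
    """Wrap an object name in a <Key> element, escaping markup characters."""
    return "<Key>" + escape(name, _QUOTES) + "</Key>"
```

With this, a name like `a<b>&"c"` is emitted as `a&lt;b&gt;&amp;&quot;c&quot;` inside the element instead of breaking the listing.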
It seems like we have a lot of work to do here.
#2 Updated by Colin McCabe over 9 years ago
I'm not having any luck creating Unicode bucket names:
cmccabe@metropolis:~/src/ceph/src$ s3amazon create cr?zy
WARNING: Bucket name is not valid for virtual-host style URI access.
Bucket not created. Use -f option to force the bucket to be created despite this warning.
but then -f just doesn't work either... with
I think maybe we can constrain bucket names to good old ASCII? I need to find out what the real constraints are.
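The real constraints are still an open question here, so the following is only a sketch of what an ASCII-only check might look like. It assumes the DNS-compatible rules usually given for virtual-host style access: 3 to 63 characters, lowercase letters, digits, hyphens and dots, each dot-separated label starting and ending with a letter or digit, and not shaped like an IPv4 address. `is_dns_style_bucket_name` is a hypothetical name, and the rules should be checked against the S3 documentation before anyone relies on them.

```python
import re

# One dot-separated label: letters/digits at both ends, hyphens allowed inside.
_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")
# Four dotted decimal groups, i.e. something that looks like an IPv4 address.
_IPV4 = re.compile(r"[0-9]{1,3}(\.[0-9]{1,3}){3}")


def is_dns_style_bucket_name(name):
    """Return True if name fits the assumed DNS-compatible bucket rules."""
    if not 3 <= len(name) <= 63:
        return False
    if _IPV4.fullmatch(name):
        return False
    # An empty label (leading, trailing or doubled dot) fails fullmatch.
    return all(_LABEL.fullmatch(label) for label in name.split("."))
```

Under these assumed rules any non-ASCII name is rejected outright, which matches the warning that s3amazon printed above.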