previous/current on-disk format:

Each bucket is a rados pool. All rados objects are named in the form [__namespace__]oid, where the oid is the S3 oid the user provides. Most objects do not have a namespace.

Each S3 object has a tag, which is a random sequence of bits (32 bits?). The tag is newly generated each time the object is modified. Whenever an S3 object is modified, RGW first clones the rados object into a new object named in the form __shadow__oid_tag, where tag is the tag of the original object. (Shadow objects do not themselves get a tag.) These shadow objects are cleaned up by something external to RGW.

Other valid namespaces are __multipart__ and __tmp__. These are used for the multi-part upload format (and then cloned into place in the real object), and for non-multi-part object uploads, respectively. Neither of these gets tags. Multipart-upload objects also get a metadata object. (tmp objects are lost if the upload fails somehow! We should handle this somewhere, probably in the same tool that handles cleanup of shadow objects.)

new in-development format:

We have a controlled number of pools. At present, buckets are simply placed randomly into a pool -- in the future, we can have a more advanced layout policy based on replication, location, or speed requirements (or whatever).

Each bucket gets assigned a unique ID. At present these IDs are based on getting the pg version by no-oping a specific object, and they are unique within a pool but not across pools. {To facilitate migrations and simplify administration, these should probably be made globally unique within our system -- but for now, the pool and bucket ID can be used to form a globally unique identifier.}

All S3 objects are stored in rados objects with a name of the form bucketid_[__namespace__]oid. Shadow and temporary objects work as they did previously.

However, file updates are now done quite differently. We maintain an index object containing a list of all objects which exist or might exist, keyed by oid.
Each oid has an associated size, mtime, object locator, etc. In addition to this metadata, the index maintains for each oid:
1) a last-updated version for the object (this is the pg version on the OSDs);
2) a "delete" flag, set to true or false;
3) a pending list of tag and time pairs.

This index is maintained via a two-phase commit protocol by RGW:

Whenever RGW is going to update an object (in a new put, a replacement put, or a deletion), it first sends a prepare op to the index. This op includes the object name (including namespace, but excluding bucket id), tag (for the existing state of the object[1]), and object locator. If there is no existing object, the tag is randomly generated so it can be tracked correctly. The index atomically creates the entry for the object in the list (if necessary), adds the tag with the current time to the pending list, and writes the result back to disk.

Then RGW does the actual update to the object.

Then RGW sends a commit message to the index. This commit contains the oid, old tag, version, and other metadata. The index atomically examines the version, applies the changes if the message's version is newer than the current one, and removes the named tag from the pending list. If the operation was a delete, then either the object is removed from the index, if there are no remaining tags, or the delete flag is set, while other tags remain.

During listing, if the reader discovers objects with uncommitted operations, it looks at the actual state of the object and reports those results. It then trims any sufficiently old (24 hours?) tag operations out of the index. If there are no remaining tags, it then updates the index to include the actual state of the object.

For this purpose, the delete flag is considered an uncommitted operation -- if it is set, the reader must check the state of the object, and remove it from the index if the object no longer exists and there are no in-date tags. (If the object does exist and there are no in-date tags, it must remove the flag.)
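
The prepare/commit/listing rules above can be illustrated with a small in-memory model. This is only a sketch and not RGW code: the names BucketIndex, Entry, stat_object, and PENDING_MAX_AGE are hypothetical, the index is a plain Python dict rather than a rados object, and atomicity is assumed rather than implemented. Two behaviours the text leaves open are filled in as labeled assumptions: a stale commit (version not newer) only removes its tag, and an entry with no in-date tags whose rados object is missing is dropped during listing.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

PENDING_MAX_AGE = 24 * 60 * 60  # seconds; the text's tentative "24 hours?"


@dataclass
class Entry:
    meta: Optional[dict] = None   # size, mtime, locator, ...; None until a commit lands
    version: int = 0              # last-updated version (pg version on the OSDs)
    delete: bool = False          # the "delete" flag
    pending: Dict[str, float] = field(default_factory=dict)  # tag -> prepare time


class BucketIndex:
    """Hypothetical in-memory stand-in for the per-bucket index object."""

    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}

    def prepare(self, oid: str, tag: str, now: Optional[float] = None) -> None:
        # Phase 1: create the entry if necessary and record the tag with the time.
        now = time.time() if now is None else now
        self.entries.setdefault(oid, Entry()).pending[tag] = now

    def commit(self, oid: str, tag: str, version: int,
               meta: Optional[dict] = None, is_delete: bool = False) -> None:
        # Phase 2: the named tag is always removed from the pending list.
        entry = self.entries[oid]
        entry.pending.pop(tag, None)
        if version <= entry.version:
            return  # assumption: a stale commit changes nothing else
        entry.version = version
        if not is_delete:
            entry.meta = meta
        elif entry.pending:
            entry.delete = True      # other tags remain: only flag the delete
        else:
            del self.entries[oid]    # no remaining tags: drop the entry

    def list_objects(self, stat_object: Callable[[str], Optional[Tuple[int, dict]]],
                     now: Optional[float] = None) -> Dict[str, Optional[dict]]:
        # stat_object(oid) returns (version, meta) for the real rados object,
        # or None if it does not exist. It is supplied by the caller.
        now = time.time() if now is None else now
        result: Dict[str, Optional[dict]] = {}
        for oid in list(self.entries):
            entry = self.entries[oid]
            if not entry.pending and not entry.delete:
                result[oid] = entry.meta          # fully committed entry
                continue
            actual = stat_object(oid)             # uncommitted: trust the object
            if actual is not None:
                result[oid] = actual[1]
            # Trim pending tags that are too old to still be in flight.
            entry.pending = {t: ts for t, ts in entry.pending.items()
                             if now - ts < PENDING_MAX_AGE}
            if entry.pending:
                continue                          # in-date tags remain: leave as is
            if actual is None:
                del self.entries[oid]             # object gone, nothing in flight
            else:
                entry.version, entry.meta = actual
                entry.delete = False              # object exists: clear the flag
        return result
```

A caller would run prepare, perform the rados update, then commit with the resulting version; a writer that crashes between prepare and commit leaves only a pending tag, which list_objects resolves against the real object and eventually trims.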