Adam Kupczyk wrote:
PR https://github.com/ceph/ceph/pull/43337 was based on assumption that BlueStore internal data representation is opaque.
For the client (OSD is ObjectStore's client) it was irrelevant how the data is stored for as long as the requests were handled correctly.
The stats retrieved from the store reflected its internal state.
Now it seems we want it to be a bit different.
We expect that an action will have results outside ObjectStore contract.
Hi Adam,
Yeah, I think part of the problem is that it's not clear what that contract is, and when compression was introduced it muddied the waters even further. The librados API, which is the interface that the end user is presented with, doesn't document these semantics at all. At lower layers that are still above the actual backing store (CEPH_OSD_OP ops, PGTransaction, Transaction, etc.) there are some scattered comments, and those are what actually directed the implementation of "rbd create --thick-provision" (in Mimic IIRC):
/**
 * Write data to an offset within an object. If the object is too
 * small, it is expanded as needed. It is possible to specify an
 * offset beyond the current end of an object and it will be
 * expanded as needed. Simple implementations of ObjectStore will
 * just zero the data between the old end of the object and the
 * newly provided data. More sophisticated implementations of
 * ObjectStore will omit the untouched data and store it as a
 * "hole" in the file.
 *
 * Note that a 0-length write does not affect the size of the object.
 */
void write(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len,
           const ceph::buffer::list& write_data, uint32_t flags = 0) {

/**
 * zero out the indicated byte range within an object. Some
 * ObjectStore instances may optimize this to release the
 * underlying storage space.
 *
 * If the zero range extends beyond the end of the object, the object
 * size is extended, just as if we were writing a buffer full of zeros.
 * EXCEPT if the length is 0, in which case (just like a 0-length write)
 * we do not adjust the object size.
 */
void zero(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len) {
The possibility of releasing the underlying storage space is noted for OP_ZERO but not for OP_WRITE. This is why librbd happily uses OP_ZERO for discard/TRIM and sticks with OP_WRITE for explicit zeroing.
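To make the distinction concrete, here is a toy model of how I read those two comments. It is not BlueStore or ObjectStore code: ToyObject, kBlock and the block-granular bookkeeping are all made up for illustration, and the "release fully covered blocks" behavior of zero() is only the optional optimization the comment allows, not something every store does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical allocation granularity for this toy model only.
constexpr uint64_t kBlock = 4096;

// Toy object: logical size plus the set of block indexes that hold storage.
struct ToyObject {
  uint64_t size = 0;
  std::set<uint64_t> blocks;

  uint64_t allocated_bytes() const { return blocks.size() * kBlock; }

  // OP_WRITE as described: extends the object, always leaves the written
  // range backed by storage. A 0-length write changes nothing.
  void write(uint64_t off, uint64_t len) {
    if (len == 0) return;
    assert(off + len >= off);  // no offset overflow in this sketch
    size = std::max(size, off + len);
    for (uint64_t b = off / kBlock; b <= (off + len - 1) / kBlock; ++b)
      blocks.insert(b);
  }

  // OP_ZERO as described: extends the object like a write of zeros, but
  // MAY release storage. Here only fully covered blocks are released.
  void zero(uint64_t off, uint64_t len) {
    if (len == 0) return;
    assert(off + len >= off);
    size = std::max(size, off + len);
    const uint64_t first = (off + kBlock - 1) / kBlock;  // first full block
    const uint64_t end = (off + len) / kBlock;           // one past last full block
    for (uint64_t b = first; b < end; ++b)
      blocks.erase(b);
  }
};
```

In this model a client that wants space to stay reserved has to use write(), and a client that wants space back uses zero(), which is exactly the split librbd relies on.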
In general, most of these deeply ingrained assumptions go back to FileStore days. Note that some of these comments still talk about files ;)
Backing store compression messes with a lot of such assumptions but, for "rbd create --thick-provision", I think it is relatively well known that it falls over in that case. The new zero block detection is much worse though, as it is enabled by default and there is no configuration option to disable it.
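A rough sketch of why zero block detection defeats thick provisioning, under the assumption that a detected all-zero write simply allocates nothing. ToyStore and its fields are hypothetical names for illustration, not the BlueStore implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy store that only tracks free space. Hypothetical, not BlueStore.
struct ToyStore {
  uint64_t free_bytes;
  bool zero_detection;

  void write(const std::vector<uint8_t>& data) {
    const bool all_zero = std::all_of(data.begin(), data.end(),
                                      [](uint8_t b) { return b == 0; });
    // Assumed behavior: a detected zero buffer is dropped, nothing is allocated.
    if (zero_detection && all_zero) return;
    assert(data.size() <= free_bytes);
    free_bytes -= data.size();  // explicit data consumes space
  }
};
```

With zero_detection set, a thick-provisioning client can write zeros over a whole image and free_bytes never moves, so the stats no longer reflect the reservation the client thinks it made.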
The expected new behavior is:
- when writing zeros with hint "PROVISIONING", do not write data, but allocate disk space and mark region as "ZEROS"
Currently there is no PROVISIONING hint. Do you intend to introduce one?
The intended result is:
- stats are properly tracked, so that free space is reduced when thick-provisioned objects are placed
I think that to make this consistent we must take care of:
1) Move of thick-provisioned object to other OSD must preserve thick-provisioned behavior
Wasn't that true prior to https://github.com/ceph/ceph/pull/43337? IIRC it is based on ObjectStore::fiemap which, prior to https://github.com/ceph/ceph/pull/43337, returned a full-sized extent for a bunch of explicitly written zeroes. Or am I misremembering?
2) Compression of objects created by thick-provisioning be disabled
3) Cloning thick-provisioned regions ("ZEROS") should actually re-allocate
IMO cloning in general shouldn't change anything about the cloned region. If it is compressed, it should stay that way. If it isn't compressed, it should stay that way. If it is allocated-but-unwritten (what you refer to as "ZEROS" above), it should stay that way. Finally, if it is a bunch of explicitly written zeroes, it should stay that way too.
Please note that when writing to objects BlueStore is not using already allocated space but
always allocates new space and releases previous one. This means that thick-provisioned space will not be used even once.
Yup, this is understood -- and, as long as the previously allocated space is "carried over" (i.e. the same amount of space gets allocated in the new location), is the expected behavior.
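For what it's worth, this is the accounting I have in mind by "carried over", as a toy sketch. ToyAllocator and overwrite() are hypothetical names, and the only thing taken from the discussion is the allocate-new-then-release-old order:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy free-space accounting. Hypothetical, not the BlueStore allocator.
struct ToyAllocator {
  uint64_t free_bytes;
  uint64_t min_free_seen;  // lowest free space observed so far

  explicit ToyAllocator(uint64_t capacity)
      : free_bytes(capacity), min_free_seen(capacity) {}

  void allocate(uint64_t n) {
    assert(n <= free_bytes);
    free_bytes -= n;
    min_free_seen = std::min(min_free_seen, free_bytes);
  }

  void release(uint64_t n) { free_bytes += n; }
};

// Overwrite an extent: new space is allocated first, then the old space is
// released. Returns the length of the extent that now backs the data.
uint64_t overwrite(ToyAllocator& alloc, uint64_t old_len, uint64_t new_len) {
  alloc.allocate(new_len);
  alloc.release(old_len);
  return new_len;
}
```

As long as new_len equals old_len, free_bytes ends up where it started, even though the reserved blocks themselves were never reused. The transient dip recorded in min_free_seen is the only visible difference.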