Bug #55521

bluestore zero block detection makes all-zeroes writes deallocate space

Added by Ilya Dryomov 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version: -
% Done: 100%
Source:
Tags:
Backport: quincy
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is completely contrary to the original RFE to have a writesame of zeroes be treated as a request to allocate the specified amount of space but avoid the need to actually zero the space (i.e. track that the extent is in-use but flag it as being zeroed/uninitialized). The main desire there was to optimize "rbd create --thick-provision", which is supposed to allocate space.

In pacific:

$ rbd create --size 1G thinimage
$ rbd create --size 1G --thick-provision thickimage
$ rbd du
NAME        PROVISIONED  USED 
thickimage        1 GiB  1 GiB
thinimage         1 GiB    0 B
<TOTAL>           2 GiB  1 GiB

$ ceph df
[...]
--- POOLS ---
POOL  ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
rbd    1   32  1.0 GiB      265  1.0 GiB  20.24    3.9 GiB

In quincy, as a result of https://github.com/ceph/ceph/pull/43337:

$ rbd create --size 1G thinimage
$ rbd create --size 1G --thick-provision thickimage
$ rbd du
NAME        PROVISIONED  USED 
thickimage        1 GiB  1 GiB
thinimage         1 GiB    0 B
<TOTAL>           2 GiB  1 GiB

$ ceph df
[...]
--- POOLS ---
POOL  ID  PGS  STORED  OBJECTS    USED  %USED  MAX AVAIL
rbd    1    1   247 B      264  20 KiB      0    4.9 GiB

The backing objects for thickimage are still created (as shown by "rbd du") but it is not actually thick (as shown by "ceph df") -- an unlimited number of such images can be created in a cluster with a single 5G block device.

This regression also affects existing thick images: even though users disable discard/trim for such images, the all-zeroes writes of various sizes that occur from time to time in pretty much any system would now slowly thin them.


Related issues

Copied to bluestore - Backport #55811: quincy: bluestore zero block detection makes all-zeroes writes deallocate space Resolved

History

#1 Updated by Laura Flores 9 months ago

  • Assignee set to Laura Flores

#2 Updated by Laura Flores 9 months ago

I think I understand what's missing.

Currently, we punch a hole in the extent map for every bufferlist. Then, if the bufferlist is not zero, we write it to the disk. But if the bufferlist is zero, we omit writing it to the disk. So it looks something like this:

punch_hole(extent_map)
if (bufferlist is not zero)
   write_to_disk(bufferlist)
else
   do nothing

To follow fallocate()'s FALLOC_FL_ZERO_RANGE, it should be implemented like this:

if (bufferlist is not zero) {
   punch_hole(extent_map)
   write_to_disk(bufferlist)
} else {
   flag_for_use(extent_map)
}

Does that sound right to you, Ilya? I will also run this by Adam Kupczyk since he has a lot of knowledge in this area of the code.
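The branch described above can be sketched in isolation. This is an illustrative C++ toy (the function and type names are hypothetical, not BlueStore's; in Ceph proper the zero check is bufferlist::is_zero()), showing why an unconditional punch_hole() followed by a skipped write leaves the range deallocated:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for bufferlist::is_zero(): scan the incoming
// buffer and report whether every byte is zero.
static bool is_all_zero(const std::vector<uint8_t>& buf) {
  return std::all_of(buf.begin(), buf.end(),
                     [](uint8_t b) { return b == 0; });
}

// Current (problematic) flow: the hole is punched unconditionally, so
// an all-zero buffer leaves the range deallocated instead of zeroed.
enum class WriteAction { PunchAndWrite, PunchAndSkip };

static WriteAction plan_write(const std::vector<uint8_t>& buf) {
  return is_all_zero(buf) ? WriteAction::PunchAndSkip
                          : WriteAction::PunchAndWrite;
}
```

Under this model, a thick-provision pass of explicit zeroes maps entirely to PunchAndSkip, i.e. no space is ever allocated.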

#3 Updated by Neha Ojha 9 months ago

  • Project changed from RADOS to bluestore
  • Category deleted (Correctness/Safety)

#4 Updated by Ilya Dryomov 9 months ago

Hi Laura,

To follow fallocate()'s FALLOC_FL_ZERO_RANGE, it should be implemented like this:

if (bufferlist is not zero) {
   punch_hole(extent_map)
   write_to_disk(bufferlist)
} else {
   flag_for_use(extent_map)
}

Make sure to account for the case where the bufferlist is zero and there is already some (most likely non-zero) data allocated in that range. Zeroes should always be returned after the skipped write, so you would probably want to punch the hole in that case too and then flag as allocated-but-unwritten.
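This refinement can be modeled with a toy extent-state table. Everything here (the enum, apply_write) is hypothetical and only meant to show the allocated-but-unwritten outcome, not BlueStore's real extent map:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: an all-zeroes write over previously written data still
// drops the old data (so reads return zeros) but keeps the range
// accounted as allocated-but-unwritten rather than as a hole.
enum class BlockState { Hole, Written, ZeroedAllocated };

static void apply_write(std::vector<BlockState>& map, std::size_t idx,
                        bool all_zero) {
  if (!all_zero) {
    map[idx] = BlockState::Written;          // punch hole + write data
  } else {
    map[idx] = BlockState::ZeroedAllocated;  // punch hole, keep allocation
  }
}
```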

#5 Updated by Laura Flores 9 months ago

  • Project changed from bluestore to RADOS
  • Severity deleted (3 - minor)

Ilya Dryomov wrote:

Make sure to account for the case where the bufferlist is zero and there is already some (most likely non-zero) data allocated in that range. Zeroes should always be returned after the skipped write, so you would probably want to punch the hole in that case too and then flag as allocated-but-unwritten.

That makes sense. Working on a fix on my end.

#6 Updated by Laura Flores 9 months ago

  • Status changed from New to In Progress

#7 Updated by Laura Flores 9 months ago

  • Project changed from RADOS to bluestore
  • Severity set to 3 - minor

Not sure why those fields got changed. Setting them back to what Neha specified in note #3.

#8 Updated by Neha Ojha 9 months ago

  • Backport set to quincy

#9 Updated by Laura Flores 9 months ago

@Ilya can we have this fix ready for the next Quincy point release, or do you need the fix now? This will help decide whether a revert is warranted, or if we can simply include a note in the current Quincy release notes saying that this is a known issue, and that it will be addressed in the next point release.

#10 Updated by Ilya Dryomov 9 months ago

Whether a fix or a revert, it's not going to reach users until the first point release, so either is fine with me. If the fix isn't ready in time for the first point release, I would expect a revert.

#11 Updated by Adam Kupczyk 9 months ago

PR https://github.com/ceph/ceph/pull/43337 was based on the assumption that BlueStore's internal data representation is opaque.
For the client (the OSD is ObjectStore's client) it was irrelevant how the data was stored, as long as requests were handled correctly.
The stats retrieved from the store reflected its internal state.

Now it seems we want it to be a bit different:
we expect an action to have results outside the ObjectStore contract.

The expected new behavior is:
- when writing zeroes with the hint "PROVISIONING", do not write the data, but allocate disk space and mark the region as "ZEROS"

The intended result is:
- stats are properly tracked, so that free space is reduced when thick-provisioned objects are placed

I think that to make this consistent we must take care of:
1) Moving a thick-provisioned object to another OSD must preserve thick-provisioned behavior
2) Compression of objects created by thick-provisioning must be disabled
3) Cloning thick-provisioned regions ("ZEROS") should actually re-allocate

Please note that when writing to objects BlueStore does not reuse already allocated space but
always allocates new space and releases the previous one. This means that thick-provisioned space will never actually be used.

#12 Updated by Ilya Dryomov 9 months ago

Adam Kupczyk wrote:

PR https://github.com/ceph/ceph/pull/43337 was based on the assumption that BlueStore's internal data representation is opaque.
For the client (the OSD is ObjectStore's client) it was irrelevant how the data was stored, as long as requests were handled correctly.
The stats retrieved from the store reflected its internal state.

Now it seems we want it to be a bit different:
we expect an action to have results outside the ObjectStore contract.

Hi Adam,

Yeah, I think part of the problem is that it's not clear what that contract is, and when compression was introduced it muddied the waters even further. The librados API, which is the interface the end user is presented with, doesn't document these semantics at all. At lower layers, but still above the actual backing store (CEPH_OSD_OP ops, PGTransaction, Transaction, etc.), there are some scattered comments that actually directed the implementation of "rbd create --thick-provision" (in Mimic, IIRC):

  /**
   * Write data to an offset within an object. If the object is too
   * small, it is expanded as needed.  It is possible to specify an
   * offset beyond the current end of an object and it will be
   * expanded as needed. Simple implementations of ObjectStore will
   * just zero the data between the old end of the object and the
   * newly provided data. More sophisticated implementations of
   * ObjectStore will omit the untouched data and store it as a
   * "hole" in the file.
   *
   * Note that a 0-length write does not affect the size of the object.
   */
  void write(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len,
               const ceph::buffer::list& write_data, uint32_t flags = 0) {
  /**
   * zero out the indicated byte range within an object. Some
   * ObjectStore instances may optimize this to release the
   * underlying storage space.
   *
   * If the zero range extends beyond the end of the object, the object
   * size is extended, just as if we were writing a buffer full of zeros.
   * EXCEPT if the length is 0, in which case (just like a 0-length write)
   * we do not adjust the object size.
   */
  void zero(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len) {

The possibility of releasing the underlying storage space is noted for OP_ZERO but not for OP_WRITE. This is why librbd happily uses OP_ZERO for discard/TRIM and sticks with OP_WRITE for explicit zeroing.

In general, most of these deeply ingrained assumptions go back to FileStore days. Note that some of these comments still talk about files ;)

Backing store compression messes with a lot of such assumptions but, for "rbd create --thick-provision", I think it is relatively well known that it falls over in that case. The new zero block detection is much worse though, as it is enabled by default and there is no configuration option to disable it.

The expected new behavior is:
- when writing zeroes with the hint "PROVISIONING", do not write the data, but allocate disk space and mark the region as "ZEROS"

Currently there is no PROVISIONING hint. Do you intend to introduce one?

The intended result is:
- stats are properly tracked, so that free space is reduced when thick-provisioned objects are placed

I think that to make this consistent we must take care of:
1) Moving a thick-provisioned object to another OSD must preserve thick-provisioned behavior

Wasn't that true prior to https://github.com/ceph/ceph/pull/43337? IIRC it is based on ObjectStore::fiemap which, prior to https://github.com/ceph/ceph/pull/43337, returned a full-sized extent for a bunch of explicitly written zeroes. Or am I misremembering?

2) Compression of objects created by thick-provisioning must be disabled
3) Cloning thick-provisioned regions ("ZEROS") should actually re-allocate

IMO cloning in general shouldn't change anything about the cloned region. If it is compressed, it should stay that way. If it isn't compressed, it should stay that way. If it is allocated-but-unwritten (what you refer to as "ZEROS" above), it should stay that way. Finally, if it is a bunch of explicitly written zeroes, it should stay that way too.

Please note that when writing to objects BlueStore does not reuse already allocated space but
always allocates new space and releases the previous one. This means that thick-provisioned space will never actually be used.

Yup, this is understood -- and, as long as the previously allocated space is "carried over" (i.e. the same amount of space gets allocated in the new location), it is the expected behavior.

#13 Updated by Laura Flores 9 months ago

@Ilya just had a chat with Adam, and we think the best immediate fix is to implement a global option that toggles the feature, off by default. This way, we could keep the feature in the codebase and build on it to meet RBD requirements. Let me know what you think.
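To visualize the proposal, the toggle would presumably look like any other BlueStore boolean option. A ceph.conf sketch, with the caveat that the option name below is illustrative, not a confirmed name from the fix:

```ini
# Hypothetical ceph.conf fragment -- the option name is illustrative;
# the actual name is whatever the fix PR introduces.
[osd]
# Opt back in to zero block detection on clusters that want the
# space-saving behavior; it would default to false (off).
bluestore_zero_block_detection = true
```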

#14 Updated by Laura Flores 9 months ago

  • % Done changed from 0 to 90
  • Pull request ID set to 46193

#15 Updated by Laura Flores 9 months ago

  • Status changed from In Progress to Fix Under Review
  • % Done changed from 90 to 100

#16 Updated by Laura Flores 8 months ago

  • Status changed from Fix Under Review to Pending Backport

#17 Updated by Laura Flores 8 months ago

  • Copied to Backport #55811: quincy: bluestore zero block detection makes all-zeroes writes deallocate space added

#18 Updated by Laura Flores 8 months ago

  • Status changed from Pending Backport to Resolved
