Bug #55521

bluestore zero block detection makes all-zeroes writes deallocate space

Added by Ilya Dryomov 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version: -
% Done: 100%
Source:
Tags:
Backport: quincy
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is completely contrary to the original RFE to have a writesame of zeroes be treated as a request to allocate the specified amount of space but avoid the need to actually zero the space (i.e. track that the extent is in-use but flag it as being zeroed/uninitialized). The main desire there was to optimize "rbd create --thick-provision", which is supposed to allocate space.

In pacific:

$ rbd create --size 1G thinimage
$ rbd create --size 1G --thick-provision thickimage
$ rbd du
NAME        PROVISIONED  USED 
thickimage        1 GiB  1 GiB
thinimage         1 GiB    0 B
<TOTAL>           2 GiB  1 GiB

$ ceph df
[...]
--- POOLS ---
POOL  ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
rbd    1   32  1.0 GiB      265  1.0 GiB  20.24    3.9 GiB

In quincy, as a result of https://github.com/ceph/ceph/pull/43337:

$ rbd create --size 1G thinimage
$ rbd create --size 1G --thick-provision thickimage
$ rbd du
NAME        PROVISIONED  USED 
thickimage        1 GiB  1 GiB
thinimage         1 GiB    0 B
<TOTAL>           2 GiB  1 GiB

$ ceph df
[...]
--- POOLS ---
POOL  ID  PGS  STORED  OBJECTS    USED  %USED  MAX AVAIL
rbd    1    1   247 B      264  20 KiB      0    4.9 GiB

The backing objects for thickimage are still created (as shown by "rbd du") but it is not actually thick (as shown by "ceph df") -- an unlimited number of such images can be created in a cluster with a single 5G block device.

This regression also affects existing thick images: even though users disable discard/trim for such images, the all-zeroes writes of various sizes that occur from time to time in pretty much any system would now slowly thin them.


Related issues

Copied to bluestore - Backport #55811: quincy: bluestore zero block detection makes all-zeroes writes deallocate space Resolved

History

#1 Updated by Laura Flores 9 months ago

  • Assignee set to Laura Flores

#2 Updated by Laura Flores 9 months ago

I think I understand what's missing.

Currently, we punch a hole in the extent map for every bufferlist. Then, if the bufferlist is not zero, we write it to the disk. But if the bufferlist is zero, we omit writing it to the disk. So it looks something like this:

punch_hole(extent_map)
if (bufferlist is not zero)
   write_to_disk(bufferlist)
else
   do nothing

To follow fallocate()'s FALLOC_FL_ZERO_RANGE, it should be implemented like this:

if (bufferlist is not zero) {
   punch_hole(extent_map)
   write_to_disk(bufferlist)
} else {
   flag_for_use(extent_map)
}

Does that sound right to you, Ilya? I will also run this by Adam Kupczyk since he has a lot of knowledge in this area of the code.
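The branch described above can be sketched in isolation. This is an illustrative C++ toy (the function and type names are hypothetical, not BlueStore's; in Ceph proper the zero check is bufferlist::is_zero()), showing why an unconditional punch_hole() followed by a skipped write leaves the range deallocated:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for bufferlist::is_zero(): scan the incoming
// buffer and report whether every byte is zero.
static bool is_all_zero(const std::vector<uint8_t>& buf) {
  return std::all_of(buf.begin(), buf.end(),
                     [](uint8_t b) { return b == 0; });
}

// Current (problematic) flow: the hole is punched unconditionally, so
// an all-zero buffer leaves the range deallocated instead of zeroed.
enum class WriteAction { PunchAndWrite, PunchAndSkip };

static WriteAction plan_write(const std::vector<uint8_t>& buf) {
  return is_all_zero(buf) ? WriteAction::PunchAndSkip
                          : WriteAction::PunchAndWrite;
}
```

Under this model, a thick-provision pass of explicit zeroes maps entirely to PunchAndSkip, i.e. no space is ever allocated.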

#3 Updated by Neha Ojha 9 months ago

  • Project changed from RADOS to bluestore
  • Category deleted (Correctness/Safety)

#4 Updated by Ilya Dryomov 9 months ago

Hi Laura,

To follow fallocate()'s FALLOC_FL_ZERO_RANGE, it should be implemented like this:

if (bufferlist is not zero) {
   punch_hole(extent_map)
   write_to_disk(bufferlist)
} else {
   flag_for_use(extent_map)
}

Make sure to account for the case where the bufferlist is zero and there is already some (most likely non-zero) data allocated in that range. Zeroes should always be returned after the skipped write, so you would probably want to punch the hole in that case too and then flag as allocated-but-unwritten.
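This refinement can be modeled with a toy extent-state table. Everything here (the enum, apply_write) is hypothetical and only meant to show the allocated-but-unwritten outcome, not BlueStore's real extent map:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: an all-zeroes write over previously written data still
// drops the old data (so reads return zeros) but keeps the range
// accounted as allocated-but-unwritten rather than as a hole.
enum class BlockState { Hole, Written, ZeroedAllocated };

static void apply_write(std::vector<BlockState>& map, std::size_t idx,
                        bool all_zero) {
  if (!all_zero) {
    map[idx] = BlockState::Written;          // punch hole + write data
  } else {
    map[idx] = BlockState::ZeroedAllocated;  // punch hole, keep allocation
  }
}
```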

#5 Updated by Laura Flores 9 months ago

  • Project changed from bluestore to RADOS
  • Severity deleted (3 - minor)

Ilya Dryomov wrote:

Make sure to account for the case where the bufferlist is zero and there is already some (most likely non-zero) data allocated in that range. Zeroes should always be returned after the skipped write, so you would probably want to punch the hole in that case too and then flag as allocated-but-unwritten.

That makes sense. Working on a fix on my end.

#6 Updated by Laura Flores 9 months ago

  • Status changed from New to In Progress

#7 Updated by Laura Flores 9 months ago

  • Project changed from RADOS to bluestore
  • Severity set to 3 - minor

Not sure why those fields got changed. Setting them back to what Neha specified in note #3.

#8 Updated by Neha Ojha 9 months ago

  • Backport set to quincy

#9 Updated by Laura Flores 9 months ago

@Ilya can we have this fix ready for the next Quincy point release, or do you need the fix now? This will help decide whether a revert is warranted, or if we can simply include a note in the current Quincy release notes saying that this is a known issue, and that it will be addressed in the next point release.

#10 Updated by Ilya Dryomov 9 months ago

Whether a fix or a revert, it's not going to reach users until the first point release, so either is fine with me. If the fix isn't ready in time for the first point release, I would expect a revert.

#11 Updated by Adam Kupczyk 9 months ago

PR https://github.com/ceph/ceph/pull/43337 was based on the assumption that BlueStore's internal data representation is opaque.
For the client (the OSD is ObjectStore's client) it was irrelevant how the data was stored, as long as requests were handled correctly.
The stats retrieved from the store reflected its internal state.

Now it seems we want it to be a bit different:
we expect an action to have results outside the ObjectStore contract.

The expected new behavior is:
- when writing zeroes with the hint "PROVISIONING", do not write the data, but allocate disk space and mark the region as "ZEROS"

The intended result is:
- stats are properly tracked, so that free space is reduced when thick-provisioned objects are placed

I think that to make this consistent we must take care of:
1) Moving a thick-provisioned object to another OSD must preserve thick-provisioned behavior
2) Compression of objects created by thick-provisioning must be disabled
3) Cloning thick-provisioned regions ("ZEROS") should actually re-allocate

Please note that when writing to objects BlueStore does not reuse already allocated space but
always allocates new space and releases the previous one. This means that thick-provisioned space will never actually be used.

#12 Updated by Ilya Dryomov 9 months ago

Adam Kupczyk wrote:

PR https://github.com/ceph/ceph/pull/43337 was based on the assumption that BlueStore's internal data representation is opaque.
For the client (the OSD is ObjectStore's client) it was irrelevant how the data was stored, as long as requests were handled correctly.
The stats retrieved from the store reflected its internal state.

Now it seems we want it to be a bit different:
we expect an action to have results outside the ObjectStore contract.

Hi Adam,

Yeah, I think part of the problem is that it's not clear what that contract is, and when compression was introduced it muddied the waters even further. The librados API, which is the interface the end user is presented with, doesn't document these semantics at all. At lower layers, but still above the actual backing store (CEPH_OSD_OP ops, PGTransaction, Transaction, etc.), there are some scattered comments that actually directed the implementation of "rbd create --thick-provision" (in Mimic, IIRC):

  /**
   * Write data to an offset within an object. If the object is too
   * small, it is expanded as needed.  It is possible to specify an
   * offset beyond the current end of an object and it will be
   * expanded as needed. Simple implementations of ObjectStore will
   * just zero the data between the old end of the object and the
   * newly provided data. More sophisticated implementations of
   * ObjectStore will omit the untouched data and store it as a
   * "hole" in the file.
   *
   * Note that a 0-length write does not affect the size of the object.
   */
  void write(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len,
               const ceph::buffer::list& write_data, uint32_t flags = 0) {
  /**
   * zero out the indicated byte range within an object. Some
   * ObjectStore instances may optimize this to release the
   * underlying storage space.
   *
   * If the zero range extends beyond the end of the object, the object
   * size is extended, just as if we were writing a buffer full of zeros.
   * EXCEPT if the length is 0, in which case (just like a 0-length write)
   * we do not adjust the object size.
   */
  void zero(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len) {

The possibility of releasing the underlying storage space is noted for OP_ZERO but not for OP_WRITE. This is why librbd happily uses OP_ZERO for discard/TRIM and sticks with OP_WRITE for explicit zeroing.

In general, most of these deeply ingrained assumptions go back to FileStore days. Note that some of these comments still talk about files ;)

Backing store compression messes with a lot of such assumptions but, for "rbd create --thick-provision", I think it is relatively well known that it falls over in that case. The new zero block detection is much worse though, as it is enabled by default and there is no configuration option to disable it.

The expected new behavior is:
- when writing zeroes with the hint "PROVISIONING", do not write the data, but allocate disk space and mark the region as "ZEROS"

Currently there is no PROVISIONING hint. Do you intend to introduce one?

The intended result is:
- stats are properly tracked, so that free space is reduced when thick-provisioned objects are placed

I think that to make this consistent we must take care of:
1) Moving a thick-provisioned object to another OSD must preserve thick-provisioned behavior

Wasn't that true prior to https://github.com/ceph/ceph/pull/43337? IIRC it is based on ObjectStore::fiemap which, prior to https://github.com/ceph/ceph/pull/43337, returned a full-sized extent for a bunch of explicitly written zeroes. Or am I misremembering?

2) Compression of objects created by thick-provisioning must be disabled
3) Cloning thick-provisioned regions ("ZEROS") should actually re-allocate

IMO cloning in general shouldn't change anything about the cloned region. If it is compressed, it should stay that way. If it isn't compressed, it should stay that way. If it is allocated-but-unwritten (what you refer to as "ZEROS" above), it should stay that way. Finally, if it is a bunch of explicitly written zeroes, it should stay that way too.

Please note that when writing to objects BlueStore does not reuse already allocated space but
always allocates new space and releases the previous one. This means that thick-provisioned space will never actually be used.

Yup, this is understood -- and, as long as the previously allocated space is "carried over" (i.e. the same amount of space gets allocated in the new location), it is the expected behavior.

#13 Updated by Laura Flores 9 months ago

@Ilya just had a chat with Adam, and we think the best immediate fix is to implement a global option that toggles the feature, off by default. This way, we could keep the feature in the codebase and build on it to meet RBD requirements. Let me know what you think.
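To visualize the proposal, the toggle would presumably look like any other BlueStore boolean option. A ceph.conf sketch, with the caveat that the option name below is illustrative, not a confirmed name from the fix:

```ini
# Hypothetical ceph.conf fragment -- the option name is illustrative;
# the actual name is whatever the fix PR introduces.
[osd]
# Opt back in to zero block detection on clusters that want the
# space-saving behavior; it would default to false (off).
bluestore_zero_block_detection = true
```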

#14 Updated by Laura Flores 9 months ago

  • % Done changed from 0 to 90
  • Pull request ID set to 46193

#15 Updated by Laura Flores 9 months ago

  • Status changed from In Progress to Fix Under Review
  • % Done changed from 90 to 100

#16 Updated by Laura Flores 8 months ago

  • Status changed from Fix Under Review to Pending Backport

#17 Updated by Laura Flores 8 months ago

  • Copied to Backport #55811: quincy: bluestore zero block detection makes all-zeroes writes deallocate space added

#18 Updated by Laura Flores 8 months ago

  • Status changed from Pending Backport to Resolved
