Discarding / unmapping / trimming blocks on an image that has snapshots INCREASES disk usage instead of reducing it.
How to reproduce:
1. Create an image, attach it to a VM and install any OS. Ensure that KVM uses the discard feature and the virtio-scsi driver.
2. Run `fstrim --all` in the guest OS and ensure that disk usage is reduced in Ceph (e.g. via the `rbd du` command).
3. Create a snapshot of the RBD image.
4. Remove some files inside the guest.
5. Run `fstrim --all` again.
6. See that the amount of used space actually increases for that image!
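Steps 2 and 6 come down to comparing `rbd du` output before and after the trim. Below is a minimal Python sketch of that comparison. It assumes that `rbd du --format json` reports a top-level `total_used_size` field and per-row `used_size` fields, which should be verified against the Ceph release in use. The helper names are hypothetical and not part of any Ceph tooling.

```python
import json
import subprocess


def used_bytes(du_json: str) -> int:
    """Return total used bytes from `rbd du --format json` output.

    Assumed layout: a top-level `total_used_size`, or an `images` list
    whose rows (HEAD and snapshots) each carry a `used_size`.
    """
    report = json.loads(du_json)
    if "total_used_size" in report:
        return int(report["total_used_size"])
    return sum(int(row.get("used_size", 0)) for row in report.get("images", []))


def rbd_du(image_spec: str) -> str:
    """Run `rbd du` for a pool/image spec and return its JSON output."""
    result = subprocess.run(
        ["rbd", "du", "--format", "json", image_spec],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def usage_delta(before_json: str, after_json: str) -> int:
    """Positive result means usage grew between the two measurements."""
    return used_bytes(after_json) - used_bytes(before_json)
```

To use it, capture `rbd_du("pool/image")` (a placeholder spec) after step 3 and again after step 5. A positive `usage_delta` corresponds to the behaviour reported here.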
I suspect this happens because discarding a region on an image that has a snapshot writes (allocates) zeroes instead of unmapping the region.
There are three possible solutions (we should choose one of them):
1. Just decrement the reference count on that region and unmap it from the RBD image, exactly as happens for an image without snapshots. I don't understand why this is not already done. (Preferred)
2. Just remove any records about that region, so that it refers to the original region of the base image (if it was allocated before the snapshot was taken). In that case, reading from the discarded region will return old data, which is allowed for SSDs, for example. (Very simple, but unfriendly for some uses, see comments below)
3. Introduce a `whiteout` flag in RBD metadata for the case when a region that was allocated in the base image is discarded.
AFAIK, there is a flag for SCSI devices that specifies whether discarded regions return zeroes when read.
This happens with Kraken on the server and Jewel on the client.
#2 Updated by Марк Коренберг almost 6 years ago
Also found this: https://www.spinics.net/lists/ceph-devel/msg30903.html
#4 Updated by Jason Dillaman almost 6 years ago
- Status changed from New to Need More Info
@Марк: can you set "rbd skip partial discard = true" in your hypervisor host's ceph.conf, configure QEMU's discard granularity to match the backing RBD image's object size via "discard_granularity=XYZ", and retest? Your suggestions don't really map to how Ceph/RBD is actually architected, but if you discard a full backing object (4MB by default), zeroes won't be written and the backing object will be deleted / unreferenced at the HEAD revision of the object.
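To illustrate why discard granularity matters here, the Python sketch below splits a discard range into backing objects and reports which ones are covered in full (and so can be deleted / unreferenced at HEAD) and which are only partly covered (where zeroes may be written, or the discard skipped when "rbd skip partial discard" is enabled). It is a simplified model under the 4MB default object size mentioned above, not librbd's actual code path, and the function name is hypothetical.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size (4 MiB), as stated above


def classify_discard(offset, length, object_size=OBJECT_SIZE):
    """Return (full, partial) lists of object indices touched by a discard.

    full:    objects entirely inside [offset, offset + length)
    partial: objects only partly covered by the range
    """
    if length <= 0:
        return [], []
    end = offset + length
    full, partial = [], []
    first = offset // object_size
    last = (end - 1) // object_size
    for idx in range(first, last + 1):
        obj_start = idx * object_size
        obj_end = obj_start + object_size
        if offset <= obj_start and end >= obj_end:
            full.append(idx)      # whole object discarded
        else:
            partial.append(idx)   # only a slice of the object discarded
    return full, partial
```

For example, a discard of 8 MiB starting at offset 1 MiB fully covers only object 1 and partly covers objects 0 and 2. Under this model, aligning QEMU's `discard_granularity` to the object size should leave the `partial` list empty.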