osd: recovery does not preserve copy-on-write allocations between object clones after 'rbd revert'
Hi. I've already reported this in issue 36614, but here is a more concrete case.
- Start with a bluestore Ceph cluster
- Create an RBD image
- Fill it with data
- Remember disk space used by the image as X
- Create a snapshot of it
- Immediately revert to it (rbd snap revert)
- After the revert finishes, you'll see that X space is still used, but the object count in the cluster has doubled
- Trigger a massive rebalance in the cluster
- After the rebalance finishes, you'll see that the image's objects residing in moved PGs now use 2*X disk space. This is because virtual clones stop being virtual after their data is moved.
- Now run rbd snap revert again
- You'll see the space usage drop. This is because "virtual clones" become "virtual" again.
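The space accounting in the steps above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Ceph or BlueStore code: `ToyOSD`, `revert`, `recover` and the object naming scheme are hypothetical names made up for illustration. It assumes that a clone shares its source's allocation, and that recovery re-creates every object on the target with a fresh allocation of its own.

```python
from itertools import count

MiB = 1 << 20


class ToyOSD:
    """Toy object store (hypothetical, not Ceph code): objects point at extents."""

    def __init__(self):
        self._ids = count()
        self.objects = {}  # object name -> extent id
        self.extents = {}  # extent id -> size in bytes

    def write(self, name, size):
        """Store an object in a freshly allocated extent."""
        ext = next(self._ids)
        self.extents[ext] = size
        self.objects[name] = ext
        self._free_unreferenced()

    def clone(self, src, dst):
        """Copy-on-write clone: dst shares src's extent, no new space is used."""
        self.objects[dst] = self.objects[src]
        self._free_unreferenced()

    def _free_unreferenced(self):
        live = set(self.objects.values())
        self.extents = {e: s for e, s in self.extents.items() if e in live}

    def used(self):
        return sum(self.extents.values())


def revert(osd, snap):
    """Model of snapshot + revert: make sure a snapshot clone exists for every
    head object, then make the head a clone of it (shared allocation)."""
    heads = [n for n in osd.objects if n.endswith(":head")]
    for head in heads:
        snap_clone = head.rsplit(":", 1)[0] + ":" + snap
        if snap_clone not in osd.objects:
            osd.clone(head, snap_clone)
        osd.clone(snap_clone, head)


def recover(src):
    """Assumed naive recovery: each object is copied to the new OSD on its own,
    so allocations that were shared on the source become separate."""
    dst = ToyOSD()
    for name, ext in src.objects.items():
        dst.write(name, src.extents[ext])
    return dst


def run_scenario(num_objects=4, object_size=4 * MiB):
    osd = ToyOSD()
    for i in range(num_objects):
        osd.write(f"obj{i}:head", object_size)
    x = osd.used()                # X: space used by the filled image
    revert(osd, "snap1")
    after_revert = osd.used()     # object count doubles, space stays at X
    moved = recover(osd)
    after_recovery = moved.used() # sharing is lost on the new OSD
    revert(moved, "snap1")
    after_second_revert = moved.used()
    return x, after_revert, after_recovery, after_second_revert


if __name__ == "__main__":
    print([v // MiB for v in run_scenario()])
```

In this model the four values come out as X, X, 2*X and X, matching the observed sequence: the first revert doubles the object count without using more space, recovery gives every object its own allocation, and a second revert makes the head objects share allocations with their clones again.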
I think this is a bug and should be fixed. It once led to a bad situation in our cluster, described in issue 36614.
#2 Updated by Sage Weil over 2 years ago
- Project changed from bluestore to RADOS
- Subject changed from Virtual clones break and begin to eat space after rebalancing to osd: recovery does not preserve copy-on-write allocations between object clones after 'rbd revert'
- Status changed from New to 12
This is indeed the current behavior. The OSD isn't clever enough to preserve the shared allocations across recovery. It is a large effort to change this.