Ceph RADOS - Support #36614: Cluster uses substantially more space after rebalance (erasure codes)
https://tracker.ceph.com/issues/36614

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-10-29T11:11:42Z:
<ul><li><strong>File</strong> <a href="/attachments/download/3790/photo_2018-10-29_14-10-38.jpg">photo_2018-10-29_14-10-38.jpg</a> <a class="icon-only icon-magnifier" title="View" href="/attachments/3790/photo_2018-10-29_14-10-38.jpg">View</a> added</li><li><strong>File</strong> <a href="/attachments/download/3791/photo_2018-10-29_14-10-43.jpg">photo_2018-10-29_14-10-43.jpg</a> <a class="icon-only icon-magnifier" title="View" href="/attachments/3791/photo_2018-10-29_14-10-43.jpg">View</a> added</li></ul><p>Evidence from our Prometheus monitoring. Two graphs from yesterday: one showing the number of objects in the cluster and the other showing used capacity.</p>

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-10-29T12:12:24Z:
<ul></ul><p>ceph df output:</p>
<pre>
[root@sill-01 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
38 TiB 6.9 TiB 32 TiB 82.03
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
ecpool_hdd 13 16 TiB 93.94 1.0 TiB 7611672
rpool_hdd 15 9.2 MiB 0 515 GiB 92
fs_meta 44 20 KiB 0 515 GiB 23
fs_data 45 0 B 0 1.0 TiB 0
</pre>

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-10-29T12:13:57Z:
<ul></ul><p>How can I heal it? If I can't heal it, will I need to purge the whole cluster? O_o...</p>

Updated by Greg Farnum (gfarnum@redhat.com) on 2018-10-29T20:37:21Z:
<ul><li><strong>Tracker</strong> changed from <i>Bug</i> to <i>Support</i></li><li><strong>Status</strong> changed from <i>New</i> to <i>Closed</i></li></ul><p>The mailing list is a better place to resolve this. My guess is data hasn't been cleaned up from its old locations yet, but <strong>shrug</strong>.</p>

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-10-29T22:33:09Z:
<ul></ul><p>Thanks for the response. I wrote to the ceph-users mailing list (is that the correct place?) :)</p>

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-10-30T10:31:56Z:
<ul></ul><p>In fact it doesn't seem that it will self-heal, and so far nobody on the mailing list seems to care about it...</p>
<p>Currently I have NO PGs that are in the process of moving. All are active+clean. So they shouldn't use extra space. But it seems they do...</p>
<p>As I understand it, if I remove all pg-upmaps and let backfill finish, the cluster will simply eat all the remaining space and stop. If that's not a bug... then I misunderstand the concept of a bug :)</p>
<p>It would be fine if some kind of "garbage collection" kicked in at some point, but that doesn't seem to happen. I searched the documentation and code for mentions of anything similar and only found the bluestore_gc_* options, which appear to be relevant only for compressed pools.</p>

Updated by Ben England (bengland@redhat.com) on 2018-10-30T13:05:01Z:
<ul></ul><p>How are you writing these objects? Most sites that use EC are using RGW, but I don't see all the pools that go with RGW in ceph df, so I'm guessing you are using EC with RBD. The CephFS pools look empty. How big are the total RBD volume sizes you created, and have you overcommitted the RBD volumes? RBD volumes are thinly provisioned (space is allocated not when you create the volume, but only when you write to it), so is it possible that existing RBD volumes are simply being written to and this is using more space? It appears that the RADOS objects are approximately 2 MB in size on average.</p>

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-10-30T15:33:17Z:
<ul></ul><p>Yes, I'm using EC with RBD and partial overwrites enabled. The CephFS pools were only created recently for tests and don't hold any data.</p>
<p>The RBDs are thin-provisioned; the biggest ones are a ~14 TB base image and a ~2 TB base image, each with several clones and snapshots. The intended usage is to put a big DB in Ceph, create clones, run tests on the clones, then either discard failed clones or merge good clones back into the base image. I patched rbd export-diff to allow exporting clone diffs without parent data for that; it works, but it's not yet integrated into our CI scripts.</p>
<p>RADOS objects are 4 MB (we didn't change the RBD default), but they are probably split into 2 parts by EC...</p>
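The arithmetic behind that split can be sketched as follows. The pool's EC profile is not shown in the ticket, so k=2, m=2 is an assumption; it is chosen only because it matches both the ~2 MB average object size noted above and the 2x ratio between the 32 TiB raw used and 16 TiB pool used in the ceph df output:

```python
# Sketch: how a 4 MiB RBD object maps onto EC chunks.
# k=2, m=2 is an ASSUMED profile (not confirmed in the ticket),
# picked because it fits the observed numbers.
MiB = 1024 * 1024
TiB = 1024 ** 4

k, m = 2, 2                      # assumed data / parity chunk counts
object_size = 4 * MiB            # default RBD object size (order 22)

chunk_size = object_size // k    # data chunk stored on each of the k OSDs
overhead = (k + m) / k           # raw bytes consumed per logical byte

print(chunk_size // MiB)         # 2 (MiB per chunk, matching the ~2 MB average)
print(overhead)                  # 2.0

# Cross-check against ceph df: 16 TiB logical in ecpool_hdd at 2.0x
# overhead should consume ~32 TiB raw, which is what GLOBAL shows.
print(16 * TiB * overhead / TiB) # 32.0
```

If the real profile differs (say k=4, m=2), the chunk size and overhead change accordingly, so this is a consistency check rather than a conclusion.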
<p>As I understand it, new RADOS objects are created when an RBD image (or even a clone) is written to. So clients writing +1.3 TB should be noticeable on the graph (see the attached picture). But according to the graph, clients have only written +200 objects in that time...</p>
<p>Even if we suppose that Bluestore is smart enough to share some data between objects of parent and child RBD images at the OSD level (although that looks very nontrivial to me), and that this connection breaks during rebalance so the "COW" clones stop really being "COW"... I think even that shouldn't lead to a +1.3 TB storage increase, because I recently re-imported the clones into RBD using my patched rbd export-diff/import-diff and they occupied only ~100 GB in total.</p>
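The copy-on-write behaviour being discussed can be illustrated with a toy model (this is a sketch of the RBD clone concept, not Ceph code): a clone stores only the objects written to it after cloning and falls through to the parent snapshot for everything else, which is why an unwritten clone should consume almost no data space.

```python
# Toy model of RBD-style copy-on-write cloning (illustrative only,
# not Ceph's implementation). A clone holds only objects written
# after cloning; reads of untouched objects fall through to the
# parent, so a fresh clone consumes ~no data space.

class Image:
    def __init__(self, parent=None):
        self.objects = {}        # object index -> bytes actually stored here
        self.parent = parent

    def write(self, idx, data):
        # A first write to an object triggers "copy-up" in real RBD;
        # in this toy model we simply store the new data in the clone.
        self.objects[idx] = data

    def read(self, idx):
        if idx in self.objects:
            return self.objects[idx]
        if self.parent is not None:
            return self.parent.read(idx)
        return b"\0"             # unallocated regions read as zeros

    def stored_objects(self):
        return len(self.objects)

base = Image()
for i in range(4):
    base.write(i, b"base-%d" % i)

clone = Image(parent=base)
print(clone.stored_objects())    # 0: a fresh clone stores nothing
print(clone.read(2))             # b'base-2': read falls through to parent

clone.write(2, b"clone-2")       # only now does the clone use space
print(clone.stored_objects())    # 1
print(base.read(2))              # b'base-2': parent is untouched
```

In this model, as in the cluster described above, writes to clones alone cannot explain new space appearing under the parent image's own prefix.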
<p>By the way, 1.3 TB was Sunday's number; Monday's is +500 GB more. I.e. roughly the same amount of data as on Sunday morning now uses 1.8 TB more raw storage...</p>

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-11-02T14:36:33Z:
<ul></ul><p>OK, I looked into an OSD's datastore using ceph-objectstore-tool and I see that for almost every object there are two copies, like:</p>
<pre>
2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.0000000000361a96:28#
2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.0000000000361a96:head#
</pre>
<p>More interestingly, these two copies are identical (!).</p>
<p>So the space is taken up by unneeded snapshot copies.</p>
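The names in the listing above can be picked apart mechanically. A hedged sketch (the real ghobject name grammar has more fields than this handles; the split below only targets the two names shown): the field before the final "#" is the clone/snap id, printed in hex, so ":28" is decimal 40, while ":head" marks the live object.

```python
# Sketch: decode the snap field of the two object names listed above.
# This only handles the simple shape shown here; real ghobject names
# also encode shard, pool, hash, namespace and key.

def snap_field(ghobject_name):
    """Return the snap/clone field of a ghobject-style name."""
    body = ghobject_name.rstrip("#")          # drop the trailing '#'
    return body.rsplit(":", 1)[1]             # field after the last ':'

head_name = "2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.0000000000361a96:head#"
clone_name = "2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.0000000000361a96:28#"

print(snap_field(head_name))            # head -> the live (HEAD) object
print(snap_field(clone_name))           # 28
print(int(snap_field(clone_name), 16))  # 40: the snap id, printed in hex
```

This is the same hex-to-decimal reading the thread arrives at below (0x28 = 40, not 37).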
<p>rbd_data.15.3d3e1d6b8b4567 is the prefix of the biggest (14 TB) base image we have. This image has 1 snapshot:</p>
<pre>
[root@sill-01 ~]# rbd info rpool_hdd/rms-201807-golden
rbd image 'rms-201807-golden':
size 14 TiB in 3670016 objects
order 22 (4 MiB objects)
id: 3d3e1d6b8b4567
data_pool: ecpool_hdd
block_name_prefix: rbd_data.15.3d3e1d6b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
op_features:
flags:
create_timestamp: Tue Aug 7 13:00:10 2018
[root@sill-01 ~]# rbd snap ls rpool_hdd/rms-201807-golden
SNAPID NAME SIZE TIMESTAMP
37 initial 14 TiB Tue Aug 14 12:42:48 2018
</pre>
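The numbers in the rbd info output above are internally consistent and can be checked directly: order 22 means 2^22-byte (4 MiB) objects, and a 14 TiB image therefore maps to exactly the reported 3670016 RADOS objects.

```python
# Sanity check of the "rbd info" output above.
order = 22
object_size = 2 ** order         # bytes per RADOS object
image_size = 14 * 1024 ** 4      # 14 TiB

print(object_size)                   # 4194304 (4 MiB)
print(image_size // object_size)     # 3670016, matching "size 14 TiB in 3670016 objects"
```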
<p>The problem is this image has NEVER been written to after importing it to Ceph with RBD. All writes go only to its clones.</p>
<p>So I have 2 questions:</p>
<p>1) Why is the base image's snapshot "provisioned" when the image is never written to? Could it be related to `rbd snap revert`? (i.e. does rbd snap revert simply copy all snapshot data into the image itself?)</p>
<p>2) If parent snapshots really are forcibly provisioned on write: is there a way to disable this behaviour? Maybe if I make the base image read-only, its snapshots will stop being "provisioned"?</p>
<p>3) Even if there is no way to disable it: why does Ceph create an extra copy of identical snapshot data during rebalance?</p>
<p>4) What's ":28" in rados objects? Snapshot id is 37. Even in hex 0x28 = 40, not 37. Or does RADOS snapshot id not need to be equal to RBD snapshot ID?</p>
<p>5) Is it safe to "unprovision" the snapshot (for example, by doing `rbd snap revert`)?</p>

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-11-02T14:37:25Z:
<ul></ul><p>Oops. That's more than 2 questions. But anyway :)</p>

Updated by Ben England (bengland@redhat.com) on 2018-11-02T16:57:37Z:
<ul><li><strong>Project</strong> changed from <i>Ceph</i> to <i>rbd</i></li></ul><p>Since you've identified that this is an RBD workload, I'm assigning it to that project so that the RBD team notices it. HTH.</p>

Updated by Jason Dillaman (dillaman@redhat.com) on 2018-11-02T17:14:07Z:
<ul><li><strong>Project</strong> changed from <i>rbd</i> to <i>RADOS</i></li></ul><p>Back-and-forth question answering like this is probably better suited to the mailing list (FYI, the ticket is currently closed).</p>
<p>Also moving this back to RADOS since it's not really related to RBD.</p>
<p>However, ...</p>
<p>Vitaliy Filippov wrote:</p>
<blockquote>
<p>OK, I looked into OSD datastore using ceph-objectstore-tool and I see that for almost every object there are two copies, like:</p>
<p>[...]</p>
<p>And more interesting is the fact that these two copies don't differ (!).</p>
<p>So the space is taken up by unneeded snapshot copies.</p>
<p>rbd_data.15.3d3e1d6b8b4567 is the prefix of the biggest (14 TB) base image we have. This image has 1 snapshot:</p>
<p>[...]</p>
<p>The problem is this image has NEVER been written to after importing it to Ceph with RBD. All writes go only to its clones.</p>
<p>So I have 2 questions:</p>
<p>1) Why base image snapshot is "provisioned" while the image isn't written to? May it be related to `rbd snap revert`? (i.e. does rbd snap revert just copy all snapshot data into the image itself?)</p>
</blockquote>
<p>If you run "rbd snap revert", you will copy all the data from the snapshot to the HEAD version.</p>
<blockquote>
<p>2) If all parent snapshots seem to be forcefully provisioned on write: Is there a way to disable this behaviour? Maybe if I make the base image readonly its snapshots will stop to be "provisioned"?</p>
</blockquote>
<p>Not sure what you are referring to here.</p>
<blockquote>
<p>3) Even if there is no way to disable it: why does Ceph create extra copy of equal snapshot data during rebalance?</p>
</blockquote>
<p>I suspect it shouldn't.</p>
<blockquote>
<p>4) What's ":28" in rados objects? Snapshot id is 37. Even in hex 0x28 = 40, not 37. Or does RADOS snapshot id not need to be equal to RBD snapshot ID?</p>
</blockquote>
<p>Did you previously delete a snapshot on that image with a snapshot id of 40?</p>
<blockquote>
<p>5) Am I safe to "unprovision" the snapshot? (for example, by doing `rbd snap revert`?)</p>
</blockquote>
<p>That will only re-copy the data to the HEAD revision.</p>

Updated by Vitaliy Filippov (vitalif@yourcmc.ru) on 2018-11-05T13:30:48Z:
<ul></ul><blockquote>
<p>I suspect it shouldn't.</p>
</blockquote>
<p>But it does exactly that.</p>
<blockquote>
<p>That will only re-copy the data to the HEAD revision.</p>
</blockquote>
<p>And it seems it provisions the whole image, even regions that were never written to.</p>
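One way to watch for this kind of unexpected provisioning is `rbd du`, which reports provisioned versus actually-used size per image and snapshot. A sketch that flags fully provisioned snapshots from its JSON output follows; the numbers are made-up sample data, and the JSON field names (images[].name / .snapshot / .provisioned_size / .used_size) are an assumption about the `rbd du --format json` output shape that should be verified against your Ceph version:

```python
# Sketch: flag snapshots whose used_size approaches provisioned_size,
# i.e. snapshots that have become fully "provisioned" like the one
# described in this thread. Sample data is made up; the JSON field
# names are assumed from "rbd du --format json" and should be checked.
import json

sample = json.loads("""
{"images": [
  {"name": "rms-201807-golden",
   "provisioned_size": 15393162788864, "used_size": 109951162778},
  {"name": "rms-201807-golden", "snapshot": "initial",
   "provisioned_size": 15393162788864, "used_size": 15393162788864}
]}
""")

for img in sample["images"]:
    label = img["name"] + ("@" + img["snapshot"] if "snapshot" in img else "")
    ratio = img["used_size"] / img["provisioned_size"]
    if ratio > 0.9:
        # A snapshot at ~100% usage has been fully provisioned.
        print(f"fully provisioned: {label} ({ratio:.0%})")
```

Run against real `rbd du` output, this would make a regression like the one reported here show up as soon as a supposedly sparse snapshot's used size jumps toward its provisioned size.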
<p>I'll try to reproduce both and file a more concrete bug.</p>