Feature #190

krbd: DISCARD support

Added by Greg Farnum almost 14 years ago. Updated almost 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:

Description

TRIM does exist somewhere in Linux. RBD should support it so that if the client system is using a filesystem that supports it, the image won't use up more disk space than it needs.

History

#1 Updated by Yehuda Sadeh almost 14 years ago

Yeah, it's called 'discard'. In order to get a block device to support it we need to do something like:

queue_flag_set_unlocked(QUEUE_FLAG_DISCARD,
                        tr->blkcore_priv->rq);

and the block device's request handler should check whether a given request is actually a discard using:

if (blk_discard_rq(req))
        return tr->discard(dev, block, nsect);
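
For illustration, here is a minimal, hedged sketch of how an rbd-style driver might wire the same thing up against the block-layer API of that era. This is not the actual krbd code: the rbd_* names, the queuedata usage, and the stubbed helper are hypothetical, while queue_flag_set_unlocked(), blk_queue_max_discard_sectors(), blk_fetch_request(), blk_discard_rq(), blk_rq_pos(), blk_rq_sectors() and __blk_end_request_all() are the block-layer calls referenced above:

#include <linux/kernel.h>
#include <linux/blkdev.h>

/* Hypothetical helper: would translate the discarded range into zero or
 * truncate operations (e.g. CEPH_OSD_OP_ZERO) on the backing RADOS
 * objects.  Stubbed here for the sketch. */
static void rbd_do_discard(void *rbd_dev, sector_t start, unsigned int nr_sects)
{
        /* issue zero/truncate ops to the OSDs for [start, start + nr_sects) */
}

static void rbd_enable_discard(struct request_queue *q)
{
        /* Advertise DISCARD support on the device's request queue. */
        queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
        blk_queue_max_discard_sectors(q, UINT_MAX);
}

static void rbd_request_fn(struct request_queue *q)
{
        struct request *req;

        while ((req = blk_fetch_request(q)) != NULL) {
                if (blk_discard_rq(req)) {
                        rbd_do_discard(q->queuedata,
                                       blk_rq_pos(req),      /* start sector  */
                                       blk_rq_sectors(req)); /* length (sectors) */
                        __blk_end_request_all(req, 0);
                        continue;
                }
                /* ... normal read/write path; completed with an error here
                 * only so the sketch does not leave the request dangling ... */
                __blk_end_request_all(req, -EIO);
        }
}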

#2 Updated by Sage Weil over 13 years ago

  • Tracker changed from Bug to Feature

#3 Updated by Sage Weil over 13 years ago

  • Category set to 7

#4 Updated by Sage Weil over 13 years ago

  • Estimated time set to 5.00 h
  • Source set to 3

#5 Updated by Sage Weil over 13 years ago

  • Project changed from 3 to Linux kernel client
  • Category deleted (7)

#6 Updated by Sage Weil about 13 years ago

  • Position deleted (504)
  • Position set to 394

#7 Updated by Sage Weil almost 13 years ago

  • Subject changed from Support TRIM to rbd: DISCARD support
  • Category set to rbd
  • Target version set to v3.0

#8 Updated by Sage Weil almost 13 years ago

  • Position deleted (400)
  • Position set to 1
  • Position changed from 1 to 564

#9 Updated by Sage Weil almost 13 years ago

  • Target version changed from v3.0 to v3.1
  • Position deleted (567)
  • Position set to 2

#10 Updated by Sage Weil almost 13 years ago

  • Target version changed from v3.1 to v3.0
  • Position deleted (2)
  • Position set to 565

#11 Updated by Sage Weil almost 13 years ago

  • Position deleted (565)
  • Position set to 9

#12 Updated by Sage Weil almost 13 years ago

  • Target version changed from v3.0 to v3.1

#13 Updated by Sage Weil over 12 years ago

  • Target version changed from v3.1 to v3.2

#14 Updated by Sage Weil about 12 years ago

  • Target version deleted (v3.2)

#15 Updated by Sage Weil about 12 years ago

  • Position deleted (29)
  • Position set to 28

#16 Updated by Sage Weil over 11 years ago

  • Target version set to v3.6
  • Position deleted (22)
  • Position set to 1

#17 Updated by Sage Weil over 11 years ago

  • Project changed from Linux kernel client to rbd
  • Category deleted (rbd)
  • Target version deleted (v3.6)

#18 Updated by Sage Weil over 11 years ago

  • Subject changed from rbd: DISCARD support to krbd: DISCARD support
  • Position deleted (8)
  • Position set to 8

#19 Updated by Josh Durgin over 11 years ago

  • Position deleted (34)
  • Position set to 35

#20 Updated by Josh Durgin over 11 years ago

  • Position deleted (35)
  • Position set to 1339

#21 Updated by Sage Weil over 11 years ago

  • Position deleted (1347)
  • Position set to 37

#22 Updated by Kyle Tarplee over 10 years ago

This seems like a pretty important feature to me. Without it, the storage backing a highly active (but nearly empty) filesystem will over time become thickly provisioned and waste lots of Ceph storage space. It seems this feature is necessary to make the krbd module a viable option.

#23 Updated by Sage Weil almost 10 years ago

  • Target version set to v0.80

#24 Updated by Neil Levine almost 10 years ago

  • Status changed from New to 12

#25 Updated by Josh Durgin almost 10 years ago

  • Status changed from 12 to Fix Under Review

#26 Updated by Ian Colle almost 10 years ago

  • Assignee set to Josh Durgin

#27 Updated by Ian Colle almost 10 years ago

  • Target version changed from v0.80 to sprint

#28 Updated by Ian Colle almost 10 years ago

  • Target version changed from sprint to v0.81

#29 Updated by Brian Cline almost 10 years ago

Definitely agree with Kyle. Due to this, and after finding out that an XFS fstrim within QEMU reports success but doesn't actually recover the free space from Ceph (as became evident with a test script at https://gist.github.com/briancline/9242487), I'm having to queue up all the large items that need deleting, spin up a QEMU instance once I have a significant enough amount of data that needs deleting, run a script similar to the one above (except performing all the large file deletes), shut down the VM once it's done and has unmounted the RBD volume cleanly, and then remount the volume on the proper hardware host.

Of course this is a hilariously terrible endeavor, so I attempted another [more] terrible alternative: periodically create a new RBD, copy data from a prior one into the new one, and delete the old RBD. Unfortunately, for the two main RBDs I need to recover space for, my cluster doesn't have adequate space to cover this [within each failure domain with 2 replicas].

So, I did a test run with 1 replica spread across two failure domains I invented in the CRUSH map, each simply spanning 5 properly-weighted OSDs, to even out the hit on free space a bit. But then one OSD got stuck after filling itself all the way up, and a PG housed on that OSD continually got stuck each time it attempted a scrub. That evolved into me deciding to scrap the idea and remove the new test RBD; when that removal got stuck, I stopped it and attempted to restart its OSD, which then refused to start when it read the log for that PG (don't worry, I already added a comment on a more proper bug report for that issue -- an assertion failure, basically).

My only other alternative is to map and mount the RBDs and operate all the necessary ongoing SMB sharing under QEMU entirely, where I won't be able to achieve the same performance characteristics that caused me to use hardware to begin with.

I know this is getting a bit long, but, particularly given how silly and terrible these options are, I do want to share with you guys the sort of war stories that originate from the lack of TRIM/discard in krbd.

If there's any way to bump up priority on this feature, that'd be really, really awesome -- I'd normally be more than willing to contribute the code to get the party started on a feature I feel so strongly about, especially having very good C chops, but unfortunately I don't have the kernel chops nor FS driver chops to tackle this.

Please let me know if I can be of some help.

#30 Updated by Ian Colle almost 10 years ago

  • Status changed from Fix Under Review to 7

#31 Updated by Ian Colle almost 10 years ago

  • Target version changed from v0.81 to v0.82

#32 Updated by Ian Colle almost 10 years ago

  • Target version changed from v0.82 to v0.85

#33 Updated by Ian Colle over 9 years ago

  • Assignee changed from Josh Durgin to Ilya Dryomov

#34 Updated by Ian Colle over 9 years ago

  • Target version changed from v0.85 to v0.86

#35 Updated by Ian Colle over 9 years ago

  • Project changed from rbd to Linux kernel client
  • Target version deleted (v0.86)

#36 Updated by Ian Colle over 9 years ago

  • Target version set to sprint

#37 Updated by Alphe Salas over 9 years ago

I agree with Kyle and Brian. This feature is necessary. I would like to have more information about its status than a recurrent "it will come in the next ceph version".

This feature is necessary to have better control over data and replica usage in a fixed-size environment of some tens of terabytes.

The workaround is to split the single big RBD image into smaller RBD images if you know beforehand that the data is meant to be deleted after a short period. Then, when those smaller RBD images are deleted, the space used by the data and its replicas really becomes available for further use.

This means more management involvement than creating a single big RBD image.

Ideally, if I want to make things smooth, I create a ceph cluster and then create an RBD image of a bit less than half its capacity, and I expect that DATA + REPLICAS will always remain at most within the 80% global usage I set (that is, if the OSDs balance their data usage properly, which is a problem too, since I already noticed that while some OSDs were stuck in near_full state, others saw low usage; normally trimming/discard should correct that behavior as well).

#39 Updated by Sage Weil over 9 years ago

This should go upstream to Linus in the next day or two (for 3.18-rc1).
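
Once a discard-capable kernel (3.18-rc1 or later) has the image mapped, the change can be exercised from userspace, for instance by running fstrim on, or mounting with -o discard, a filesystem on the mapped device. The sketch below is a hedged illustration rather than part of the kernel change itself: it issues a raw BLKDISCARD ioctl against a mapped device, where the device path /dev/rbd0 and the 4 MiB range are example values only:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKDISCARD */

int main(void)
{
        int fd = open("/dev/rbd0", O_WRONLY);
        if (fd < 0) {
                perror("open /dev/rbd0");
                return 1;
        }

        /* Discard 4 MiB starting at byte offset 0: {offset, length}. */
        uint64_t range[2] = { 0, 4ULL << 20 };
        if (ioctl(fd, BLKDISCARD, &range) < 0)
                perror("BLKDISCARD");
        else
                printf("discarded %llu bytes at offset %llu\n",
                       (unsigned long long)range[1],
                       (unsigned long long)range[0]);

        close(fd);
        return 0;
}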

#40 Updated by Ian Colle over 9 years ago

  • Status changed from 7 to Resolved

#41 Updated by Josh Durgin almost 9 years ago

  • Target version deleted (sprint)
