Quotas vs subtrees


Generalize and adapt the SnapRealm subtree mechanism into a generic subvolume/subtree concept that is (1) explicitly managed/visible to the admin, (2) used by both snapshots and quotas.


Current Status

Snapshots break the namespace into SnapRealms, which are subtree chunks that share the same snapshot context (i.e., have the same set of snapshots applied).
New SnapRealms are created when
  1. a snap is created at a new point in the hierarchy.
  2. a subdir in one snaprealm is renamed into another snaprealm. The subdir becomes the root of a new snaprealm that is nested inside the target and has a past_parent pointer to the former.

When the new realm is created it is a 'split' event. This is somewhat expensive and involves a message to the client that enumerates all of the inos with client caps that need to be moved into the child realm. The client thus has a coherent view of which realm any given inode belongs to at all times.
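As a rough illustration of why a split has to enumerate inodes, here is a minimal sketch in Python. It is a toy model, not the MDS code: the `Realm` class, the `split` function, and the use of paths in place of inode numbers are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Realm:
    """Toy stand-in for a SnapRealm: a subtree sharing one snapshot context."""
    root: str                                            # path of the realm root
    parent: "Realm | None" = None
    inodes_with_caps: set = field(default_factory=set)   # paths holding client caps


def split(parent: Realm, new_root: str):
    """Create a child realm at new_root and move the affected inodes into it.

    Returns the child realm plus the sorted list of inodes that a
    (hypothetical) split message would have to enumerate for the client.
    """
    prefix = new_root.rstrip("/") + "/"
    moved = sorted(p for p in parent.inodes_with_caps
                   if p == new_root or p.startswith(prefix))
    child = Realm(root=new_root, parent=parent, inodes_with_caps=set(moved))
    parent.inodes_with_caps -= child.inodes_with_caps
    return child, moved


if __name__ == "__main__":
    root = Realm("/", inodes_with_caps={"/a/x", "/a/b/y", "/a/b/c/z", "/d"})
    child, moved = split(root, "/a/b")
    print(moved)                          # inodes the client must re-home
    print(sorted(root.inodes_with_caps))  # inodes that stay in the parent realm
```

The cost is proportional to the number of inodes with caps under the new root, which is what makes the event expensive.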

Detailed Description

There are some challenges with the snaprealm code, particularly when dealing with the past_parents relationship. The trouble mostly arises when opening an inode in the cache: we need the past_parents in order to generate a valid SnapContext for the realm, but that past parent might be in some other part of the hierarchy and take time to resolve. Until we have it, we cannot issue caps to clients, and we currently aren't smart enough to avoid doing so. There is also some very complex code that manages propagation of rstat values to past parents after a snapshot has been taken.
The whole situation would be simplified if we did not allow renaming directories between subvolumes/snaprealms.
If we did that, there would be no past_parents, and the snapshot issues would get much simpler.
We could also make the subvolume management explicit, e.g.,
attr -s ceph.subvolume -V 1 mydir
or whatever, so that the admin decides where the subvolume boundaries are, and thus when a rename will fail with -EXDEV.
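The rename rule could look roughly like the following Python sketch. It assumes subvolume roots are known as a plain list of paths and that a rename is rejected whenever the source and destination parent directories fall in different subvolumes; `subvol_root` and `check_rename` are hypothetical names, not existing MDS functions.

```python
import errno
import posixpath


def subvol_root(path, subvol_roots):
    """Return the deepest subvolume root containing path ('/' if none matches)."""
    best = "/"
    for root in subvol_roots:
        r = root.rstrip("/")
        if (path == r or path.startswith(r + "/")) and len(r) > len(best):
            best = r
    return best


def check_rename(src, dst, subvol_roots):
    """Raise OSError(EXDEV) if the rename would cross a subvolume boundary."""
    src_vol = subvol_root(posixpath.dirname(src), subvol_roots)
    dst_vol = subvol_root(posixpath.dirname(dst), subvol_roots)
    if src_vol != dst_vol:
        raise OSError(errno.EXDEV, "rename crosses subvolume boundary")


if __name__ == "__main__":
    roots = ["/vol1", "/vol2"]                              # boundaries set by the admin
    check_rename("/vol1/a/dir", "/vol1/b/dir", roots)       # same subvolume: allowed
    try:
        check_rename("/vol1/a/dir", "/vol2/dir", roots)     # crosses a boundary
    except OSError as e:
        print(errno.errorcode[e.errno])
```

Because the boundaries are explicit, the admin can predict exactly which renames will be refused.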
If there were a subvol concept, then quotas would map onto that naturally.
What that buys us:
  1. clients know what root (inode) every open file belongs to, and thus what rstat value to pay attention to for quota
  2. same mds/client messages can manage the subvol <-> inode relationship
  3. when split is implemented in the future, we can piggyback on the split messages. On the other hand,
    1. snaprealms are implicitly created when you rename c from realm a to realm b. For quotas, we only care whether we are beneath b, not that we are inside a c nested inside a and b.
    2. so maybe we need to distinguish between snaprealm-things that are subvol roots and those that are not
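To make the first benefit concrete, here is a small Python sketch of a client-side quota check. It assumes the client already holds a map from each open inode to its subvol root and a cached rstat byte count for that root; `SubvolRoot`, `write_allowed`, and the convention that a `max_bytes` of 0 means "no limit" are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class SubvolRoot:
    ino: int
    rbytes: int      # recursive byte count (rstat) as last seen by the client
    max_bytes: int   # quota on this root; 0 means "no limit" in this sketch


def write_allowed(ino, nbytes, ino_to_root, roots):
    """Check a write of nbytes to inode ino against its subvol root's quota."""
    root = roots[ino_to_root[ino]]
    if root.max_bytes == 0:
        return True
    return root.rbytes + nbytes <= root.max_bytes


if __name__ == "__main__":
    roots = {1: SubvolRoot(ino=1, rbytes=900, max_bytes=1000)}
    ino_to_root = {10: 1}                      # open file 10 lives under root 1
    print(write_allowed(10, 100, ino_to_root, roots))
    print(write_allowed(10, 101, ino_to_root, roots))
```

The check needs only one lookup per write because each inode belongs to exactly one root.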
Option 1
  1. rename SnapRealm to SubvolRealm
  2. rename MClientSnap message to MClientSubvol or similar
  3. separate new realm creation into an explicit subvol creation op, triggered by a vxattr or new mds op
  4. only allow quotas to be set on subvol roots
  5. use existing snapbl (renamed subvolbl) to associate all inodes with the subvol root
  6. [maybe] allow rename between subvols with no snaps
    1. add a new MOVE op, distinct from but similar to split, that simply moves inodes to a different realm. This will be used when you rename a dir between subvols.
  7. [someday] enable rename between subvols with snaps
    1. add a SubvolRealm property that indicates whether it is a subvol root or not
    2. make split work to enable snaps vs renames.
    3. mds: fix things with opening past_parents
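One way to picture steps 4 and 7.1 together is the Python sketch below. It assumes a hypothetical `is_subvol_root` flag on each realm: quotas are refused on realms without the flag, and the quota-relevant realm for any inode is found by walking up past implicitly created realms. None of these names exist in the current code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubvolRealm:
    name: str
    parent: Optional["SubvolRealm"] = None
    is_subvol_root: bool = False   # hypothetical property from step 7.1
    max_bytes: int = 0             # 0 means no quota in this sketch


def set_quota(realm, max_bytes):
    """Only explicit subvol roots may carry a quota (step 4)."""
    if not realm.is_subvol_root:
        raise ValueError("quotas may only be set on subvol roots")
    realm.max_bytes = max_bytes


def quota_realm(realm):
    """Walk up the realm chain, skipping realms that are not subvol roots."""
    while realm is not None and not realm.is_subvol_root:
        realm = realm.parent
    return realm


if __name__ == "__main__":
    b = SubvolRealm("b", is_subvol_root=True)
    c = SubvolRealm("c", parent=b)          # implicit realm created by a rename
    set_quota(b, 1 << 30)
    print(quota_realm(c).name)              # quota is accounted against b
```

This keeps nested, rename-created realms available for snapshots while quotas see only the explicit roots.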
Option 2
  1. add a new qtree (or subvol) construct
  2. instantiate in client cache and mds cache
  3. chain all inodes to the subvol they belong to
  4. mark subvol in any inodestat reply to client
  5. add a new MOVE message used on rename
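Option 2 can be sketched as a small standalone structure in Python. This is only an assumption about shape: `Subvol`, `Cache`, `inodestat`, and `handle_move` are made-up names standing in for the qtree construct, the client or MDS cache, the inodestat reply, and the handler for the proposed MOVE message.

```python
class Subvol:
    """Toy qtree/subvol: remembers which inodes are chained to it."""

    def __init__(self, name):
        self.name = name
        self.inodes = set()


class Cache:
    """Toy cache holding the inode -> subvol association (steps 2 and 3)."""

    def __init__(self):
        self.subvol_of = {}   # ino -> Subvol

    def link(self, ino, subvol):
        """Chain ino to subvol, unchaining it from any previous subvol."""
        old = self.subvol_of.get(ino)
        if old is not None:
            old.inodes.discard(ino)
        subvol.inodes.add(ino)
        self.subvol_of[ino] = subvol

    def inodestat(self, ino):
        """Reply that marks the subvol the inode belongs to (step 4)."""
        return {"ino": ino, "subvol": self.subvol_of[ino].name}

    def handle_move(self, inos, target):
        """Apply a MOVE message: re-chain the listed inodes to target (step 5)."""
        for ino in inos:
            self.link(ino, target)


if __name__ == "__main__":
    a, b = Subvol("a"), Subvol("b")
    cache = Cache()
    for ino in (100, 101, 102):
        cache.link(ino, a)
    cache.handle_move([101, 102], b)        # directory renamed from a into b
    print(cache.inodestat(101))
    print(sorted(a.inodes), sorted(b.inodes))
```

Unlike Option 1, nothing here touches snapshot state, so the quota structure can evolve independently of SnapRealms.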