MDS - reduce memory consumption


The MDS internal cache structs are very large, which limits how much metadata ceph-mds can cache at a time. Most of their fields are only used when metadata is dirty.



Interested Parties

  • Sage Weil (Inktank)
  • Danny Al-Gaaf

Current Status

The CInode struct is larger than 1 KB, and CDir and CDentry are also quite large. Most fields are only used for dirty metadata. On startup, ceph-mds dumps the struct sizes to its log.
The cache size is currently controlled by a simple count of inodes (the mds cache size option).

Detailed Description

Since most of the fields are only used when metadata is dirtied, they can be moved into an auxiliary structure that is allocated on the heap when necessary. For example, CInode could have a member dirty_state_t *ds; that is allocated when it is dirtied and freed when the changes fully commit and flush.
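As a rough illustration of the idea above, the following C++ sketch shows a lazily allocated auxiliary structure. It is not the real ceph-mds code: the CInode shown here is a stripped-down stand-in, the contents of dirty_state_t and the helper names mark_dirty() and mark_clean() are hypothetical, and std::unique_ptr is used in place of the raw pointer mentioned above purely to keep the sketch short.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical auxiliary structure: holds fields that are only
// meaningful while the inode is dirty. The array is a placeholder
// for the bulk of the rarely used fields.
struct dirty_state_t {
    uint64_t dirty_version = 0;
    uint64_t placeholder[32] = {};
};

// Simplified stand-in for CInode (not the real ceph-mds class).
struct CInode {
    uint64_t ino = 0;
    std::unique_ptr<dirty_state_t> ds;  // null while the inode is clean

    bool is_dirty() const { return ds != nullptr; }

    // Hypothetical helper: allocate the auxiliary state the first
    // time the inode is dirtied, then record the version.
    dirty_state_t& mark_dirty(uint64_t version) {
        if (!ds) {
            ds = std::make_unique<dirty_state_t>();
        }
        ds->dirty_version = version;
        return *ds;
    }

    // Hypothetical helper: free the auxiliary state once the
    // changes have fully committed and flushed.
    void mark_clean() { ds.reset(); }
};

// A clean inode now costs one pointer instead of the whole dirty state.
static_assert(sizeof(CInode) < sizeof(dirty_state_t),
              "clean inode should be smaller than its dirty state");
```

With this layout a clean inode pays only for a single pointer, and the larger structure exists only between the first call to mark_dirty() and the matching mark_clean().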
Dirty or modified metadata goes through two phases. The first is the "pre-dirty" or "projected" phase, in which changes exist only in memory and track state while we wait for the modification to reach the journal. The second is the (much longer) period in which the metadata is durable and committed but still pinned in memory because the change has not yet been written to the per-directory metadata object.
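The two phases can be sketched as a small state machine with one auxiliary structure per phase. This is only one possible shape, assuming two separate structures are wanted. Every name here (InodeSketch, projected_state_t, journaled_state_t, project_change(), on_journaled(), on_written_to_directory_object()) is hypothetical and does not correspond to actual ceph-mds methods.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

enum class Phase { Clean, Projected, Dirty };

// Hypothetical per-phase auxiliary structures.
struct projected_state_t { uint64_t pending_size = 0; };
struct journaled_state_t { uint64_t journaled_version = 0; };

// Simplified stand-in for an inode; not the real CInode.
struct InodeSketch {
    uint64_t size = 0;
    uint64_t version = 0;
    std::unique_ptr<projected_state_t> projected;  // phase 1 only
    std::unique_ptr<journaled_state_t> dirty;      // phase 2 only

    // An inode that is already dirty may be projected again; this
    // sketch reports Projected whenever a projection is outstanding.
    Phase phase() const {
        if (projected) return Phase::Projected;
        if (dirty) return Phase::Dirty;
        return Phase::Clean;
    }

    // Phase 1 starts: the change exists only in memory.
    void project_change(uint64_t new_size) {
        projected = std::make_unique<projected_state_t>();
        projected->pending_size = new_size;
    }

    // The journal write completed: apply the projected change, free
    // the projected state, and enter the (longer) dirty phase.
    void on_journaled() {
        assert(projected && "no projected change to apply");
        size = projected->pending_size;
        ++version;
        projected.reset();
        if (!dirty) {
            dirty = std::make_unique<journaled_state_t>();
        }
        dirty->journaled_version = version;
    }

    // The change reached the per-directory metadata object: the
    // inode is clean again and the dirty state can be freed.
    void on_written_to_directory_object() { dirty.reset(); }
};
```

Keeping the two structures separate means the short-lived projected state is released as soon as the journal write completes, while only the smaller dirty state stays allocated for the longer second phase.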

Work items

Coding tasks

  1. CInode: classify which fields are necessary for projected changes and which are needed for dirty (journaled) metadata. Decide whether we want two auxiliary structures, one for each phase, or just one for projected changes.
  2. CInode: create CInode substructure(s) and any helpers related to access or allocation/deallocation
  3. CInode: wire allocation/deallocation into projected/predirty lifecycle (allocation in project_...(), deallocation in pop_dirty_projected())
  4. CInode: wire allocation/deallocation into dirty/journaled lifecycle (allocation in the predirty or dirty methods, deallocation when metadata is finally written to the directory fragment object)
  5. CInode: move fields into substructure. This can be iterative, probably one patch for each field or related group of fields.
  6. CDentry: repeat
  7. CDir: repeat
  8. Create a Boost memory pool for the substructures for better allocator efficiency
  9. Consider whether any inode_t or dirfrag_t fields should be dynamically allocated
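To illustrate why a pool helps with the many small, same-sized substructures in the work items above, here is a minimal free-list pool written with the standard library only. It is a stand-in for a library pool such as Boost's, not an example of the Boost API. FixedPool, its methods, and the dirty_state_t contents are all hypothetical, and the sketch ignores thread safety and exception safety.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <utility>
#include <vector>

// Hypothetical substructure to be pooled.
struct dirty_state_t {
    uint64_t dirty_version = 0;
};

// Minimal fixed-size free-list pool. Memory is obtained in chunks
// and freed slots are reused in LIFO order.
template <typename T>
class FixedPool {
    union Slot {
        Slot* next;                                   // valid while free
        alignas(T) unsigned char storage[sizeof(T)];  // valid while in use
    };

    std::vector<Slot*> chunks_;
    Slot* free_ = nullptr;
    std::size_t chunk_slots_;
    std::size_t live_ = 0;

    // Allocate one more chunk and thread its slots onto the free list.
    void grow() {
        Slot* chunk = new Slot[chunk_slots_];
        chunks_.push_back(chunk);
        for (std::size_t i = 0; i < chunk_slots_; ++i) {
            chunk[i].next = free_;
            free_ = &chunk[i];
        }
    }

public:
    explicit FixedPool(std::size_t chunk_slots = 64)
        : chunk_slots_(chunk_slots ? chunk_slots : 1) {}
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Releases the chunks; callers must destroy() live objects first.
    ~FixedPool() {
        for (Slot* c : chunks_) delete[] c;
    }

    // Take a slot from the free list and construct a T in it.
    template <typename... Args>
    T* create(Args&&... args) {
        if (!free_) grow();
        Slot* s = free_;
        free_ = s->next;
        ++live_;
        return new (s->storage) T(std::forward<Args>(args)...);
    }

    // Run the destructor and push the slot back onto the free list.
    void destroy(T* p) {
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
        --live_;
    }

    std::size_t live() const { return live_; }
};
```

Because every slot has the same size, allocation and deallocation reduce to a pointer swap, and a slot freed when one inode becomes clean is immediately reusable when another is dirtied.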