CephFS - separate purge queue from MDCache

Recently, throttling was added to the process by which the MDS purges deleted files. The motivation was to prevent the MDS from aggressively issuing a huge number of operations in parallel to the RADOS cluster.

That worked, but it has created a new problem. Now, when there are many stray inodes waiting in the queue to be purged, they occupy space in the MDCache, and cause the MDS to stop properly respecting the cache size limit. Because that cache size limit is typically set to reflect the physical limitations of the MDS host, that is a problem.

We either need to throttle incoming unlink operations (i.e. apply back-pressure to the client to reflect the limited rate at which we can purge), or modify the MDS's stray handling to remove the need to keep all the purgeable-strays in cache until they're purged. This blueprint describes the latter approach.

Related issue:

John Spray (Red Hat)

Interested Parties
Name (Affiliation)

Current Status
While implementing throttling, stray handling was refactored into StrayManager and tests were added. This blueprint builds on that work.

Detailed Description

The purge queue becomes a persistent work queue. In circumstances where we currently put a CDentry on the in-memory queue, we would instead write it to the work queue, and once that write is persistent we would follow the current path for a completed purge. To avoid issuing a write for every file being purged, we probably need to buffer these up. One option is to re-use Journaler.
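The buffering idea can be illustrated with a minimal sketch. Everything here is hypothetical: `BufferedPurgeQueue`, `batch_size`, and the `on_persistent` callback are names invented for illustration, and a plain Python list stands in for the RADOS-backed log. It is not Journaler's interface. The point is only that several items share one write, and the "purge complete" path runs only after the batch has been flushed.

```python
from typing import Callable, List, Tuple


class BufferedPurgeQueue:
    """Hypothetical append-only purge queue that batches its writes."""

    def __init__(self, backing_log: List[bytes], batch_size: int = 4) -> None:
        self.backing_log = backing_log  # stand-in for a persistent, RADOS-backed log
        self.batch_size = batch_size    # assumed tunable, not a real Ceph option
        self._pending: List[Tuple[bytes, Callable[[], None]]] = []

    def push(self, item: bytes, on_persistent: Callable[[], None]) -> None:
        """Buffer an item; on_persistent fires only once the item is flushed."""
        self._pending.append((item, on_persistent))
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write all buffered items as one batch, then run their callbacks."""
        batch, self._pending = self._pending, []
        # One extend() models one write covering the whole batch.
        self.backing_log.extend(item for item, _ in batch)
        for _, on_persistent in batch:
            # At this point the dentry could be dropped from cache.
            on_persistent()
```

A real implementation would also flush on a timer, so that a lone item is not held in the buffer indefinitely.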

The consumed work items are not going to complete in order (files take radically different lengths of time to purge). That could leave a region of the queue consisting mostly of already-purged items that we cannot trim yet, because one awkward, enormous file in that region is still purging.
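A small sketch of why out-of-order completion delays trimming, assuming for simplicity that work items occupy consecutive integer positions in the queue (a real log would use byte offsets). `TrimTracker` and `expire_pos` are hypothetical names. The trim point can only advance over a contiguous prefix of completed items, so a single slow item at the front holds back everything behind it.

```python
class TrimTracker:
    """Tracks how far a queue can be trimmed when items finish out of order."""

    def __init__(self) -> None:
        self.expire_pos = 0          # every position below this may be trimmed
        self._done: set[int] = set() # completed positions at or above expire_pos

    def complete(self, pos: int) -> int:
        """Record that the item at pos has been purged; return the new trim point."""
        if pos < self.expire_pos:
            raise ValueError(f"position {pos} was already trimmed")
        self._done.add(pos)
        # Advance only across a contiguous run of completed items.
        while self.expire_pos in self._done:
            self._done.remove(self.expire_pos)
            self.expire_pos += 1
        return self.expire_pos


tracker = TrimTracker()
tracker.complete(1)   # small file, finished quickly
tracker.complete(2)   # small file, finished quickly
# Position 0 (the enormous file) is still purging, so nothing can be trimmed.
stalled_at = tracker.expire_pos
tracker.complete(0)   # the large file finally finishes
caught_up_at = tracker.expire_pos
```

In this run `stalled_at` is 0 and `caught_up_at` is 3: the two finished items could not be trimmed until the item ahead of them completed.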

When the consumer is local to the producer and throttle slots are available (i.e. nothing is waiting in the queue), avoid a spurious read of the just-written work item by starting it immediately and advancing the queue read position past it.
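The fast path might look roughly like the following sketch. The class `LocalPurgeConsumer`, the `max_in_flight` throttle, and the `reads` counter are all hypothetical, and a Python list again stands in for the persistent queue. An item is started straight from memory only when the reader has caught up with the writer and a throttle slot is free; otherwise it waits in the queue and is read back later.

```python
from typing import Dict, List


class LocalPurgeConsumer:
    """Consumer co-located with the producer, with a read-avoiding fast path."""

    def __init__(self, log: List[str], max_in_flight: int = 2) -> None:
        self.log = log                      # stand-in for the persistent queue
        self.max_in_flight = max_in_flight  # assumed throttle limit
        self.in_flight: Dict[int, str] = {} # position -> item being purged
        self.read_pos = 0                   # next position to consume
        self.reads = 0                      # read-backs issued against the queue

    def enqueue(self, item: str) -> int:
        """Persist an item; start it at once if nothing is waiting ahead of it."""
        pos = len(self.log)
        self.log.append(item)
        caught_up = self.read_pos == pos    # no older item is still queued
        if caught_up and len(self.in_flight) < self.max_in_flight:
            self.in_flight[pos] = item      # use the in-memory copy, no read
            self.read_pos = pos + 1         # skip past the just-written item
        return pos

    def finish(self, pos: int) -> None:
        """Mark a purge complete and pull more work from the queue if any."""
        del self.in_flight[pos]
        while (self.read_pos < len(self.log)
               and len(self.in_flight) < self.max_in_flight):
            item = self.log[self.read_pos]  # slow path: read the item back
            self.reads += 1
            self.in_flight[self.read_pos] = item
            self.read_pos += 1
```

With `max_in_flight=2`, the first two enqueued items start without any read; a third waits in the queue and costs one read-back when a slot frees up.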

On MDS rank shutdown, we could:
* add a message for MDSs to hand off their queue to another MDS
* have the dying MDS wait until it has cleared its queue
* have the dying MDS leave its queue on disk and have another MDS pick it up, either by reading from an in-RADOS table recording which queues exist, or by adding the queue to the MDSmap and having MDSMonitor explicitly assign an old queue to an MDS.

Recovery and repair: this would be an addition to what cephfs-journal-tool/cephfs-table-tool already do, allowing the purge queue to be reset in case of damage.