pybind/mgr/volumes: Add the ability to keep snapshots of subvolumes independent of the source subvolume
From the perspective of CSI and its volume life cycle management, a snapshot of a volume is expected to survive beyond the volume itself. In other words, the volume may be deleted and later recreated from one of its prior snapshots.
Although the CSI protocol has changed over time to allow snapshots to depend on their sources, disallowing source volume deletion while snapshots exist, this is not a natural flow of events and life cycle management operations.
It is hence desired that snapshots remain independent from the source subvolume, to aid such life cycle operations as detailed above.
With CephFS, subvolume snapshots are taken at the directory level of the subvolume, and are hence dependent on the subvolume: to delete the subvolume, all snapshots within it must be deleted first. This breaks the desired state described above.
Backport note: as this pertains to CSI, the usual request is to see how best this can be backported as far back as Nautilus, to support existing installations.
On discussion with Ramana, it was thought that we could take the snapshot at a level higher than the subvolume, since the subvolume path now contains a UUID component (added for cloning reasons). A delete of a subvolume then becomes independent of its snapshots, as the snapshots live outside the leaf subvolume directory.
For example, the current subvolume directory structure is /volumes/<group-name>/<user-provided-subvol-name>/<cephfs-gen-uuid>/, where the subvolume mount path is the entire directory path, so all user data lives within <cephfs-gen-uuid>. Snapshots are currently taken of this directory, and hence, as long as snapshots of it exist, it cannot be deleted.
If the snapshots were instead taken of <user-provided-subvol-name>, we could delete the <cephfs-gen-uuid> directory when the volume needs to be deleted, while retaining the snapshots, independent of it, just above it.
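The proposed layout can be sketched as follows. This is a minimal illustration only; the helper names and VOLUME_ROOT constant are hypothetical, not the actual mgr/volumes code, and it relies on CephFS exposing snapshots via the special .snap directory.

```python
import os
import uuid

# Hypothetical constant; the real root is configured by mgr/volumes.
VOLUME_ROOT = "/volumes"

def subvolume_path(group, subvol, incarnation_uuid):
    # Full mount path; all user data lives under the uuid directory.
    return os.path.join(VOLUME_ROOT, group, subvol, incarnation_uuid)

def snapshot_path(group, subvol, snap):
    # Snapshot taken one level up, at <user-provided-subvol-name>, so
    # the uuid directory can be deleted while snapshots survive.
    return os.path.join(VOLUME_ROOT, group, subvol, ".snap", snap)

u = str(uuid.uuid4())
print(subvolume_path("grp0", "sub0", u))
print(snapshot_path("grp0", "sub0", "snap0"))  # /volumes/grp0/sub0/.snap/snap0
```

Deleting subvolume_path(...) then removes only the uuid directory; everything under .snap at the level above is untouched.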
#2 Updated by Venky Shankar 8 months ago
- Status changed from Need More Info to New
After some discussion and agreeing on the approach, below is the proposed design:
Direct Addressing Scheme for Snapshots
Snapshots are tied to subvolumes. This enforces a scheme that requires subvolume access for addressing snapshots, which in turn requires searching all subvolume incarnations to look up a snapshot. To mitigate the cost of this search, snapshots are taken at the subvolume directory level (rather than just the user data) and include the active subvolume metadata. This decouples snapshots from subvolumes. Algorithmically, the cost of searching for a snapshot is still O(N); however, most implementations will optimize the search by performing a single filesystem call rather than one call per subvolume incarnation. The metadata in a snapshot plays a critical role, which is detailed in subsequent sections.
Snapshot listing does not involve aggregating snapshots from all incarnations. Again, algorithmically, the cost is still O(N), but most implementations are optimized by performing a single filesystem call. Subvolume listing, however, still needs to exclude inactive subvolumes.
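The single-call listing can be sketched as below. The function name is illustrative; the simulation uses an ordinary directory, whereas on a real CephFS mount .snap is a virtual directory maintained by the filesystem.

```python
import os
import tempfile

def list_snapshots(subvol_base):
    # One filesystem call: snapshots live under <subvol>/.snap,
    # independent of any uuid incarnation directories beneath it.
    snap_dir = os.path.join(subvol_base, ".snap")
    try:
        return sorted(os.listdir(snap_dir))
    except FileNotFoundError:
        return []

# Simulate with a regular directory (real CephFS exposes .snap virtually).
base = tempfile.mkdtemp()
for snap in ("snap0", "snap1"):
    os.makedirs(os.path.join(base, ".snap", snap))
print(list_snapshots(base))  # ['snap0', 'snap1']
```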
Maintaining Subvolume Incarnations
The UUID directory for the subvolume will be pruned, but its metadata kept intact for future referencing (for snapshots). Implementation-wise, the metadata file would be suffixed with an incarnation-id. The active subvolume's metadata will have a "pointer" (in the form of a symbolic link) to its metadata file.
/volumes/<group>/sub0/.meta.0
/volumes/<group>/sub0/.meta.1
/volumes/<group>/sub0/<uuid3>
/volumes/<group>/sub0/.meta.2
/volumes/<group>/sub0/.meta -> ./.meta.2
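The metadata-file-plus-symlink arrangement above can be sketched like this. The helper name and file creation are illustrative assumptions, not the actual mgr/volumes implementation (which stores structured metadata in these files).

```python
import os
import tempfile

def activate_incarnation(subvol_base, incarnation_id):
    # Each incarnation keeps its own metadata file; the bare ".meta"
    # symlink always points at the active incarnation's file.
    meta = os.path.join(subvol_base, ".meta.{}".format(incarnation_id))
    open(meta, "w").close()
    link = os.path.join(subvol_base, ".meta")
    if os.path.islink(link):
        os.unlink(link)
    os.symlink("./.meta.{}".format(incarnation_id), link)

base = tempfile.mkdtemp()
for inc in (0, 1, 2):  # three successive incarnations of the same subvolume
    activate_incarnation(base, inc)
print(os.readlink(os.path.join(base, ".meta")))  # ./.meta.2
```

All three .meta.N files remain on disk, so snapshots taken against earlier incarnations can still resolve their metadata.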
Subvolume snapshot operations need to update subvolume metadata. A subvolume could have been created in an earlier incarnation than the current one. So how does a snapshot operation map a snapshot to its subvolume incarnation? Remember that snapshots include the active subvolume metadata -- the subvolume metadata in a snapshot points to the incarnation during which the snapshot was created, and hence to the incarnation of the subvolume itself. I.e., the subvolume metadata in a snapshot is a wormhole to its subvolume incarnation. This simplifies snapshot operations, especially cloning, which would otherwise have to store the incarnation-id in the clone subvolume metadata (c.f. Clone Metadata).
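The "wormhole" lookup can be sketched as follows, assuming the symlink layout described above. The simulation again uses regular directories in place of CephFS's virtual .snap tree, and the parsing of the incarnation-id out of the link target is an illustrative assumption.

```python
import os
import tempfile

def incarnation_for_snapshot(subvol_base, snap_name):
    # The snapshot captures the subvolume directory, including its
    # ".meta" symlink, so reading the link from inside the snapshot
    # yields the incarnation active when the snapshot was taken.
    link = os.path.join(subvol_base, ".snap", snap_name, ".meta")
    target = os.readlink(link)          # e.g. "./.meta.2"
    return target.rsplit(".", 1)[-1]    # trailing incarnation-id

# Simulate: a snapshot whose captured ".meta" pointed at incarnation 2.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, ".snap", "snap0"))
os.symlink("./.meta.2", os.path.join(base, ".snap", "snap0", ".meta"))
print(incarnation_for_snapshot(base, "snap0"))  # 2
```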
Metadata Garbage Collection
The issue of garbage collecting subvolume metadata remains. Since snapshots are directly addressable, there is no way of knowing when all snapshots for a subvolume incarnation have been deleted. To mitigate this, the subvolume metadata needs to keep track of the active snapshot count. A snapshot delete operation adjusts the active snapshot count for its respective subvolume and prunes the metadata when no snapshots remain. Note that the metadata purge should only be performed when the snapshot being deleted is not part of the currently active subvolume.
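The counting and pruning rule can be sketched as a minimal in-memory model. Class and method names are hypothetical; in the real implementation the count would live in the per-incarnation metadata file and "pruning" would unlink that file.

```python
class SubvolumeMetadata:
    """In-memory sketch of per-incarnation snapshot counting (illustrative)."""

    def __init__(self):
        self.active_incarnation = None
        self.snap_count = {}  # incarnation-id -> active snapshot count

    def create(self, incarnation_id):
        # Subvolume (re)created: this incarnation becomes active.
        self.active_incarnation = incarnation_id
        self.snap_count.setdefault(incarnation_id, 0)

    def snapshot_create(self, incarnation_id):
        self.snap_count[incarnation_id] += 1

    def snapshot_delete(self, incarnation_id):
        self.snap_count[incarnation_id] -= 1
        # Prune metadata only when no snapshots remain AND the
        # incarnation is no longer the active subvolume.
        if (self.snap_count[incarnation_id] == 0
                and incarnation_id != self.active_incarnation):
            del self.snap_count[incarnation_id]
            return True  # metadata file would be unlinked here
        return False

md = SubvolumeMetadata()
md.create("uuid-A")
md.snapshot_create("uuid-A")
md.create("uuid-B")                   # subvolume recreated: new incarnation
print(md.snapshot_delete("uuid-A"))   # True: old incarnation's metadata pruned
```

Deleting the last snapshot of the active incarnation returns False, since the live subvolume still needs its metadata.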
The incarnation-id assigned to a subvolume should be distinct each time the subvolume is created. Choices here range from distinct strings and integer values to UUIDs; implementations can choose whatever fits the system best. UUIDs are perhaps the best choice for almost all implementations (with the added benefit of reusing the uuid component from the subvolume path).