Feature #45729

closed

pybind/mgr/volumes: Add the ability to keep snapshots of subvolumes independent of the source subvolume

Added by Shyamsundar Ranganathan almost 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Category: -
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: octopus,nautilus
Reviewed:
Affected Versions:
Component(FS): mgr/volumes
Labels (FS):
Pull request ID:

Description

From the perspective of CSI and its volume life cycle management, a snapshot of a volume is expected to survive beyond the volume itself. In other words, the volume may be deleted and later recreated from one of its prior snapshots.

Although the CSI protocol has changed over time to allow snapshots to depend on their sources, and to disallow source volume deletion if snapshots exist, this is not a natural flow of events for life cycle management operations.

It is hence desired that snapshots remain independent from the source subvolume, to aid such life cycle operations as detailed above.

With CephFS, subvolume snapshots are taken at the directory level of the subvolume and hence depend on the subvolume. To delete the subvolume, all snapshots within it must be deleted first. This breaks the desired state described above.

Backport note: As this pertains to CSI, the usual request is to see how best this can be backported as far back as Nautilus to support existing installations.

Solution thoughts:
On discussion with Ramana, it was thought that we could take a snapshot at a higher level than the subvolume, as we now have a subvolume path with a UUID in it for cloning reasons. Thus, a delete of a subvolume is independent of the snapshots, as these are outside the leaf subvolume directory.

For example, the current subvolume directory structure is /volumes/<group-name>/<user-provided-subvol-name>/<cephfs-gen-uuid>/ where the subvolume mount path is the entire directory path, and hence all user data exists within <cephfs-gen-uuid>. Currently, snapshots are taken of this directory, so it cannot be deleted as long as snapshots of it exist.

If snapshots were instead taken of <user-provided-subvol-name>, we could delete the <cephfs-gen-uuid> directory when the volume needs to be deleted, while retaining the snapshots independently, one level above it.
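The difference between the two snapshot locations can be sketched with plain path arithmetic. This is only an illustration: the helper names are hypothetical and not part of mgr/volumes, and the snapshot directory is assumed to be CephFS's default `.snap`.

```python
from pathlib import PurePosixPath

VOLUMES_ROOT = PurePosixPath("/volumes")
SNAP_DIR = ".snap"  # assumption: the default CephFS snapshot directory name


def subvolume_base(group: str, subvol: str) -> PurePosixPath:
    """/volumes/<group-name>/<user-provided-subvol-name>"""
    return VOLUMES_ROOT / group / subvol


def data_path(group: str, subvol: str, gen_uuid: str) -> PurePosixPath:
    """Mount path that holds the user data (the leaf UUID directory)."""
    return subvolume_base(group, subvol) / gen_uuid


def snapshot_path_current(group: str, subvol: str, gen_uuid: str, snap: str) -> PurePosixPath:
    """Current scheme: the snapshot lives inside the UUID directory."""
    return data_path(group, subvol, gen_uuid) / SNAP_DIR / snap


def snapshot_path_proposed(group: str, subvol: str, snap: str) -> PurePosixPath:
    """Proposed scheme: the snapshot lives one level up, beside the UUID directory."""
    return subvolume_base(group, subvol) / SNAP_DIR / snap


leaf = data_path("grp", "sub0", "uuid-1")
current = snapshot_path_current("grp", "sub0", "uuid-1", "snap1")
proposed = snapshot_path_proposed("grp", "sub0", "snap1")

# Under the current scheme the snapshot is below the leaf directory;
# under the proposed scheme it is not, so the leaf can be removed on its own.
print(leaf in current.parents)   # snapshot depends on the leaf directory
print(leaf in proposed.parents)  # snapshot is outside the leaf directory
```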


Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #46820: octopus: pybind/mgr/volumes: Add the ability to keep snapshots of subvolumes independent of the source subvolume (Resolved, Shyamsundar Ranganathan)
Copied to CephFS - Backport #46821: nautilus: pybind/mgr/volumes: Add the ability to keep snapshots of subvolumes independent of the source subvolume (Resolved, Shyamsundar Ranganathan)
Actions #1

Updated by Patrick Donnelly almost 4 years ago

  • Status changed from New to Need More Info
  • Assignee set to Shyamsundar Ranganathan
  • Target version set to v16.0.0
  • Backport changed from octopus, nautilus to octopus,nautilus
Actions #2

Updated by Venky Shankar almost 4 years ago

  • Status changed from Need More Info to New

After some discussion and agreeing on the approach, below is the proposed design:

Direct Addressing Scheme for Snapshots

Snapshots are tied to subvolumes. This enforces a scheme that requires subvolume access for addressing snapshots, and such a scheme requires searching all subvolume incarnations to look up a snapshot. To mitigate the cost of searching, snapshots are taken at the subvolume directory level (rather than of just the user data) and include the active subvolume metadata. This decouples snapshots from subvolumes. Algorithmically, the cost of searching for a snapshot is still O(N); however, most implementations will optimize the search by performing a single filesystem call rather than one call per subvolume incarnation. Metadata in a snapshot plays a critical role, which is detailed in subsequent sections.

Snapshot listing does not involve aggregating snapshots from all incarnations. Again, the algorithmic cost is still O(N), but most implementations are optimized by performing a single filesystem call. Subvolume listing, however, still needs to exclude inactive subvolumes.
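A minimal sketch of the single-call listing, simulated on a local temporary directory rather than a CephFS mount. The function name and the `.snap` directory name are assumptions made for illustration; this is not the mgr/volumes implementation.

```python
import os
import tempfile


def list_snapshots(subvol_base: str, snapdir: str = ".snap") -> list:
    """List all snapshots of a subvolume with one directory read.

    Because snapshots sit at the subvolume directory level, there is no
    need to visit each incarnation and aggregate the results.
    """
    return sorted(os.listdir(os.path.join(subvol_base, snapdir)))


# Local stand-in for /volumes/<group>/sub0 with two snapshots.
with tempfile.TemporaryDirectory() as root:
    base = os.path.join(root, "sub0")
    for snap in ("snap-a", "snap-b"):
        os.makedirs(os.path.join(base, ".snap", snap))
    snaps = list_snapshots(base)

print(snaps)
```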

Maintaining Subvolume Incarnations

The UUID directory for the subvolume will be pruned, but its metadata will be kept intact for future reference (for snapshots). Implementation-wise, the metadata file name would carry the incarnation-id as a suffix (e.g. .meta.2). An active subvolume will have a "pointer" (in the form of a symbolic link) to its metadata file.

/volumes/<group>/sub0/.meta.0
/volumes/<group>/sub0/.meta.1
/volumes/<group>/sub0/<uuid3>
/volumes/<group>/sub0/.meta.2
/volumes/<group>/sub0/.meta  -> ./.meta.2
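The layout above can be reproduced on a local temporary directory as a rough sketch. The helper names and the metadata file contents are hypothetical; only the `.meta.<incarnation-id>` files and the `.meta` symbolic link come from the proposal.

```python
import os
import shutil
import tempfile


def activate_incarnation(base: str, incarnation_id: str, uuid_dir: str) -> None:
    """Create the data directory and metadata file, then repoint .meta at it."""
    os.makedirs(os.path.join(base, uuid_dir))
    meta_name = f".meta.{incarnation_id}"
    with open(os.path.join(base, meta_name), "w") as f:
        f.write(f"path={uuid_dir}\n")  # hypothetical metadata content
    # Build the new link under a temporary name and rename it into place,
    # so that .meta always points at some metadata file.
    link = os.path.join(base, ".meta")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(f"./{meta_name}", tmp_link)
    os.replace(tmp_link, link)


def retire_incarnation(base: str, uuid_dir: str) -> None:
    """Prune the UUID directory but keep its .meta.<id> file."""
    shutil.rmtree(os.path.join(base, uuid_dir))


with tempfile.TemporaryDirectory() as root:
    base = os.path.join(root, "volumes", "group", "sub0")
    os.makedirs(base)
    for n, uuid_dir in enumerate(["uuid1", "uuid2", "uuid3"]):
        activate_incarnation(base, str(n), uuid_dir)
        if n < 2:
            retire_incarnation(base, uuid_dir)
    active_link = os.readlink(os.path.join(base, ".meta"))
    listing = sorted(os.listdir(base))

print(active_link)
print(listing)
```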

Metadata Updates

Subvolume snapshot operations need to update subvolume metadata. The subvolume that a snapshot belongs to could have been created at an earlier incarnation than the current one. So how does a snapshot operation map a snapshot to its subvolume incarnation? Remember that snapshots include the active subvolume metadata: the subvolume metadata in a snapshot points to the incarnation during which the snapshot was created, and hence to the incarnation of the subvolume itself. That is, the subvolume metadata in a snapshot is a wormhole to its subvolume incarnation. This simplifies snapshot operations, especially cloning, which previously had to store the incarnation-id in the clone subvolume metadata (cf. Clone Metadata).
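One way to picture the mapping: since a snapshot captures the `.meta` link as it was when the snapshot was taken, reading that link inside the snapshot yields the incarnation. The sketch below simulates a snapshot as an ordinary local directory; the function name and the parsing of the link target are assumptions, not the actual implementation.

```python
import os
import tempfile

META_LINK = ".meta"
META_PREFIX = ".meta."


def snapshot_incarnation(snapshot_root: str) -> str:
    """Return the incarnation-id recorded by the .meta link inside a snapshot."""
    target = os.readlink(os.path.join(snapshot_root, META_LINK))
    name = os.path.basename(target)
    if not name.startswith(META_PREFIX):
        raise ValueError(f"unexpected metadata link target: {target!r}")
    return name[len(META_PREFIX):]


# Simulated snapshot "snap1", taken while incarnation 1 was active.
with tempfile.TemporaryDirectory() as root:
    snap_root = os.path.join(root, "sub0", ".snap", "snap1")
    os.makedirs(snap_root)
    os.symlink("./.meta.1", os.path.join(snap_root, META_LINK))
    found = snapshot_incarnation(snap_root)

print(found)
```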

Metadata Garbage Collection

The issue of garbage collecting subvolume metadata remains. Since snapshots are directly addressable, there is no way of knowing when all snapshots for a subvolume incarnation have been deleted. To mitigate this, the subvolume metadata needs to keep track of the active snapshot count. A snapshot delete operation adjusts the active snapshot count for its respective subvolume and prunes the metadata when no snapshots exist. Note that the metadata purge should only be performed when the snapshot being deleted is not part of the currently active subvolume.
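The counting rule can be modelled in memory as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, and pruning metadata immediately when a subvolume with no snapshots is removed is an assumption the proposal does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SubvolumeModel:
    active: Optional[str] = None  # incarnation-id of the active subvolume
    # incarnation-id -> active snapshot count; a key exists while metadata exists
    snap_count: Dict[str, int] = field(default_factory=dict)
    # snapshot name -> incarnation-id it was taken under
    snap_owner: Dict[str, str] = field(default_factory=dict)

    def create(self, incarnation_id: str) -> None:
        self.active = incarnation_id
        self.snap_count[incarnation_id] = 0

    def remove(self) -> None:
        """Delete the active subvolume; keep metadata if snapshots refer to it."""
        retired, self.active = self.active, None
        if self.snap_count[retired] == 0:
            del self.snap_count[retired]  # assumption: nothing left to reference it

    def snapshot(self, name: str) -> None:
        self.snap_owner[name] = self.active
        self.snap_count[self.active] += 1

    def delete_snapshot(self, name: str) -> None:
        owner = self.snap_owner.pop(name)
        self.snap_count[owner] -= 1
        # Purge metadata only for incarnations that are no longer active.
        if self.snap_count[owner] == 0 and owner != self.active:
            del self.snap_count[owner]


model = SubvolumeModel()
model.create("0")
model.snapshot("a")
model.remove()               # incarnation 0 retired, metadata kept for "a"
model.create("1")
model.snapshot("b")
model.delete_snapshot("b")   # count for 1 drops to zero, but 1 is active
after_b = sorted(model.snap_count)
model.delete_snapshot("a")   # last snapshot of retired incarnation 0
after_a = sorted(model.snap_count)
print(after_b, after_a)
```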

Incarnation-Id Choices

The incarnation-id assigned to a subvolume should be distinct each time the subvolume is created. Choices here range from distinct strings and integer values to UUIDs. Implementations can choose whatever fits the system well. UUIDs are perhaps the best choice for almost all implementations (with the added benefit of reusing the UUID component from the subvolume path).
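If UUIDs are chosen, generating an incarnation-id is a one-liner with Python's standard library. The function names below are hypothetical, and the choice of random (version 4) UUIDs is an assumption made for illustration.

```python
import uuid


def new_incarnation_id() -> str:
    """Return a fresh incarnation-id; random UUIDs make collisions negligible."""
    return str(uuid.uuid4())


def meta_file_name(incarnation_id: str) -> str:
    """Metadata file name for an incarnation, following the .meta.<id> layout."""
    return f".meta.{incarnation_id}"


first = new_incarnation_id()
second = new_incarnation_id()
print(meta_file_name(first))
print(first != second)
```

The same value could double as the UUID directory name in the subvolume path, which is the reuse benefit mentioned above.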

Actions #3

Updated by Shyamsundar Ranganathan almost 4 years ago

  • Status changed from New to In Progress
  • Pull request ID set to 35647
Actions #4

Updated by Patrick Donnelly over 3 years ago

  • Status changed from In Progress to Pending Backport
Actions #5

Updated by Patrick Donnelly over 3 years ago

  • Copied to Backport #46820: octopus: pybind/mgr/volumes: Add the ability to keep snapshots of subvolumes independent of the source subvolume added
Actions #6

Updated by Patrick Donnelly over 3 years ago

  • Copied to Backport #46821: nautilus: pybind/mgr/volumes: Add the ability to keep snapshots of subvolumes independent of the source subvolume added
Actions #7

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
