Feature #40285

mds: support hierarchical layout transformations on files

Added by Patrick Donnelly 2 months ago. Updated about 2 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

The main goal of this feature is to support moving whole trees to cheaper storage hardware. This can be done manually by the client by opening a file, setting a file layout to use another lower-cost data pool, and then copying the file data over. This is laborious and presents consistency issues if the file is opened in parallel by another application.

So the idea of this feature is to support an extended attribute like:

    ceph.[dir|file].migrate.<layout>
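
As a rough illustration of how a client might request a migration, here is a hedged Python sketch. Note the `ceph.[dir|file].migrate.*` xattr namespace is only a proposal from this ticket; the `pool` field name and the `request_migration` helper are hypothetical:

```python
import os

def migrate_xattr(kind: str, field: str) -> str:
    """Build the proposed migration xattr name, e.g. ceph.dir.migrate.pool.

    The ceph.[dir|file].migrate.<layout> namespace is proposed in this
    ticket; it does not exist in current Ceph releases.
    """
    assert kind in ("dir", "file")
    return f"ceph.{kind}.migrate.{field}"

def request_migration(path: str, kind: str, pool: str) -> None:
    # Would ask the MDS to migrate `path` to `pool`, once implemented.
    os.setxattr(path, migrate_xattr(kind, "pool"), pool.encode())
```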

This would update the inode for the dir/file with the new migration layout along with some epoch (globally incremented). The MDS would then insert this inode into a migration queue similar to the existing purge queue.
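
A minimal in-memory sketch of an epoch-tagged queue entry (names like `MigrationEntry` and `enqueue_migration` are invented for illustration; a real MDS would persist this state in the journal):

```python
import itertools
from collections import deque
from dataclasses import dataclass

# Globally incrementing epoch, as the description suggests.
_epoch = itertools.count(1)

@dataclass
class MigrationEntry:
    ino: int          # inode number of the dir/file to migrate
    new_layout: str   # e.g. the target data pool
    epoch: int        # snapshot of the global epoch at enqueue time

migration_queue: deque = deque()

def enqueue_migration(ino: int, new_layout: str) -> MigrationEntry:
    """Stamp the new migration layout with the next epoch and queue it,
    mirroring how the purge queue accepts work items."""
    entry = MigrationEntry(ino, new_layout, next(_epoch))
    migration_queue.append(entry)
    return entry
```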

The MDS would be responsible for pulling inodes off of the migration queue and updating the actual layout. For directories, this means queueing the directory's dentries (inodes) for migration as well. For files, the MDS needs to obtain all locks on the inode's contents and perform the actual migration. This operation prevents other clients from using the file until the migration completes. Once the migration finishes, the inode's file layout can be updated.
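
The queue-draining behavior described above, with directories fanning out into their dentries, might look roughly like this toy Python model (the `Inode` class and pool names are illustrative, not MDS code):

```python
from collections import deque

class Inode:
    """Toy in-memory namespace node: directories map dentry names to
    child inodes; files carry a data-pool layout."""
    def __init__(self, is_dir: bool = False):
        self.is_dir = is_dir
        self.children = {}          # dentry name -> Inode (dirs only)
        self.layout = "default_pool"

def drain_migration_queue(queue: deque, target_pool: str) -> None:
    while queue:
        inode = queue.popleft()
        if inode.is_dir:
            # Directories fan out: queue every dentry's inode instead of
            # moving any data themselves.
            queue.extend(inode.children.values())
        else:
            # A real MDS would take all locks on the inode here, copy the
            # object data into the target pool, then update the layout.
            inode.layout = target_pool
```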

Tricky implementation points:

  • How to respond to migration layout updates on an inode while a migration is in progress? This is part of the reason for the epoch: it allows us to ignore older migration layout updates (e.g. for hard-linked files present in multiple directories). I think one reasonable approach is to keep re-inserting the inode at the back of the migration queue until the current migration completes.
  • How to improve performance: the migration process can be trivially multithreaded, but we need to consider the ramifications of the MDS using many more cores.
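
The epoch and re-insertion ideas from the first bullet could be sketched like so (again a hypothetical model, not actual MDS logic; `InodeState` and `process_entry` are invented names):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class InodeState:
    ino: int
    migrate_epoch: int = 0   # epoch of the newest migration request seen
    migrating: bool = False  # a migration is currently in flight

def process_entry(queue: deque, inode: InodeState, entry_epoch: int) -> str:
    """Drop entries older than the inode's current epoch, and push an
    entry back to the tail of the queue while an earlier migration on
    the same inode is still running."""
    if entry_epoch < inode.migrate_epoch:
        return "dropped-stale"         # e.g. an older hard-link update
    if inode.migrating:
        queue.append((inode, entry_epoch))
        return "requeued"              # retry once the in-flight one ends
    inode.migrating = True
    # ...take locks, copy data, update the layout, clear `migrating`...
    return "migrating"
```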

History

#1 Updated by Patrick Donnelly 2 months ago

  • Description updated (diff)

#2 Updated by Robert LeBlanc about 2 months ago

Patrick Donnelly wrote:

[...]

How about combining this with tiering, so that the data movement is done by the tiering code? It could even be automated.
1. Admin creates destination pool.
2. Admin executes some command `ceph fs migrate old_pool new_pool`.
3. Ceph sets up the old tier as a cache layer then adds new_pool as a lower layer.
4. Ceph sets up an overlay so the new_pool references the old_pool.
5. Ceph evicts the cache tier.
6. The MDS walks the metadata updating the pool location to the new pool.
7. Once the eviction has completed and the metadata has been updated, remove the overlay and the old_pool.
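
Assuming steps 3-7 map onto the existing RADOS cache-tiering CLI, the sequence might look like the following sketch. The `ceph fs migrate` command of step 2 is hypothetical, exact cache-mode flags vary by release, and the MDS metadata walk (step 6) has no CLI today:

```python
def tier_migration_plan(old_pool: str, new_pool: str) -> list:
    """Return a rough CLI sequence for the steps above (a sketch, not a
    supported procedure)."""
    return [
        # 3. layer old_pool as a cache tier over new_pool
        f"ceph osd tier add {new_pool} {old_pool}",
        f"ceph osd tier cache-mode {old_pool} forward",
        # 4. overlay so I/O aimed at new_pool is served via old_pool
        f"ceph osd tier set-overlay {new_pool} {old_pool}",
        # 5. flush/evict everything down into new_pool
        f"rados -p {old_pool} cache-flush-evict-all",
        # 7. tear down once the MDS metadata walk (step 6) is done
        f"ceph osd tier remove-overlay {new_pool}",
        f"ceph osd tier remove {new_pool} {old_pool}",
        f"ceph osd pool rm {old_pool} {old_pool} --yes-i-really-really-mean-it",
    ]
```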

There is probably some more housekeeping to be done, but this approach seems to reuse machinery that is already available, reduces the load on the MDS, and may address some of the locking concerns brought up previously.

I don't have time to look in the code at the moment, but if someone can give me some functions as a starting point, that would be helpful for when I do have some time.
