Project

General

Profile

Feature #24263

client/mds: create a merkle tree of objects to allow efficient generation of differences between files

Added by Patrick Donnelly 7 months ago. Updated 6 months ago.

Status:
New
Priority:
High
Assignee:
-
Category:
-
Target version:
Start date:
05/23/2018
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
Client, Common/Protocol, MDS, kceph
Labels (FS):
task(hard), task(intern)
Pull request ID:

Description

Idea is that the collection of objects representing a file would be arranged as a merkle tree. Any write to an object in the file, as part of the same transaction, would also update the SHA1 of that object in one of its xattrs. Following this, the client would update "parent" objects in the merkle tree. This can all be done asynchronously and in multiple atomic steps (in the case of multiple writers).

The MDS should also have a scrub mode that verifies these SHA1s and generates the merkle tree.

The use-case for this feature is to allow the efficient generation of a difference between two version of a file (from a snapshot). The rstats version for directories already allow you to zero-in on the files/directories that have been changed but generating a diff requires a full file read which is expensive, naturally.

History

#1 Updated by John Spray 7 months ago

I'm a fan. Questions that spring to mind:

- Do we apply this to all files, or only large ones based on some heuristic?
- When updating parents in the tree, we're generating lots of small IOPS. Does the Merkle tree metadata have to live in the data pool, or should it be in the metadata pool?
- As we create more interesting data layouts, maybe now is the time to implement layout-changing (+associated rewrites) on the MDS, so that choices of layout are not set in stone at file create time.

#2 Updated by Patrick Donnelly 7 months ago

John Spray wrote:

I'm a fan. Questions that spring to mind:

- Do we apply this to all files, or only large ones based on some heuristic?

All files I think. Small files (1 object) would just be adding an extra setxattr (the hash) for each write on the object. Is there a downside I'm not considering?

- When updating parents in the tree, we're generating lots of small IOPS. Does the Merkle tree metadata have to live in the data pool, or should it be in the metadata pool?

I'd be concerned about putting the MDS in the file I/O path as all writes would require a metadata update (in batch). Maybe it's not that significant since we already do that for the file size? The other issue is that inodes would now have unbounded size with the number of file data objects.

#3 Updated by John Spray 7 months ago

Patrick Donnelly wrote:

John Spray wrote:

I'm a fan. Questions that spring to mind:

- Do we apply this to all files, or only large ones based on some heuristic?

All files I think. Small files (1 object) would just be adding an extra setxattr (the hash) for each write on the object. Is there a downside I'm not considering?

The setxattr isn't necessarily cheap, depending on OSD implementation. We could have this hash anyway, and treat it as a bonus data integrity feature, or we could define the disk format such that an empty hash field just means "I'm a single object file, so far".

- When updating parents in the tree, we're generating lots of small IOPS. Does the Merkle tree metadata have to live in the data pool, or should it be in the metadata pool?

I'd be concerned about putting the MDS in the file I/O path as all writes would require a metadata update (in batch). Maybe it's not that significant since we already do that for the file size? The other issue is that inodes would now have unbounded size with the number of file data objects.

Doesn't have to be the MDS/inode structure that stores it -- perhaps not even need the existing metadata pool. We should think about introducing a second metadata pool for the metadata about file content (as opposed to the POSIX metadata about the filesystem).

The counter-argument to a separate pool would be that if people have SSDs, they can use them as bluestore db devices, so that the xattr writes to the data pool are all going to SSDs anyway. That's kind of an aggressive assumption about how fast/well-configured OSDs are going to be though. We can avoid the assumption by giving people the choice of using separate pool or not, unless it makes the implementation super hard (losing atomicity of xattrs vs data writes).

#4 Updated by Sage Weil 7 months ago

A few questions:

- What is the sha1 of? The object's content? That isn't necessarily known (e.g. 4 MB object where we overwrite a few random bytes)

- (how) do we guarantee that we upgrade the parents in the merkle tree after an overwrite? On a large file we'll have log(n) of them to update, and if we don't update them, we'll lose info about the change, right? Or do these get rebuilt in the F cap isn't flushed safely (and we have to complete the attr updates before we complete our flush and return the cap)?

- If the end goal is an efficient diff between snapshot versions, could we achieve something similar with a simple bitmap (bit per object) or bloom filter? It could be flushed to the MDS along with the F cap and bitwise-or'd from all clients, then stored efficiently in the inode. We could arbitrarily cap the size (say, 32 bytes = 256 bits) and just degrade to a coarse resolution on large files.

#5 Updated by Patrick Donnelly 6 months ago

John Spray wrote:

Patrick Donnelly wrote:

John Spray wrote:

I'm a fan. Questions that spring to mind:

- Do we apply this to all files, or only large ones based on some heuristic?

All files I think. Small files (1 object) would just be adding an extra setxattr (the hash) for each write on the object. Is there a downside I'm not considering?

The setxattr isn't necessarily cheap, depending on OSD implementation. We could have this hash anyway, and treat it as a bonus data integrity feature, or we could define the disk format such that an empty hash field just means "I'm a single object file, so far".

- When updating parents in the tree, we're generating lots of small IOPS. Does the Merkle tree metadata have to live in the data pool, or should it be in the metadata pool?

I'd be concerned about putting the MDS in the file I/O path as all writes would require a metadata update (in batch). Maybe it's not that significant since we already do that for the file size? The other issue is that inodes would now have unbounded size with the number of file data objects.

Doesn't have to be the MDS/inode structure that stores it -- perhaps not even need the existing metadata pool. We should think about introducing a second metadata pool for the metadata about file content (as opposed to the POSIX metadata about the filesystem).

The counter-argument to a separate pool would be that if people have SSDs, they can use them as bluestore db devices, so that the xattr writes to the data pool are all going to SSDs anyway. That's kind of an aggressive assumption about how fast/well-configured OSDs are going to be though. We can avoid the assumption by giving people the choice of using separate pool or not, unless it makes the implementation super hard (losing atomicity of xattrs vs data writes).

Another possibility is a RADOS namespace in the metadata pool that clients have access to. However, a negative aspect of using another pool to manage metadata about file content is that snapshots on the data pool don't capture this metadata. This wouldn't probably make the implementation significantly complex.

#6 Updated by Patrick Donnelly 6 months ago

Sage Weil wrote:

A few questions:

- What is the sha1 of? The object's content? That isn't necessarily known (e.g. 4 MB object where we overwrite a few random bytes)

I think this is doable and cheap with the help of the OSD via an object class? Or do the objectstore backends not necessarily do a read of (small) objects before doing a write?

- (how) do we guarantee that we upgrade the parents in the merkle tree after an overwrite? On a large file we'll have log(n) of them to update, and if we don't update them, we'll lose info about the change, right? Or do these get rebuilt in the F cap isn't flushed safely (and we have to complete the attr updates before we complete our flush and return the cap)?

Probably the MDS would need to clean up after F caps aren't flushed safely, yes. I'll need to think about this more.

- If the end goal is an efficient diff between snapshot versions, could we achieve something similar with a simple bitmap (bit per object) or bloom filter? It could be flushed to the MDS along with the F cap and bitwise-or'd from all clients, then stored efficiently in the inode. We could arbitrarily cap the size (say, 32 bytes = 256 bits) and just degrade to a coarse resolution on large files.

I might be missing something but wouldn't you need a bitmap for each inode and for each snapshot?

Another thought: it seems this feature would go well with byte-range write locks to help bound the cleanup effort for the MDS.

Also available in: Atom PDF