Feature #46690

Add fscrypt support to the kernel CephFS client

Added by Luis Henriques about 1 year ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:

Description

As per the documentation, fscrypt is a (kernel) "library which filesystems can hook into to support transparent encryption of files and directories". It allows users to transparently encrypt files and directories: a user can simply set the key on a directory and all files within that directory will be encrypted. Note that this means that only the file data will actually be encrypted; the only metadata that is encrypted is the filename, and everything else (timestamps, file size, extended attributes, etc.) is visible to other users as long as they have permission to access it.

So far, only local filesystems support it (ext4, f2fs and ubifs), but it looks like there's nothing preventing CephFS from supporting it.

History

#1 Updated by Luis Henriques about 1 year ago

At this point there are already a few open points requiring some investigation, namely:

  • How to handle encrypted filenames, which may contain illegal characters ('\0' and '/')
    This may require changes to the MDS:
    - Either the MDS handles these characters itself. But how would clients not supporting fscrypt handle them? Or
    - Have the filenames sent to the MDS already base64-encoded. The problem here is that in this case we may have filenames > NAME_MAX. The MDS doesn't seem to care (this needs confirmation), but are other clients able to handle it gracefully?
  • Have an xattr containing the fscrypt context (which contains data such as a randomly generated nonce) sent in cleartext to the MDS
    I'm not sure how sensitive the data stored in this xattr is. Having it sent in cleartext isn't desirable, but is it really critical from a cryptographic point of view? The fscrypt context doesn't really need to be stored as an extended attribute, but what would the alternatives be, and would those be safer?
  • Setting the crypt xattr needs to be atomic with checking if a dir is empty
    Setting a directory as encrypted can only be done if that directory is empty. Thus, when a user sets a directory as encrypted we need to atomically check if it's empty and, if it is, set the xattr. The easiest solution would be to offload this to the MDS, as it's done with setting the quotas and the file/dir layouts. But it should also be doable with caps on the client.

NOTE: for obvious reasons, it would be great if we could handle everything on the client side without involving the MDS. Even if it involves some extra complexity, it may be worth the trouble so that we don't need to add things like feature bits and all the required paraphernalia.

#2 Updated by Patrick Donnelly about 1 year ago

  • Tracker changed from Bug to Feature

Luis Henriques wrote:

At this point there are already a few open points requiring some investigation, namely:

  • How to handle encrypted filenames, which may contain illegal characters ('\0' and '/')
    This may require changes to the MDS:
    - Either the MDS handles these characters itself. But how would clients not supporting fscrypt handle them? Or
    - Have the filenames sent to the MDS already base64-encoded. The problem here is that in this case we may have filenames > NAME_MAX. The MDS doesn't seem to care (this needs confirmation), but are other clients able to handle it gracefully?

Jeff and I just discussed this. We believe the right way to do this is to have the client use base64 for the file names sent to the MDS but allow them to be greater than NAME_MAX. The MDS doesn't care about the larger file names but other clients will. The MDS can add a feature bit for long file names so that clients which don't understand the long file names will receive a truncated file name from readdir (anything else?). Lookup/path traversal using the truncated file names will always be valid with or without the feature bit. It's assumed that the encrypted file name is a UUID so there shouldn't be collisions. Because clients already reject the creation of file names larger than NAME_MAX, encrypted file names should be the only scenario where file names are > NAME_MAX.

If the client has the feature bit set but doesn't have the encryption keys, it just passes the truncated file name to the kernel/userspace so that the userspace tools continue to work. In particular, you can rm -r an encrypted tree without the keys.

#3 Updated by Jeff Layton about 1 year ago

Patrick Donnelly wrote:

If the client has the feature bit set but doesn't have the encryption keys, it just passes the truncated file name to the kernel/userspace so that the userspace tools continue to work. In particular, you can rm -r an encrypted tree without the keys.

We can't use truncated names because those could easily collide. What we'd probably want to do is have the MDS use the fscrypt_nokey_name with those clients. From fs/crypto/fname.c in the kernel sources:

 * To meet all these requirements, we base64-encode the following
 * variable-length structure.  It contains the dirhash, or 0's if the filesystem
 * didn't provide one; up to 149 bytes of the ciphertext name; and for
 * ciphertexts longer than 149 bytes, also the SHA-256 of the remaining bytes.
 *
 * This ensures that each no-key name contains everything needed to find the
 * directory entry again, contains only legal characters, doesn't exceed
 * NAME_MAX, is unambiguous unless there's a SHA-256 collision, and that we only
 * take the performance hit of SHA-256 on very long filenames (which are rare).
 */
struct fscrypt_nokey_name {
        u32 dirhash[2];
        u8 bytes[149];
        u8 sha256[SHA256_DIGEST_SIZE];
}; /* 189 bytes => 252 bytes base64-encoded, which is <= NAME_MAX (255) */

We'd have to reimplement that code in the MDS, but that should be relatively easy.
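
The size arithmetic in the quoted comment can be checked with a small userspace sketch. It assumes unpadded base64 (4 output characters per 3 input bytes, rounded up) and mirrors the struct layout quoted above. The names `nokey_name`, `b64_unpadded_len` and `nokey_encoded_len` are illustrative only; they are not kernel or MDS functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHA256_DIGEST_SIZE 32
#define NOKEY_MAX_CTEXT    149  /* ciphertext bytes kept verbatim */
#define NAME_MAX_BYTES     255  /* stand-in for NAME_MAX */

/* Userspace mirror of the struct quoted above (illustrative only). */
struct nokey_name {
    uint32_t dirhash[2];
    uint8_t bytes[NOKEY_MAX_CTEXT];
    uint8_t sha256[SHA256_DIGEST_SIZE];
};

/* Assumption: unpadded base64, i.e. ceil(4 * n / 3) output characters. */
static size_t b64_unpadded_len(size_t n)
{
    return (4 * n + 2) / 3;
}

/* Encoded length of the no-key name for a ciphertext of ctext_len bytes. */
static size_t nokey_encoded_len(size_t ctext_len)
{
    size_t raw;
    size_t enc;

    if (ctext_len <= NOKEY_MAX_CTEXT)
        raw = offsetof(struct nokey_name, bytes) + ctext_len;
    else /* 149 bytes verbatim plus the SHA-256 of the remainder */
        raw = offsetof(struct nokey_name, sha256) + SHA256_DIGEST_SIZE;

    enc = b64_unpadded_len(raw);
    assert(enc <= NAME_MAX_BYTES);
    return enc;
}
```

Under these assumptions, any ciphertext longer than 149 bytes encodes to 252 characters, which matches the figure in the quoted comment.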

#4 Updated by Jeff Layton about 1 year ago

One problem that the existing fscrypt-enabled filesystems don't need to contend with, though, is subtree mounts. With ceph we can mount at some point down in the tree rather than at the root of the fs, but that's much less common with local filesystems.

Most of the existing tooling seems to be set up to mount first and then just make sure the key is instantiated before you do any activity in the encrypted dirs.

We could require that people mount the ciphertext path, as that should (presumably) work. Another idea might be to just ban subtree mounts on volumes with encryption enabled.

#5 Updated by Luis Henriques about 1 year ago

Patrick Donnelly wrote:

[...]

If the client has the feature bit set but doesn't have the encryption keys, it just passes the truncated file name to the kernel/userspace so that the userspace tools continue to work. In particular, you can rm -r an encrypted tree without the keys.

The MDS needs to always send the full filename to the client if that client has support for fscrypt. If it does support fscrypt but the key isn't loaded (yet), the client will handle the filename truncation locally. If the key is loaded later on, the client is already able to decrypt the real filename.

Jeff Layton wrote:

[...]

We'd have to reimplement that code in the MDS, but that should be relatively easy.

So, the MDS will always receive the full filename base64 encoded, which will need to be decoded to build the truncated name for clients that don't support file encryption. Which means that the MDS will need to:

  • know if a specific file is encrypted or not
  • know if a client has support for encrypted files, and
  • handle 2 different filenames for each (encrypted) file.

Won't this require deeper changes in file lookups on the MDS side?

#6 Updated by Jeff Layton about 1 year ago

Luis Henriques wrote:

The MDS needs to always send the full filename to the client if that client has support for fscrypt. If it does support fscrypt but the key isn't loaded (yet), the client will handle the filename truncation locally. If the key is loaded later on, the client is already able to decrypt the real filename.

There are 3 different cases we need to handle:
- client that has the decryption key
- client that does not have the decryption key
- legacy clients (or those that do not support fscrypt)

The MDS will need to send full filenames to the first two, and they can be taught to send the full filenames to the MDS. Legacy clients should be sent fscrypt_nokey_names. That is more work on the MDS, but it should allow legacy clients to operate in encrypted directories in more or less the same way that new clients w/o the key would.

So, the MDS will always receive the full filename base64 encoded, which will need to be decoded to build the truncated name for clients that don't support file encryption. Which means that the MDS will need to:

  • know if a specific file is encrypted or not

It will already need to track this. Encrypted files have an attached context (usually stored in a hidden xattr).

  • know if a client has support for encrypted files, and

We'll need a feature bit for this.

  • handle 2 different filenames for each (encrypted) file.

Won't this require deeper changes in file lookups on the MDS side?

The only real change we'd need is for legacy clients sending long filenames (>149 bytes). In that case, you might have to walk the whole omap for a directory to find the file. I don't think we'd want to store two versions of each name. It's probably best to just take the performance hit in the (rare) case of long filenames.
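
The three client cases above reduce to a small decision table. The following is a minimal sketch of that logic only; `client_kind`, `name_form`, `mds_name_form` and `lookup_may_scan_dir` are hypothetical names for illustration and do not correspond to existing MDS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NOKEY_MAX_CTEXT 149 /* from struct fscrypt_nokey_name */

enum client_kind { CLIENT_HAS_KEY, CLIENT_NO_KEY, CLIENT_LEGACY };
enum name_form { NAME_FULL, NAME_NOKEY };

/* Which form of an encrypted dentry name the MDS would send to a client. */
static enum name_form mds_name_form(enum client_kind kind)
{
    switch (kind) {
    case CLIENT_HAS_KEY:
    case CLIENT_NO_KEY:
        return NAME_FULL;   /* fscrypt-aware: client shortens locally */
    case CLIENT_LEGACY:
        return NAME_NOKEY;  /* MDS builds the no-key name itself */
    }
    assert(!"unknown client kind");
    return NAME_NOKEY;
}

/*
 * A legacy client looking up a long name only knows 149 ciphertext bytes
 * plus a hash, so the MDS may have to walk the directory to find a match.
 */
static bool lookup_may_scan_dir(enum client_kind kind, size_t ctext_len)
{
    return kind == CLIENT_LEGACY && ctext_len > NOKEY_MAX_CTEXT;
}
```

Only the last case pays the directory-walk cost, which is consistent with storing a single version of each name.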

#7 Updated by Jeff Layton about 1 year ago

  • Assignee set to Jeff Layton

Writeup I did for people asking about this internally. It's a pretty broad (but not deep) overview:

CEPH + FSCRYPT PROJECT
======================

Goal
----
The idea is to add fscrypt support to cephfs. The primary use-case is
cloud providers that hand out subtrees of a cephfs to tenants. Those
tenants can then encrypt their data without having to give a key to the
cloud provider.

Overview
--------
fscrypt is infrastructure in the Linux kernel that filesystems can use
to encrypt both file names and data before writing this info to the
backing store. The ciphers used do not change the length of the data.

When the key is installed in the kernel the filesystem works seamlessly
(with some minor limitations).

Without a key, filenames appear scrambled and file contents are
inaccessible. Most operations in this state will get back ENOKEY as an
error, but you are able to traverse an encrypted tree and unlink
dentries (assuming you have the necessary permissions).

It's important to note that inode metadata is _not_ encrypted, so (for
instance) a stat() call will return the same results regardless of
whether the key is present or not.

This is considered a significant weakness by crypto experts, as it may
be possible to guess things like filenames or contents based on metadata
and location in the tree, which could give an attacker enough
information to crack the keys.

Only ext4, f2fs, and ubifs support fscrypt so far. These are all local
filesystems, so we expect that adding this to ceph may require some work
in the core fscrypt infrastructure.

Encrypting File Contents
------------------------
Encrypting the file data is fairly straightforward. Data is stored in
the pagecache unencrypted, so it must be encrypted before writing to
the server and decrypted after a read. Read decryption can be done in
place, but write encryption requires a bounce buffer.

This is somewhat more complex with cephfs than with existing
filesystems, as we have different data paths when FILE_BUFFER and
FILE_CACHE caps are present and not, and for O_DIRECT.

Filename Encryption
-------------------
Filenames are also encrypted, but that poses a problem. NUL ('\0') and
'/' are illegal in filename components, but the ciphertext version of
the filename may contain those characters as well as other non-printable
characters.

To handle this, filenames are generally presented as base64-encoded to
userland, but those names can be longer than the original and that can
run afoul of the NAME_MAX (255 byte) limit on filename components. The
solution fscrypt uses is to base64-encode a structure that holds up to
149 bytes of the ciphertext name. For longer names, the structure also
holds the SHA-256 hash of the remaining bytes, so the encoded result
never exceeds NAME_MAX. This is stored in struct fscrypt_nokey_name.

The MDS will need to store full filenames. We can either teach the MDS
to store the (binary) encrypted names (NULs and /'s and all), or we
could have it store encoded versions. Storing the encoded names is
probably the simplest and least dangerous way to do this. The MDS
doesn't have any inherent limit at NAME_MAX. Filenames are just OMAP
keys, so long names are not a problem there.

Clients that support fscrypt would always be sent and would always send
the long-form versions of the names (this may mean that we'd need to
store two different versions of the name in the dentry on the client to
handle lookups when the names are very long).

Clients that do not support fscrypt would be sent the
fscrypt_nokey_name.  The MDS will therefore need to be able to satisfy
lookups of these names for those clients.

We will need a cephfs feature bit for this so the MDS knows what sort of
client it's dealing with, and so that it can properly reject operations
on such files from legacy clients.

Per-file Encryption Context
---------------------------
Each inode has an attached encryption context. That contains info about
how the thing is encrypted and a randomly generated nonce. On existing
fscrypt-enabled fs's, this is usually stored in a hidden xattr.

One of the well-known ioctls will attach a new context to an empty
directory (thereby declaring that subtree as encrypted). Children of
these directories will automatically inherit a similar context.

While we'll need to support that ioctl, we'll probably want to generate
the nonce on the clients and send them along in the create operation
that's sent to the MDS. That should allow this to work in conjunction
with async creates.

This may mean we'll need to extend the create MDS operations (create,
mkdir, etc.), or add new operations in lieu of them.

Key Management
--------------
Key management is primarily handled via ioctl() calls to the kernel.
The main userland tool seems to be this one from Google:

        https://github.com/google/fscrypt

It is not currently packaged for Fedora. It's worth considering how k8s
might manage the keys here.

Limitations
-----------
There are a couple of limitations we need to be concerned about:

- The root of a filesystem cannot be encrypted. Setting up encryption on
  a fs entails adding some files to a hidden directory at the root of
  it.  Once you do that, the root directory has files in it and cannot
  be marked for encryption. My read is that that's fine here, as we'll
  want to mostly use this on a per-subvolume basis.

- A new encryption context cannot be set on a file that already has one.
  This means you cannot nest encrypted trees that use different policies.

Subtree Mounts
--------------
Because all of the existing fscrypt-enabled filesystems are local, they
don't need to contend with subtree mounts (mounting some subdirectory in
the tree other than the root). Subtree mounts are difficult here, as we
may not have the key to walk down the path until after the mount occurs,
as we can't call the ioctl's until then.

The simplest solution is to just not allow mounting directories with
encrypted filenames. IOW, you could mount the top level of an encrypted
subtree (as its parent is not encrypted), but anything below there would
not be mountable.

We may also need to take steps to prevent things like mounting a
(unencrypted) subdir and then trying to enable encryption on the mounted
fs.

#8 Updated by Jeff Layton about 1 year ago

The first step is to figure out how we store the encryption contexts. An xattr seems like an obvious choice but we could use a dedicated field in the inode too. Assuming we go with an xattr, at a high level:

- declare a separate "encryption" xattr namespace in which to store the encryption contexts. We can have newer clients filter those out in listxattr on their own, but we'd need to have the MDS filter them for legacy clients. It should also not be fetchable by legacy clients.

- MDS will need to vet that directories are empty before allowing the 'encryption.context' xattr to be set (special handling for the encryption.context xattr when the client doesn't hold caps).

- add a new feature bit for fscrypt

- ensure that the MDS can store names larger than NAME_MAX

- implement fscrypt callback operations (get_context, set_context, etc.)
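
The listxattr filtering mentioned in the first item could look roughly like the sketch below. It assumes the NUL-separated name list that listxattr returns and the "encryption." namespace prefix proposed above; `filter_xattr_names` is a hypothetical helper, not an existing client or MDS function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a NUL-separated xattr name list from in to out, dropping every
 * name in the (proposed) "encryption." namespace. out must have room
 * for in_len bytes. Returns the number of bytes written to out.
 */
static size_t filter_xattr_names(const char *in, size_t in_len, char *out)
{
    static const char prefix[] = "encryption.";
    size_t written = 0;
    size_t i = 0;

    /* Precondition: the list is empty or ends with a NUL terminator. */
    assert(in_len == 0 || in[in_len - 1] == '\0');

    while (i < in_len) {
        size_t n = strlen(in + i) + 1; /* name plus its terminator */

        if (strncmp(in + i, prefix, sizeof(prefix) - 1) != 0) {
            memcpy(out + written, in + i, n);
            written += n;
        }
        i += n;
    }
    return written;
}
```

Newer clients could apply a filter like this locally, while the MDS would apply the equivalent before replying to legacy clients.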

#9 Updated by Luis Henriques about 1 year ago

Jeff Layton wrote:

The first step is to figure out how we store the encryption contexts. An xattr seems like an obvious choice but we could use a dedicated field in the inode too. Assuming we go with an xattr, at a high level:

- declare a separate "encryption" xattr namespace in which to store the encryption contexts. We can have newer clients filter those out in listxattr on their own, but we'd need to have the MDS filter them for legacy clients. It should also not be fetchable by legacy clients.

Another (obvious) thing that needs to be denied to legacy clients is to read encrypted files. This will require the MDS to only give these clients the caps to read metadata (PIN, AUTH, LINK, XATTR caps), but not to modify or read data (FILE cap).

- MDS will need to vet that directories are empty before allowing the 'encryption.context' xattr to be set (special handling for the encryption.context xattr when the client doesn't hold caps).

- add a new feature bit for fscrypt

- ensure that the MDS can store names larger than NAME_MAX

- implement fscrypt callback operations (get_context, set_context, etc.)

#10 Updated by Jeff Layton about 1 year ago

Ok, I've made some progress on this. I have a patchset that handles encrypting/decrypting filenames, but the contents are still cleartext currently. I've started looking at the content encryption, and that offers up a new and interesting problem:

The content encryption is done via a block cipher where the blocks are FS_CRYPTO_BLOCK_SIZE (16 bytes). We can't deal in blocks that are smaller than that.

Block-based filesystems generally have 2 modes that they operate in -- either buffered I/O that goes through the pagecache or direct I/O which doesn't. Generally, DIO requires a certain alignment anyway, so this limitation is not an issue for them.

cephfs though has a 3rd mode -- when you don't have Fcb caps, then we just write straight through to the server. That's fine in most cases, but if the file is encrypted then we have a problem in this mode. Any I/O not aligned to 16 byte blocks would require a read/modify/write cycle and we can't really do that in a non-racy way.

The upshot is that we may need to return -EINVAL on write calls in this situation, even for "buffered" writes...which is weird because this entirely depends on whether we have caps or not, and that's impossible for userland applications to predict.

Fortunately, the read side shouldn't have any such restriction, but this will need to be carefully documented. For reads we can just extend the read to cover a complete block, and discard whatever we don't end up sending back to userland. It'll probably require some fiddly code, but that should be ok in principle.
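
The read-side widening described above amounts to some offset arithmetic. This is a minimal sketch with the block size as a parameter; `byte_range`, `expand_to_blocks` and `leading_discard` are hypothetical names, not functions from the ceph client.

```c
#include <assert.h>
#include <stdint.h>

struct byte_range {
    uint64_t off;
    uint64_t len;
};

/* Widen [off, off + len) so that it covers only whole blocks of size blk. */
static struct byte_range expand_to_blocks(uint64_t off, uint64_t len,
                                          uint64_t blk)
{
    struct byte_range r = { off, 0 };
    uint64_t start, end, rem;

    assert(blk > 0);
    if (len == 0)
        return r; /* nothing to read, nothing to widen */

    start = off - off % blk; /* round start down */
    end = off + len;
    rem = end % blk;
    if (rem)
        end += blk - rem;    /* round end up */

    r.off = start;
    r.len = end - start;
    return r;
}

/* Bytes to drop from the front of the decrypted buffer before copying out. */
static uint64_t leading_discard(uint64_t off, uint64_t blk)
{
    return off % blk;
}
```

For example, with 16-byte blocks a 20-byte read at offset 5 becomes a 32-byte read at offset 0, and the first 5 decrypted bytes are discarded.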

#11 Updated by Jason Dillaman about 1 year ago

Jeff Layton wrote:

cephfs though has a 3rd mode -- when you don't have Fcb caps, then we just write straight through to the server. That's fine in most cases, but if the file is encrypted then we have a problem in this mode. Any I/O not aligned to 16 byte blocks would require a read/modify/write cycle and we can't really do that in a non-racy way.

Technically you could utilize CMPEXT RADOS ops combined with the WRITE to ensure any unaligned data from a RMW cycle hasn't been updated since you last read it, correct?

#12 Updated by Jeff Layton about 1 year ago

Jason Dillaman wrote:

Technically you could utilize CMPEXT RADOS ops combined with the WRITE to ensure any unaligned data from a RMW cycle hasn't been updated since you last read it, correct?

Good idea, that would actually work. It could be slow in contended cases, but that's probably ok.

Either way, I have to plumb this into several codepaths, and I'll probably start with the pagecache-based ops and DIO. Hopefully by the time I get to that point, the way forward will be clear.
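
The compare-then-write idea can be illustrated with a toy model. The sketch below uses an in-memory buffer as a stand-in for a RADOS object, and `cmpext_write` is a hypothetical analogue of a CMPEXT op paired with a WRITE; it is not the librados or kernel OSD client API. Encryption is omitted: a real client would decrypt the block it read and re-encrypt the updated block before writing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLK 16u /* crypto block size used in this toy model */

/* Hypothetical in-memory stand-in for a RADOS object. */
struct fake_obj {
    uint8_t data[64];
};

/* Write new_blk at off only if the block there still equals expected. */
static bool cmpext_write(struct fake_obj *o, uint64_t off,
                         const uint8_t *expected, const uint8_t *new_blk)
{
    if (memcmp(o->data + off, expected, BLK) != 0)
        return false; /* someone else changed the block: caller retries */
    memcpy(o->data + off, new_blk, BLK);
    return true;
}

/*
 * Overwrite len bytes at pos (all within one block) using a guarded
 * read/modify/write. Returns the number of attempts used, or -1.
 */
static int rmw_partial(struct fake_obj *o, uint64_t pos,
                       const uint8_t *src, uint32_t len, int max_tries)
{
    uint64_t blk_off = pos - pos % BLK;
    uint32_t in_blk = (uint32_t)(pos % BLK);

    assert(in_blk + len <= BLK);
    assert(blk_off + BLK <= sizeof(o->data));

    for (int attempt = 1; attempt <= max_tries; attempt++) {
        uint8_t old[BLK], updated[BLK];

        memcpy(old, o->data + blk_off, BLK);   /* read */
        memcpy(updated, old, BLK);             /* (decrypt would go here) */
        memcpy(updated + in_blk, src, len);    /* modify */
        if (cmpext_write(o, blk_off, old, updated)) /* guarded write */
            return attempt;
    }
    return -1;
}
```

The retry loop is where the slowdown under contention would show up: every failed comparison costs another read and another write attempt.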

#13 Updated by Jeff Layton about 1 year ago

Another potential issue: fscrypt encrypts data by block in the pagecache. The blocksize for ceph is usually set to the size of a stripe chunk (with it set to 4M max object size for default layouts). We'll need to reduce that to PAGE_SIZE for fscrypted inodes.

That part is fairly simple, but we also have to deal with the fact that we will need to have the MDS round up to PAGE_SIZE blocks when tracking the size of the inode. The MDS can truncate off extents that are beyond EOF (à la do_truncate_range), but we need to take care that we always round the size up to the next page boundary.

We may need to track the "real" i_size for crypted inodes in a secondary field in the MDS. Basically, keep the rounded-up i_size in the normal MDS size field and track the real i_size separately.

#14 Updated by Jeff Layton about 1 year ago

A simpler option might be to just have the MDS round up to the next page boundary on a truncate for encrypted inodes, but still store and transmit the i_size as usual. That will mean that the client and MDS will need to agree on blocksize == PAGE_SIZE, but that's probably ok.

Oh, and I was wrong about FS_CRYPTO_BLOCK_SIZE. We need to work in blocks that are the same size that the underlying I/O path uses. For ceph, the blocks are PAGE_SIZE.

#15 Updated by Jeff Layton about 1 year ago

No, I take it back again:

 * @len:       Size of block to decrypt.  Doesn't need to be a multiple of the
 *              fs block size, but must be a multiple of FS_CRYPTO_BLOCK_SIZE.

For the record:

#define FS_CRYPTO_BLOCK_SIZE            16                                                          

So, we'd need to have the MDS round up the i_size to the next 16-byte block when truncating or hole punching.
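
The rounding itself is a one-liner. A minimal sketch, assuming the 16-byte value of FS_CRYPTO_BLOCK_SIZE quoted above; `crypto_round_up` and `crypto_tail_padding` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define CRYPTO_BLOCK_SIZE 16ULL /* FS_CRYPTO_BLOCK_SIZE as quoted above */

/* Round a size up to the next crypto block boundary (power-of-two block). */
static uint64_t crypto_round_up(uint64_t size)
{
    uint64_t padded = (size + CRYPTO_BLOCK_SIZE - 1) &
                      ~(CRYPTO_BLOCK_SIZE - 1);

    assert(padded >= size); /* catches wrap-around near UINT64_MAX */
    return padded;
}

/* Bytes kept on the backing store beyond the real end of file. */
static uint64_t crypto_tail_padding(uint64_t real_size)
{
    return crypto_round_up(real_size) - real_size;
}
```

A truncate to 17 bytes would therefore keep 32 bytes of ciphertext on the backing store, with 15 bytes of padding past the real EOF.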

#16 Updated by Jeff Layton 9 months ago

I've made quite a bit of progress on this, and I'm trying to tackle the data path part now, starting with the non-buffered I/O codepaths. This is the place where we need to potentially do a RMW cycle for writes.

When we do that, we'll end up with an array of pages. We'll need to be able to read in the first and last crypto blocks so that we can populate the parts of the page array that we're not writing to.

We'll need to allocate the same number of pages, regardless, but is it better to do a read over a single extent that covers all of the blocks being written (including those that will be completely replaced), or is it better to just read the extents at the beginning and end of the range?

I'm assuming that we'll want to do the latter...

#17 Updated by Luis Henriques 8 months ago

Jeff Layton wrote:

We'll need to allocate the same number of pages, regardless, but is it better to do a read over a single extent that covers all of the blocks being written (including those that will be completely replaced), or is it better to just read the extents at the beginning and end of the range?

To be honest, the usage of 'extent' in this context is confusing me.

Picking an example: doing a write of 18 bytes across 3 blocks of FS_CRYPTO_BLOCK_SIZE:

0     15     31     47
+------+------+------+
|     *|******|*     |
+------+------+------+

Since the 2nd block will be completely overwritten, only the first and third block need to be read. Or were you referring to objects and object layouts and I totally misunderstood your question?
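
The same reasoning can be written down as a small helper that returns only the partially overwritten blocks. This is a sketch under the assumption of a fixed block size; `extent` and `rmw_read_extents` are hypothetical names, not part of the ceph client.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct extent {
    uint64_t off;
    uint64_t len;
};

/*
 * For a write of len bytes at off, fill out[] with the blocks that are
 * only partially overwritten and so must be read first. Returns 0..2.
 */
static int rmw_read_extents(uint64_t off, uint64_t len, uint64_t blk,
                            struct extent out[2])
{
    uint64_t end, first, last;
    bool head_partial, tail_partial;
    int n = 0;

    assert(blk > 0);
    if (len == 0)
        return 0;

    end = off + len;        /* exclusive end of the write */
    first = off / blk;      /* index of first block touched */
    last = (end - 1) / blk; /* index of last block touched */
    head_partial = (off % blk) != 0;
    tail_partial = (end % blk) != 0;

    if (head_partial)
        out[n++] = (struct extent){ first * blk, blk };
    /* Skip the tail if it is the same block already added as the head. */
    if (tail_partial && (last != first || !head_partial))
        out[n++] = (struct extent){ last * blk, blk };
    return n;
}
```

For the 18-byte write in the diagram (offset 15, 16-byte blocks) this yields the blocks at offsets 0 and 32, and the fully overwritten middle block is never read.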

#18 Updated by Jeff Layton 8 months ago

I meant "extent" as in the extents to be read/written in the OSD call. IOW, the ones added by osd_req_op_extent_osd_data_pages and the like. In practice, I think we'll want to use larger blocks than FS_CRYPTO_BLOCK_SIZE, which is a minimum value. I'm probably going to use 4k blocks, though we may want to make it tunable eventually.

Anyway, that is my question:

When I need to do a non-buffered, non-aligned write, I'll need to do a RMW cycle. Should I batch up a read that has extents covering the first and last crypto blocks that will be partially written, or just read in the whole thing?

It looks like you can add discontiguous extents to read and write operations, and I think they'll probably work. If I do that though, then it makes it a bit harder to share read helpers with the sync read code, but maybe that's ok.
