Feature #1236

libceph: set layout via virtual xattrs (libceph/cfuse)

Added by Greg Farnum almost 13 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Normal
Category: -
% Done: 0%

Description

Apparently there actually is support in cfuse for ioctls, although it's hairy. Given that, we should support them in cfuse so that tools like cephfs will work even if you aren't using the kernel client.

Instead, get and set layouts via virtual xattrs. The idea would be to get/set the whole thing using some simple key/value syntax, or get/set individual fields.

ceph.dir.layout: "stripe_unit = 123, stripe_count = 456, ..." 
ceph.dir.layout.stripe_unit: "123"
ceph.dir.layout.stripe_count: "456"

etc.
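
The key/value syntax above could be handled roughly as follows. This is a minimal sketch, not the parser that was eventually implemented: the helper names (`parse_layout`, `format_layout`, `split_into_field_xattrs`) are hypothetical, and it assumes every field value is an integer, as in the example strings.

```python
def parse_layout(value):
    """Parse 'stripe_unit = 123, stripe_count = 456' into a dict of ints."""
    fields = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got {item!r}")
        fields[key.strip()] = int(raw.strip())
    return fields


def format_layout(fields):
    """Render a dict back into the same comma-separated syntax."""
    return ", ".join(f"{key} = {val}" for key, val in fields.items())


def split_into_field_xattrs(prefix, fields):
    """Expand a whole-layout dict into per-field xattr name/value pairs."""
    return {f"{prefix}.{key}": str(val) for key, val in fields.items()}
```

Calling `split_into_field_xattrs("ceph.dir.layout", parse_layout(...))` on the whole-layout string yields the individual `ceph.dir.layout.stripe_unit`-style entries listed above.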

The goal is then to support the same set of attributes on both the fuse and kernel clients. The client can either interpret the semantics of these attributes, or use a generic interface to communicate virtual xattrs to the MDS and implement it there once.

History

#1 Updated by Sage Weil over 12 years ago

  • Target version set to 12

#2 Updated by Sage Weil over 12 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from Support ioctls in cfuse to libceph: set layout via virtual xattrs (libceph/cfuse)

#3 Updated by Sage Weil over 12 years ago

  • Target version deleted (12)

#4 Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (11)

#5 Updated by Sage Weil over 11 years ago

  • Position set to 23

#6 Updated by Sage Weil over 11 years ago

  • Position deleted (23)
  • Position set to 25

#7 Updated by Sage Weil over 11 years ago

  • Position deleted (25)
  • Position set to 24

#8 Updated by Sage Weil about 11 years ago

  • Description updated (diff)
  • Status changed from New to 12

#9 Updated by Greg Farnum about 11 years ago

We're still thinking through the implications of the best way to implement this. Nonetheless there are people using hacked-together versions of it, so:
1) Design a system to do this and convince the team that it's maintainable and efficient
2) Implement that system.

My inclination right now is that the MDS should take responsibility for the parsing, and it should generate the fields when clients request the Inode. Modern clients should pass write requests to the ceph.* xattr namespace to the MDS synchronously; old ones will have to be content with not behaving quite right. But we could also do double-parsing or something.

#10 Updated by Greg Farnum about 11 years ago

  • Position deleted (36)
  • Position set to 1

#11 Updated by Sage Weil about 11 years ago

  • Story points set to 8
  • Position deleted (1)
  • Position set to 1

#12 Updated by Sage Weil about 11 years ago

  • Target version set to v0.57c
  • Position deleted (1)
  • Position set to 1

#13 Updated by Sage Weil about 11 years ago

  • Assignee set to Sage Weil

#14 Updated by Sage Weil about 11 years ago

Translating any ceph.* setxattrs into a sync setxattr and handling it on the MDS seems like an easy win. I can't think of any reason to do anything on the client, unless it is a client-specific behavior. And doing this now doesn't preclude that. For instance, setting ceph.file.layout will go to the mds as a SETXATTR instead of being translated to a SETLAYOUT by the client (adding no value), but maybe setting ceph.do_magic_thing will be captured by the client at some future date for some other purpose. This path gives us the simplicity of one implementation when we can, and flexibility to do something more complex later, if we so choose.
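
A rough model of that setxattr routing, for illustration only. `FakeMDS` and `Client` are hypothetical stand-ins, not the real client classes, and the "buffered" path for ordinary xattrs is an assumption about how non-ceph.* names are handled.

```python
class FakeMDS:
    """Stand-in for the metadata server; records the requests it receives."""

    def __init__(self):
        self.requests = []

    def setxattr(self, ino, name, value):
        self.requests.append(("SETXATTR", ino, name, value))


class Client:
    def __init__(self, mds, client_only=()):
        self.mds = mds
        self.client_only = set(client_only)  # names the client captures itself
        self.local_state = {}
        self.dirty_xattrs = {}  # ordinary xattrs, flushed later (assumption)

    def setxattr(self, ino, name, value):
        if name in self.client_only:
            # Client-specific behavior, never sent to the MDS.
            self.local_state[(ino, name)] = value
            return "client"
        if name.startswith("ceph."):
            # Generic vxattr: pass through as a synchronous SETXATTR.
            self.mds.setxattr(ino, name, value)
            return "mds-sync"
        self.dirty_xattrs[(ino, name)] = value
        return "buffered"
```

With `client_only={"ceph.do_magic_thing"}`, setting `ceph.file.layout` reaches the MDS unchanged while `ceph.do_magic_thing` stays on the client, which is the flexibility described above.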

listxattr and getxattr on the generic vxattrs is the tricky part. I see three options:

1- A naive implementation would generate those at the MDS too, but would end up shipping a huge map<string,bufferptr> for every inode from mds->client, a huge waste.
2- All getxattr could be implemented client-side, both in libcephfs and the kernel.
3- The client session open would include a list of which vxattrs the MDS defines, for files and dirs. These would be included in all listxattr responses (added into the results by the client). getxattr on those xattrs would be a synchronous request to the MDS.

As with setxattr, option #3 does not preclude also implementing certain vxattr semantics client side. For example, if the client wants to define ceph.inode_lru_position or something specific to the client implementation, it can do that. Or, if it wants to generate xattr content for select generic vxattrs (like ceph.{file,dir}.layout.*) client-side, it is free to optimize that too.

One thing to keep in mind: any tool that dumps xattrs (like tar --xattrs, on those lucky platforms that have it) will fetch every xattr that appears in listxattr. That is normally fast, since xattr content is already returned from the MDS. Any vxattr the MDS defines for get (option #3) that isn't optimized client-side will drastically slow down those tools, because each one costs a sync MDS request.

My vote: MDS-side setxattr, client-side getxattr (option #2).
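
To make the cost argument concrete, here is a small simulation of option #3 with and without client-side generation. All class and function names are hypothetical; the model only counts synchronous getxattr round trips and says nothing about real latencies.

```python
class CountingMDS:
    """Stand-in MDS that counts synchronous getxattr requests."""

    def __init__(self, vxattrs):
        self.vxattrs = dict(vxattrs)
        self.sync_requests = 0

    def session_open(self):
        # Option 3: advertise the vxattr names once per session.
        return sorted(self.vxattrs)

    def getxattr(self, name):
        self.sync_requests += 1
        return self.vxattrs[name]


class Option3Client:
    def __init__(self, mds, stored_xattrs, client_side=None):
        self.mds = mds
        self.stored = dict(stored_xattrs)  # regular xattrs already on the client
        self.advertised = mds.session_open()
        self.client_side = dict(client_side or {})  # name -> generator function

    def listxattr(self):
        # The client adds the advertised vxattr names to the results.
        return sorted(self.stored) + self.advertised

    def getxattr(self, name):
        if name in self.stored:
            return self.stored[name]
        if name in self.client_side:
            return self.client_side[name]()  # generated locally, no round trip
        return self.mds.getxattr(name)  # synchronous request


def dump_all(client):
    """What an xattr-dumping tool does: list, then get every name."""
    return {name: client.getxattr(name) for name in client.listxattr()}
```

In this model, dumping an inode costs one sync request per advertised vxattr unless the client generates that vxattr itself, in which case the count drops to zero.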

#15 Updated by Greg Farnum about 11 years ago

How large would a simple "layout" xattr actually be in comparison to the shipped inodes? I'm not sure the size is so significant as I glance over what's included in an InodeStat.
The reason that I ask is that implementing the getxattr stuff client-side means we're stuck with whatever the client side can represent. Which in particular sucks if we want to change the meaning of something in a way that is compatible with the protocol, but not with the client's implementation.

#16 Updated by Sage Weil about 11 years ago

Greg Farnum wrote:

How large would a simple "layout" xattr actually be in comparison to the shipped inodes? I'm not sure the size is so significant as I glance over what's included in an InodeStat.

The layout struct is 28 bytes, but encoded as strings it would be almost 100 bytes for just the ceph.*.layout xattr, and more if we add the sub-fields. Server-side also means we can't implement a hidden ceph.dir.layout.field that doesn't appear in listxattr but does in getxattr, at least not without adding new support for 'hidden' xattrs.

The reason that I ask is that implementing the getxattr stuff client-side means we're stuck with whatever the client side can represent. Which in particular sucks if we want to change the meaning of something in a way that is compatible with the protocol, but not with the client's implementation.

The layout stuff in particular is so closely tied to what the client already has to support, I'm not worried about it. If we had a layout to describe, say, content-addressable storage, then yeah, the client couldn't manipulate it. It also couldn't do anything with the file data either.
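
The binary-versus-string size comparison in this comment can be approximated as below. This is a sketch under assumptions: it models the layout as seven little-endian 32-bit fields (which is consistent with the 28 bytes quoted), and the field names and values are illustrative rather than taken from this ticket. The exact text size depends on which fields are exposed and on their values.

```python
import struct

# Assumption: seven 32-bit fields; names and values are illustrative only.
LAYOUT = {
    "stripe_unit": 4194304,
    "stripe_count": 1,
    "object_size": 4194304,
    "cas_hash": 0,
    "object_stripe_unit": 0,
    "pg_preferred": 0,
    "pg_pool": 0,
}


def binary_size(layout):
    """Size of the layout packed as little-endian unsigned 32-bit integers."""
    return len(struct.pack(f"<{len(layout)}I", *layout.values()))


def text_size(layout):
    """Size of the same layout rendered as 'key = value, ...' text."""
    text = ", ".join(f"{key} = {val}" for key, val in layout.items())
    return len(text.encode("ascii"))


if __name__ == "__main__":
    print("binary:", binary_size(LAYOUT), "bytes")
    print("text:  ", text_size(LAYOUT), "bytes")
```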

#17 Updated by Sage Weil about 11 years ago

  • Status changed from 12 to Fix Under Review

wip-vxattr (ceph.git) and wip-vxattrs (ceph-client.git). There's a test script that passes on both fuse and kclient.

ceph.{dir,file}.layout is rendered on the client from i_layout. getxattr on individual layout fields via ceph.{dir,file}.layout.$field is also handled on the client; setxattr for ceph.* passes through to the MDS.
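
From an application's point of view, reading and writing a single layout field would look roughly like this. It is a sketch, not part of the test script mentioned above: the helper names are hypothetical, `os.getxattr` and `os.setxattr` are available on Linux only, and the injectable `getxattr`/`setxattr` parameters exist purely so the helpers can be exercised without a mounted filesystem.

```python
import os


def get_layout_field(path, field, kind="file", getxattr=None):
    """Read one layout field, e.g. ceph.file.layout.stripe_unit, as an int."""
    getxattr = getxattr or os.getxattr  # os.getxattr is Linux-only
    raw = getxattr(path, f"ceph.{kind}.layout.{field}")
    return int(raw.decode("ascii"))


def set_layout_field(path, field, value, kind="file", setxattr=None):
    """Write one layout field; the value is sent as its decimal string."""
    setxattr = setxattr or os.setxattr  # os.setxattr is Linux-only
    setxattr(path, f"ceph.{kind}.layout.{field}", str(value).encode("ascii"))
```

The same calls should behave identically on fuse and kernel mounts, since both are meant to expose the same attribute names.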

#18 Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to Resolved

commit:1564c3a0a3efbde5a326001586238fde8f6648ad for userspace bits.

The kernel bits still need review; opening a separate task for that.
