On Linux, splice(2), vmsplice(2), and tee(2), together with pipes, provide the building blocks for moving data between sockets and files without copying pages. Extend the bufferlist library to do this for you.
- Sage Weil (Affiliation)
- Haomai Wang
bufferlists are used throughout all of the Ceph userland code to manipulate buffers. They are reference counted and support all of the concatenation and substring-type operations that you need to do without requiring any data copies. However, data is initially copied from kernel memory into a user buffer that the bufferlist library manages, and eventually copied into the page cache (or the reverse).
splice(2) and vmsplice(2) provide the necessary APIs to avoid those copies by allowing kernel pages to be moved around in pipes, but bufferlist doesn't know how to do that.
splice(2) is the main tool: it lets you move data between a socket or other fd and a pipe. The usual pattern is to splice into a pipe, and then splice from the pipe to the final destination. vmsplice(2) lets you gift user pages to a pipe; it may or may not be useful. tee(2) duplicates pages from one pipe into another without consuming them from the source.
The pipe is used as a 'buffer' of sorts; it's a container for page references and the tool that we use to move kernel pages around. By default a pipe holds 64KB (16 pages of 4KB), but you can call fcntl(2) with F_SETPIPE_SZ to make that larger, normally up to 1MB for an unprivileged process (the limit is set by /proc/sys/fs/pipe-max-size). This lets a user process grab a bunch of data from a socket and carry it around for later splice(2)ing to a file.
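As a rough sketch of resizing a pipe (make_sized_pipe is a hypothetical helper name, not an existing Ceph function), one might write:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // F_SETPIPE_SZ / F_GETPIPE_SZ are Linux extensions
#endif
#include <fcntl.h>   // fcntl, F_SETPIPE_SZ, F_GETPIPE_SZ
#include <unistd.h>  // pipe

// Hypothetical helper: create a pipe and try to grow it to `want` bytes.
// Returns the capacity the kernel actually granted, or -1 on error.
int make_sized_pipe(int fds[2], int want) {
  if (pipe(fds) < 0) return -1;
  // The kernel rounds the request up. A request above the system limit
  // fails with EPERM for unprivileged callers, in which case the pipe
  // simply keeps its default capacity, so the result is ignored here.
  (void)fcntl(fds[1], F_SETPIPE_SZ, want);
  // Ask what we ended up with instead of assuming the request succeeded.
  return fcntl(fds[1], F_GETPIPE_SZ);
}
```

Checking the result with F_GETPIPE_SZ matters because the granted capacity can be either larger (rounding) or smaller (limit hit) than what was requested.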
Currently buffers are managed via buffer::raw; we would presumably add a buffer::raw variant that has a pipe to carry those pages.
One or more new buffer splice() methods will probably need to be added to move pages between a socket or file and a pipe.
Ideally, the buffers will never be inspected, since we don't have pointers to their content. If c_str() and friends are called, however, the bufferlist will need to map those pages to a usable address to allow that. This should be possible with tee(2). One possibility is splicing/teeing to a temporary file (e.g. using the usual open, unlink, mmap, ..., close pattern) in a tmpfs mount that we then mmap(2). There may be a simpler/cleaner way.
- build a buffer::raw that handles a pipe
- build splice_from
- build splice_to
- build a set of unit tests that exercise the above and move data between files and sockets
- allow inspection of memory carried by buffer pipes (e.g., c_str(), use of iterator)
- some instrumentation to tell whether pages are making it from point a to b without being converted (so we can tell some c_str() call along the way isn't torpedoing our zero-copy)
- make sure new api bits are doxygen-documented in header