Project

General

Profile

Accelio RDMA Messenger

Summary

Add a flexible RDMA/Infiniband transport to Ceph, extending Ceph's Messenger. Integrate the new Messenger with Mon, OSD, MDS, librados (RadosClient), rados, and libcephfs (Client).

Accelio is by design a transport-independent messaging abstraction. It currently runs over RDMA (InfiniBand or Ethernet); a TCP implementation is in progress.

Owners

  • Matt Benjamin (CohortFS, LLC)
  • Eyal Salomon (Mellanox)

Interested Parties

  • Name (Affiliation)
  • Name (Affiliation)
  • Name

Current Status

A prototype implementation of the core functionality has been produced, plus work-in-progress integration with Mon, OSD, RadosClient, and rados.

It's possible to run most or all "rados" commands over the Accelio transport, with mostly good results. Similar integration with the Ceph MDS and libcephfs has been started.

A client/server test harness (xio_client/xio_server) is available for experimentation and benchmarking. The current prototype has been benchmarked in two workloads: a max-IOPS workload based on MPing, and a throughput workload based on MDataPing (an MPing with an arbitrary data payload). Using MPing, XioMessenger can reliably sustain 500K round-trip messages per second with less than 20% CPU utilization on a 56 Gb InfiniBand fabric. Using the MDataPing workload, we have observed 3200 MB/s throughput with a 64K data payload, with less than 15% CPU utilization, in the same configuration.

These numbers are preliminary. The IOPS high watermark has been reproduced at Mellanox, but the throughput high watermark has not yet been.

Detailed Description

Add a new XioMessenger class deriving from Messenger and an XioConnection class deriving from Connection (plus supporting classes). Refactor Messenger and Connection to decrease the coupling of Connection with SimpleMessenger. (completed)

Extend the address representation/host-identification strategy just enough to permit sites to use the RDMA transport when explicitly requested. One of Accelio's goals is to support multi-path and transport agility (i.e., transparent transport selection), but this won't be available for a few months, and sites will eventually need explicit transport selection anyway. For this pass, we don't try to finalize transport selection, and we don't add Accelio transport notation to entity notation or maps; instead, since InfiniBand RDMA uses TCP-style addressing, we use a port offset to find Accelio endpoints (this needs review and discussion). (minimal integration completed)
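The port-offset scheme can be sketched in a few lines. This is an illustrative sketch only: `XIO_PORT_SHIFT` and `xio_port_for` are hypothetical names, and the offset value is an assumption, not the value under review.

```cpp
#include <cstdint>

// Hypothetical sketch of the proposed addressing rule: since InfiniBand
// RDMA reuses TCP-style addressing, the Accelio listener for an entity
// is found at a fixed offset from its advertised TCP port. The symbol
// names and the offset value below are illustrative assumptions.
constexpr uint16_t XIO_PORT_SHIFT = 1111;  // illustrative offset value

inline uint16_t xio_port_for(uint16_t tcp_port) {
  // Real code would have to validate the range and handle wraparound;
  // this sketch just applies the shift.
  return static_cast<uint16_t>(tcp_port + XIO_PORT_SHIFT);
}
```

The appeal of this scheme is that no new notation enters entity addresses or maps; its weakness, flagged above for review, is that the offset convention must be agreed upon out of band by every site.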

Add simple client/server programs for testing and initial benchmarking. (completed, review needed)

Extend the Ceph Mon, OSD, and MDS servers with XioMessenger instances alongside (some or all of) SimpleMessenger instances, and update all data paths to remove dependencies on SimpleMessenger, the most common of which is the assumption that requests can be answered from a default messenger; instead, replies should be sent via the originating messenger. (provisionally completed in Mon and OSD, tuning needed)
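The "reply via the originating messenger" rule can be illustrated with a minimal sketch. The types below (`Connection`, `Message`, `reply_to`) are simplified stand-ins, not Ceph's real classes; the point is only the routing discipline.

```cpp
#include <memory>
#include <string>

// Simplified stand-ins for the messenger types; not Ceph's real classes.
struct Connection;

struct Message {
  std::string payload;
  std::shared_ptr<Connection> conn;  // connection the request arrived on
};

struct Connection {
  std::string transport;   // e.g. "simple" or "xio"
  std::string last_sent;   // records the last reply, for this sketch
  void send_message(const Message& m) { last_sent = m.payload; }
};

// The anti-pattern is answering through a process-wide default
// messenger, which can send an XioMessenger client's reply out the
// SimpleMessenger. The fix sketched here: always route the reply back
// through the connection the request itself carries.
void reply_to(const Message& req, const std::string& data) {
  req.conn->send_message(Message{data, req.conn});
}
```

With two messengers live in the same daemon, this discipline guarantees a reply leaves on the same transport the request arrived on, whichever that was.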

Extend the RadosClient and Client classes to incorporate XioMessenger instances alongside, or instead of (depending on options), SimpleMessenger, and expose the new functionality through the librados and libcephfs interfaces. In the first pass, just allow callers to specify which messenger transport to use. (completed in RadosClient/librados)
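The caller-specified transport might take the shape of a small factory keyed on a config string. This is a sketch under assumptions: the option spelling and class shapes below are illustrative, not the final librados API.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Illustrative sketch of per-client transport selection; the class
// shapes and the option string values are assumptions, not Ceph's API.
struct Messenger {
  virtual ~Messenger() = default;
  virtual std::string type() const = 0;
};

struct SimpleMessenger : Messenger {
  std::string type() const override { return "simple"; }
};

struct XioMessenger : Messenger {
  std::string type() const override { return "xio"; }
};

// RadosClient would consult its configuration and construct the
// requested transport, rejecting unknown names early.
std::unique_ptr<Messenger> create_messenger(const std::string& ms_type) {
  if (ms_type == "xio")    return std::make_unique<XioMessenger>();
  if (ms_type == "simple") return std::make_unique<SimpleMessenger>();
  throw std::invalid_argument("unknown messenger type: " + ms_type);
}
```

Keeping selection behind a single factory means later transport agility (choosing per-peer, or transparently) only changes the factory, not its callers.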

Building on librados, extend the rados command tool, and verify all supported pool and object operations on the Accelio transport. Extend the qemu-kvm environment, and verify a VM block device workload on the Accelio transport. (rados command completed, more to follow)

Building on libcephfs, extend the FUSE client, and verify general CephFS file system functionality using standard workload-generating tools (e.g., postmark, fio). (started)

Measure performance on both RADOS and CephFS workloads, aiming for wire throughput of not less than 2000 MB/s for aligned block data transfers of 64K bytes. (Whether end-to-end throughput increases are available without larger changes in OSD data paths is not established, but this question will be explored.)
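As a back-of-envelope check on that target, 2000 MB/s at 64 KiB per transfer works out to roughly 30,500 completed transfers per second on the wire (taking 1 MB as 1,000,000 bytes, which is an assumption about the units the target uses):

```cpp
#include <cstdint>

// Message rate implied by a throughput target: bytes per second divided
// by payload size. Assumes 1 MB = 1,000,000 bytes and a 64 KiB payload.
constexpr uint64_t msgs_per_sec(uint64_t mb_per_sec, uint64_t payload_bytes) {
  return (mb_per_sec * 1000000ULL) / payload_bytes;
}
```

This is well within the message rate the MPing benchmark already demonstrates, so the open question is payload handling in the OSD data path, not messenger signaling rate.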

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3