Increasing Ceph portability

Summary

Support for Ceph is currently limited to Linux, with additional restrictions on file system compatibility. However, there are many benefits to extending Ceph support to new environments, such as expanding the contributor/user base, increasing confidence in correctness, and improving code maintainability.
A couple of years ago, a monolithic patch (9fde4d) was merged that allowed Ceph to build on FreeBSD. This was a great start, especially in identifying the bulk of the pivot points for portability. However, the patch was based primarily on an ad-hoc scattering of preprocessor macros. That works well when managing a few scenarios, but as the compatibility matrix grows, a principled approach to factoring out platform-specific functionality is needed.

Owners

  • Noah Watkins (Inktank)

Interested Parties

  • Noah Watkins (Inktank)
  • Sage Weil (Inktank)
  • Yehuda Sadeh (Inktank)

Current Status

Operating System

Recent efforts aimed at increasing portability have focused on building Ceph on OSX, but the work in general is not specific to OSX.
  • The primary effort here is taking place in wip-port
    • Lots of stuff is already done, or at least has been attempted
    • Stuff near HEAD is less pretty
  • Currently building on OSX 10.8 and FreeBSD 9.1
    • With the exception of a few unit tests that have not yet been ported
  • Test coverage
    • Limited to a single-node setup (vstart)
    • A selection of RADOS and libcephfs tests

This work has focused on removing platform-specific checks in favor of configure-time feature tests, providing generic feature replacements, and documenting where OSX has optimized alternatives.

Detailed Description

The overall set of issues is quite large (this page only has a partial list). So, the ordering of this list of issues is supposed to be semi-significant. I've tried to order it by (1st) broadness (e.g. stuff in libcommon), (2nd) likely users (e.g. FUSE/librados), and finally boring topics, optional features, and portability issues in unit tests.

Locking and AtomicOps

Pthread Spinlock

Ceph uses pthread_spinlock_t, but this is not a portable feature. The initialization functions for pthread_mutex_t and pthread_spinlock_t have incompatible signatures, so a simple typedef won't work to fall back to a mutex implementation.
  • Introduce ceph_spinlock_t in include/spinlock.h
  • Alternative implementations
    • Generic: pthread_mutex_t
    • OSX: OSSpinLockLock?
  • Complications
    • Ceph specifies PTHREAD_PROCESS_SHARED, but this isn't portable. I cannot find any instance of a lock being allocated in shared memory, so if that really is the case, there aren't any complications :)
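
A minimal sketch of how the wrapper could look, assuming the initialization difference is hidden behind small inline functions. The type name ceph_spinlock_t comes from the proposal above; the function names and the HAVE_PTHREAD_SPINLOCK configure macro are hypothetical, and the lock is assumed to be process-private.

```cpp
#include <pthread.h>

// HAVE_PTHREAD_SPINLOCK is a hypothetical configure-time result.
// The ceph_spin_* function names are illustrative, not existing Ceph API.
#if defined(HAVE_PTHREAD_SPINLOCK)
typedef pthread_spinlock_t ceph_spinlock_t;

static inline int ceph_spin_init(ceph_spinlock_t *l) {
  // Assumes no lock lives in shared memory, so process-private is enough.
  return pthread_spin_init(l, PTHREAD_PROCESS_PRIVATE);
}
static inline int ceph_spin_destroy(ceph_spinlock_t *l) { return pthread_spin_destroy(l); }
static inline int ceph_spin_lock(ceph_spinlock_t *l) { return pthread_spin_lock(l); }
static inline int ceph_spin_unlock(ceph_spinlock_t *l) { return pthread_spin_unlock(l); }
#else
// Generic fallback: a plain mutex behind the same interface.
typedef pthread_mutex_t ceph_spinlock_t;

static inline int ceph_spin_init(ceph_spinlock_t *l) {
  return pthread_mutex_init(l, nullptr);
}
static inline int ceph_spin_destroy(ceph_spinlock_t *l) { return pthread_mutex_destroy(l); }
static inline int ceph_spin_lock(ceph_spinlock_t *l) { return pthread_mutex_lock(l); }
static inline int ceph_spin_unlock(ceph_spinlock_t *l) { return pthread_mutex_unlock(l); }
#endif
```

Callers only ever see ceph_spin_init and friends, so the differing init signatures never leak out of the header.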

Atomic Primitives

Ceph implements an atomic_t type for atomic integer operations using libatomic-ops, and contains a backup implementation based on pthread_spinlock_t.

sem_timedwait

The ceph context service thread uses sem_timedwait to implement the heartbeat interval, which isn't a portable semaphore function.
  • Currently we disable the heartbeat interval and revert to plain sem_wait in all cases.
  • Alternatives
    • Simulate with sem_trywait, nanosleep, and a loop.
    • Build some sort of counting semaphore combined with pthread_cond_timedwait
      • pthread_mutex is not async-signal-safe, and sem_post is currently called from the SIGHUP handler
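
The first alternative (sem_trywait, nanosleep, and a loop) could be sketched roughly as follows. The helper name and the 10 ms polling step are assumptions made for illustration; the timeout is approximate because a sleep interrupted by a signal still counts as a full step.

```cpp
#include <errno.h>
#include <semaphore.h>
#include <time.h>

// Hypothetical helper: wait on `sem` for roughly `timeout_ms` milliseconds
// by polling with sem_trywait and sleeping between attempts.
// Returns 0 if the semaphore was acquired, -1 with errno set otherwise
// (ETIMEDOUT when the time budget is used up).
static int sem_wait_with_timeout(sem_t *sem, long timeout_ms) {
  const long step_ms = 10;  // polling interval; an arbitrary choice
  long waited_ms = 0;
  for (;;) {
    if (sem_trywait(sem) == 0)
      return 0;
    if (errno != EAGAIN && errno != EINTR)
      return -1;  // unexpected error, errno already set by sem_trywait
    if (waited_ms >= timeout_ms) {
      errno = ETIMEDOUT;
      return -1;
    }
    struct timespec ts;
    ts.tv_sec = 0;
    ts.tv_nsec = step_ms * 1000000L;
    nanosleep(&ts, nullptr);  // may return early on a signal; we just retry
    waited_ms += step_ms;
  }
}
```

The trade-off is wake-up latency: a post can go unnoticed for up to one polling step, which is probably fine for a heartbeat interval but worth keeping in mind.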

Integer Types

Non-standard, Linux-specific integer types such as __u32 show up where kernel headers are being reused, as do the bitwise types for Sparse (e.g. __be32).
  • How much are these being used for their intended semantics?
  • Internal use
    • Largely not a problem as we can define replacements
  • Exported headers
    • Examples
      • librados.hpp: __u8 for crush rule
      • buffer.h: __u32 for crc
    • Could provide backup definitions...
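
Backup definitions could look something like the sketch below. HAVE_LINUX_TYPES_H is a hypothetical configure-time result, and the set of types shown is only a sample. Note that the Sparse bitwise annotations are lost in the fallback, so __be32 becomes a plain 32-bit integer.

```cpp
#include <stdint.h>

// HAVE_LINUX_TYPES_H is a hypothetical configure-time result.
#if defined(HAVE_LINUX_TYPES_H)
#include <linux/types.h>
#else
// Fallback definitions built on the standard fixed-width types.
typedef uint8_t  __u8;
typedef uint16_t __u16;
typedef uint32_t __u32;
// No Sparse checking here: the endian-annotated types degrade to plain integers.
typedef uint32_t __le32;
typedef uint32_t __be32;
#endif

// Compile-time sanity checks on the widths the exported headers rely on.
static_assert(sizeof(__u8) == 1, "__u8 must be 1 byte");
static_assert(sizeof(__u32) == 4, "__u32 must be 4 bytes");
```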

Errno Value Portability

There are two issues. First, some errno values aren't available off Linux (e.g. high-numbered things like EKEYEXPIRED). The second is that the same errno macro may have a different value on different platforms. The former case is solved by making sure we are using some standard, common set or defining our own non-conflicting values to use internally.
The larger issue is dealing with errno values that are leaked out through an API from a remote system. A hypothetical example would be a Linux OSD somehow returning EAGAIN (11) over the network, through a fuse client on OSX, where a user would have EAGAIN = 35.
  • Create an internal errno.h that handles definitions of missing errno values
  • Not clear whether exported errno values should be translated in a controlled way, or just handled on a case-by-case basis
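
One possible shape for a controlled translation is to pin the on-wire codes to fixed numbers and convert at the API boundary. This is only a sketch: the CEPH_WIRE_* names, the choice of Linux values for the wire, and the EIO default for unknown codes are all assumptions.

```cpp
#include <errno.h>

// Hypothetical on-wire error codes, pinned to the Linux numbering so they
// mean the same thing no matter which platform sent them.
enum ceph_wire_errno {
  CEPH_WIRE_EPERM = 1,
  CEPH_WIRE_ENOENT = 2,
  CEPH_WIRE_EAGAIN = 11,
  CEPH_WIRE_ETIMEDOUT = 110,
};

// Convert a wire code to the local platform's errno value.
static int wire_to_host_errno(int wire) {
  switch (wire) {
    case CEPH_WIRE_EPERM: return EPERM;
    case CEPH_WIRE_ENOENT: return ENOENT;
    case CEPH_WIRE_EAGAIN: return EAGAIN;
    case CEPH_WIRE_ETIMEDOUT: return ETIMEDOUT;
    default: return EIO;  // assumption: unknown codes collapse to EIO
  }
}

// Convert a local errno value to its wire code. An if-chain is used because
// a switch fails to compile when two errno macros share a value on some host.
static int host_to_wire_errno(int host) {
  if (host == EPERM) return CEPH_WIRE_EPERM;
  if (host == ENOENT) return CEPH_WIRE_ENOENT;
  if (host == EAGAIN) return CEPH_WIRE_EAGAIN;
  if (host == ETIMEDOUT) return CEPH_WIRE_ETIMEDOUT;
  return -1;  // assumption: caller decides what to do with unmapped values
}
```

With a table like this, the EAGAIN example above would arrive as 11 on the wire and be handed to the OSX user as the local EAGAIN.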

FUSE Extended Attributes

On OSX an optional offset parameter allows partial xattr writing. The call signature is different, and is trivial to handle with ifdefs. Currently we return -EOPNOTSUPP in any case in which a non-zero offset is provided. From the OSX manpage:

In the current implementation, only the resource fork extended attribute makes use of this argument. For all others, position is reserved and should be set to zero.

A cursory investigation seems to indicate the attribute itself is used frequently, but it is unclear how often non-zero positions are used.

Networking and Endianness Oh My!

O_DIRECT, O_SYNC, O_RSYNC, O_DSYNC

FileJournal
As for the synchronization flags I'm less clear. There may need to be synchronization calls associated with writes (or our own write_sync that handles the platform-specific extra calls).
  • Is there a well-defined contract for the backing file system?
  • OSX has F_NOCACHE to replace O_DIRECT.
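
As a rough illustration of the O_DIRECT versus F_NOCACHE split, an open wrapper could pick whichever mechanism the platform headers offer. The function name is hypothetical, and the sketch says nothing about the synchronization flags or about any alignment requirements on the I/O that follows.

```cpp
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: open `path` with caching bypassed where the
// platform offers a way to do so. Returns a file descriptor or -1 with errno.
static int open_uncached(const char *path, int flags, mode_t mode) {
#if defined(O_DIRECT)
  // Linux-style: request direct I/O at open time.
  return open(path, flags | O_DIRECT, mode);
#elif defined(F_NOCACHE)
  // OSX-style: open normally, then turn off caching on the descriptor.
  int fd = open(path, flags, mode);
  if (fd >= 0 && fcntl(fd, F_NOCACHE, 1) < 0) {
    int saved = errno;
    close(fd);
    errno = saved;
    return -1;
  }
  return fd;
#else
  // Generic: no known equivalent, fall back to a cached open.
  return open(path, flags, mode);
#endif
}
```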

Client

The O_SYNC flag and friends are accepted by libcephfs, but may not be defined on non-Linux platforms. Defining CEPH_O_SYNC, CEPH_O_DSYNC etc... could solve this issue.
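
A sketch of what the CEPH_O_* idea might look like in practice. The numeric values and the translation function are hypothetical, as is the policy of falling back to the stronger O_SYNC when O_DSYNC or O_RSYNC is missing.

```cpp
#include <fcntl.h>

// Hypothetical platform-independent flag values; the actual numbers would
// need to be fixed as part of the libcephfs interface.
enum {
  CEPH_O_SYNC = 0x1,
  CEPH_O_DSYNC = 0x2,
  CEPH_O_RSYNC = 0x4,
};

// Translate CEPH_O_* synchronization flags into whatever the host provides.
static int ceph_sync_flags_to_host(int ceph_flags) {
  int host = 0;
#if defined(O_SYNC)
  if (ceph_flags & CEPH_O_SYNC)
    host |= O_SYNC;
#endif
#if defined(O_DSYNC)
  if (ceph_flags & CEPH_O_DSYNC)
    host |= O_DSYNC;
#elif defined(O_SYNC)
  if (ceph_flags & CEPH_O_DSYNC)
    host |= O_SYNC;  // assumption: stronger guarantee is an acceptable stand-in
#endif
#if defined(O_RSYNC)
  if (ceph_flags & CEPH_O_RSYNC)
    host |= O_RSYNC;
#elif defined(O_SYNC)
  if (ceph_flags & CEPH_O_RSYNC)
    host |= O_SYNC;  // same assumption as above
#endif
  return host;
}
```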

FUSE IOCTL Flags

Ceph FUSE defines CEPH_IOC_GET_LAYOUT and friends in terms of the Linux IOCTL magic numbering macros, which are Linux-specific. These do not strictly need to take on the same values as the IOCTL numbers used in the kclient, but it seems like they should be identical so that software works with either FUSE or the kclient.

Final Log Flushing

Prior to exiting, Log is flushed using the on_exit(func, context) feature. The portable equivalent is atexit(function), but it can't record the Log instance context. Building a small facility to keep track of the Log instances that should be flushed should be straightforward. If we aren't racing with ~Log, can we register an atexit handler for a Log instance method? How much potential log context might we be losing by using an auto_ptr on Log? Even atexit won't run in situations like SIGKILL...
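
The "small facility" could be sketched as a registry that installs a single atexit handler and flushes whatever is still registered. Everything here is hypothetical (the Flushable interface, ExitFlushRegistry, and the CountingLog stand-in), and it assumes each Log removes itself in its destructor so the handler never touches a dead instance.

```cpp
#include <cstdlib>
#include <mutex>
#include <set>

// Hypothetical interface standing in for Log; only flush() matters here.
struct Flushable {
  virtual void flush() = 0;
  virtual ~Flushable() {}
};

class ExitFlushRegistry {
 public:
  static ExitFlushRegistry &instance() {
    // Heap-allocated and never freed, so it cannot be destroyed before
    // the atexit handler gets a chance to run.
    static ExitFlushRegistry *r = create();
    return *r;
  }
  void add(Flushable *f) {
    std::lock_guard<std::mutex> g(lock_);
    items_.insert(f);
  }
  void remove(Flushable *f) {
    std::lock_guard<std::mutex> g(lock_);
    items_.erase(f);
  }
  void flush_all() {
    std::lock_guard<std::mutex> g(lock_);
    for (Flushable *f : items_)
      f->flush();
  }

 private:
  ExitFlushRegistry() {}
  static ExitFlushRegistry *create() {
    ExitFlushRegistry *r = new ExitFlushRegistry;
    std::atexit(&ExitFlushRegistry::flush_at_exit);  // one handler for all logs
    return r;
  }
  static void flush_at_exit() { instance().flush_all(); }

  std::mutex lock_;
  std::set<Flushable *> items_;
};

// Toy stand-in for a Log instance, used only to show the registry working.
struct CountingLog : Flushable {
  int flushes = 0;
  void flush() override { ++flushes; }
};
```

As noted above, none of this helps when the process is killed outright, since atexit handlers do not run in that case.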

FileStore

See also the ZFS blueprint, which already starts to address some of the abstraction of the underlying file system:
osd - ceph on zfs

File Extent Mapping

The FIEMAP ioctl is Linux-specific.
  • If FS_IOC_FIEMAP is not defined, do_fiemap will unconditionally return -EOPNOTSUPP.
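
The guard could be structured along these lines, assuming the macro being tested is FS_IOC_FIEMAP (the Linux request code for the FIEMAP ioctl). HAVE_LINUX_FIEMAP_H is a hypothetical configure-time result, and the function is a simplified stand-in for do_fiemap that only counts extents.

```cpp
#include <errno.h>
#include <stdint.h>
#include <string.h>

// HAVE_LINUX_FIEMAP_H is a hypothetical configure-time result.
#if defined(HAVE_LINUX_FIEMAP_H)
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#endif

// Simplified stand-in for do_fiemap: returns the number of extents mapped
// in [off, off + len), or a negative errno value.
static int count_extents_sketch(int fd, uint64_t off, uint64_t len) {
#if defined(FS_IOC_FIEMAP)
  struct fiemap fm;
  memset(&fm, 0, sizeof(fm));
  fm.fm_start = off;
  fm.fm_length = len;
  fm.fm_flags = FIEMAP_FLAG_SYNC;
  fm.fm_extent_count = 0;  // zero means: only report how many extents exist
  if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
    return -errno;
  return static_cast<int>(fm.fm_mapped_extents);
#else
  // No FIEMAP on this platform: report it as unsupported.
  (void)fd;
  (void)off;
  (void)len;
  return -EOPNOTSUPP;
#endif
}
```

Callers would then need to treat -EOPNOTSUPP as "assume the whole range is mapped" or similar, which is a policy decision outside this sketch.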

File System

There are a number of #ifdef linux guards protecting BTRFS features in places like FileStore::mkfs and FileStore::mount. These are presumably also the places where conditional checks for ZFS features will go. Ideally we can factor these features out, but for the time being the current guards work alright.

Internal Optional Stuffs

Characterized by being non-user facing, non-critical, and/or optimizations not affecting correctness.
  • posix_fadvise
    • give the OS data usage hints
    • generic strategy: don't do anything
  • posix_fallocate
    • pre-allocate journal to avoid fragmentation
    • OSX: fcntl(F_PREALLOCATE)
    • generic strategy: write a few zeros to end of each file block
  • get_process_name

Types and std::hash

The types uint64_t and int64_t do not seem to be defined in the standard headers on OSX. A backup implementation is provided in src/include/types.h. A backup is also provided for hashing pthread_t.

TEMP_FAILURE_RETRY

This is trivial to reproduce for a new platform. The only due diligence that needs to be done is to check the retval/errno semantics for each I/O routine being wrapped to make sure the loop conditional is correct.
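
A portable stand-in can be written as a small template rather than a macro. The name retry_on_eintr is hypothetical; the loop condition mirrors the usual "result is -1 and errno is EINTR" convention, which is exactly the part that has to be checked per wrapped routine.

```cpp
#include <errno.h>

// Hypothetical replacement for TEMP_FAILURE_RETRY: call `fn` again as long
// as it fails with -1 and errno == EINTR, then return its final result.
template <typename Fn>
static long retry_on_eintr(Fn fn) {
  long r;
  do {
    r = fn();
  } while (r == -1 && errno == EINTR);
  return r;
}
```

Usage would look like retry_on_eintr([&] { return read(fd, buf, len); }), assuming read follows the -1/EINTR convention on the target platform.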

Thread-local Storage

Despite some indication online that people have gotten __thread to work with clang on newer OSX, this still isn't working. There is a single instance of __thread in use by rados_sync.cc.
  • Switch use of __thread to use pthread_[gs]etspecific
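
The switch could follow the usual key-plus-once pattern sketched below. The variable being replaced in rados_sync.cc is not shown here, so a per-thread int counter is used as a placeholder, and all names are hypothetical.

```cpp
#include <pthread.h>
#include <stdlib.h>

// One key shared by all threads; each thread stores its own pointer under it.
static pthread_key_t tls_key;
static pthread_once_t tls_once = PTHREAD_ONCE_INIT;

// Runs at thread exit for every thread that stored a non-null value.
static void tls_free(void *p) { free(p); }

static void tls_make_key() { pthread_key_create(&tls_key, tls_free); }

// Returns this thread's private counter, allocating it (zeroed) on first use.
// Returns nullptr if the allocation or the key assignment fails.
static int *thread_counter() {
  pthread_once(&tls_once, tls_make_key);
  int *p = static_cast<int *>(pthread_getspecific(tls_key));
  if (!p) {
    p = static_cast<int *>(calloc(1, sizeof(int)));
    if (p && pthread_setspecific(tls_key, p) != 0) {
      free(p);
      p = nullptr;
    }
  }
  return p;
}
```

Compared with __thread this costs a function call per access, which seems acceptable for a single use site.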

Building on Case-insensitive File Systems

Automake will produce intermediate files for Pipe.cc and pipe.c with identical names on a case-insensitive file system.
  • Rename pipe.c to something like pipe_cloexec.c

The AUTOMAKE_OPTIONS setting subdir-objects will apparently be required in the future (it places object files in the directory with their respective source file). Since Pipe.cc and pipe.c are in different subdirectories, this issue would go away with this automake option.

Work items

Coding tasks

  1. Split wip-port into wip-port-upstream (reviewable) and wip-port (remainder parts)
  2. Get reviews for wip-port-upstream
  3. Continue to split up left overs in wip-port
  4. Repeat feedback loop

Build / release tasks

  1. Create some interesting gitbuilders :)
    1. FreeBSD, Solaris, OSx86 (shhh!)
    2. Big endian versions with Qemu?
  2. Normal release process with portability changes
  3. Create OSX Homebrew formula for some stable release
