Increasing Ceph portability » History » Version 1

Jessica Mack, 06/21/2015 04:20 AM

h1. Increasing Ceph portability

h3. Summary

Support for Ceph is currently limited to Linux, with additional restrictions on file system compatibility. However, there are many benefits to extending Ceph support to new environments, such as expanding the contributor/user base, increasing confidence in correctness, and improving code maintainability.

A couple of years ago, a monolithic patch (9fde4d) was merged that allowed Ceph to build on FreeBSD. This was a great start, especially in identifying the bulk of the pivot points for portability. However, the patch was primarily based on an ad-hoc scattering of preprocessor macros. This works well when managing a few scenarios, but as the compatibility matrix grows, a principled approach to factoring out platform-specific functionality is needed.

h3. Owners

* Noah Watkins (Inktank)

h3. Interested Parties

* Noah Watkins (Inktank)
* Sage Weil (Inktank)
* Yehuda Sadeh (Inktank)

h3. Current Status

h4. Operating System

Recent efforts aimed at increasing portability have focused on building Ceph on OSX, but the work in general is not specific to OSX.

* The primary effort is taking place in the wip-port branch
** Much of the work is already done, or at least has been attempted
** Commits near HEAD are less polished
* Currently building on OSX 10.8 and FreeBSD 9.1
** With the exception of a few unit tests that have not yet been ported
* Test coverage
** Limited to a single-node setup (vstart)
** A selection of RADOS and libcephfs tests

This work has focused on removing platform-specific checks in favor of configure-time feature tests, providing generic feature replacements, and documenting where OSX has optimized alternatives.

h4. %{color:gold}Detailed Description%

The overall set of issues is quite large (this page has only a partial list), so the ordering of the list below is meant to be semi-significant. I've tried to order it first by breadth (e.g. code in libcommon), second by likely users (e.g. FUSE/librados), and finally by boring topics, optional features, and portability issues in unit tests.

h3. Locking and AtomicOps

h4. Pthread Spinlock

Ceph uses pthread_spinlock_t, but this is not a portable feature. The initialization function signatures for pthread_mutex_t and pthread_spinlock_t are not compatible, so a simple typedef won't work to fall back to a mutex implementation.

* Introduce ceph_spinlock_t in include/spinlock.h
* Alternative implementations
** Generic: pthread_mutex_t
** OSX: OSSpinLockLock?
* Complications
** Ceph specifies PTHREAD_PROCESS_SHARED, but this isn't portable. I cannot find any instance of a lock being allocated in shared memory, so if that really is the case, there aren't any complications :)
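
The generic fallback described above could look roughly like the following. This is a minimal sketch, not the code in wip-port: the ceph_spin_* function names are hypothetical, and only the type name ceph_spinlock_t comes from this page. Wrapping the mutex in a struct with its own init function sidesteps the incompatible initializer signatures.

```c
#include <pthread.h>
#include <stddef.h>

/* Generic fallback: a spinlock type backed by a plain mutex.
 * The ceph_spin_* names are illustrative assumptions. */
typedef struct {
    pthread_mutex_t mutex;
} ceph_spinlock_t;

/* Each call returns 0 on success or an error number, like the
 * pthread functions it forwards to. */
static inline int ceph_spin_init(ceph_spinlock_t *l)
{
    /* Default attributes: no PTHREAD_PROCESS_SHARED. */
    return pthread_mutex_init(&l->mutex, NULL);
}

static inline int ceph_spin_destroy(ceph_spinlock_t *l)
{
    return pthread_mutex_destroy(&l->mutex);
}

static inline int ceph_spin_lock(ceph_spinlock_t *l)
{
    return pthread_mutex_lock(&l->mutex);
}

static inline int ceph_spin_unlock(ceph_spinlock_t *l)
{
    return pthread_mutex_unlock(&l->mutex);
}
```

A platform with a native spinlock could swap the struct member and the four function bodies without touching any caller.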

h4. Atomic Primitives

Ceph implements an atomic_t type for atomic integer operations using libatomic-ops, and contains a backup implementation based on pthread_spinlock_t.

* Switch the backup implementation to use ceph_spinlock_t
* Alternative implementations
** Contribute packaging and testing of libatomic-ops to the OSX package manager Homebrew
** There are a number of alternative implementations used in Chromium that could be repurposed (https://code.google.com/p/chromium/codesearch#chromium/src/base/atomicops.h&q=atomic&sq=package:chromium&type=cs&l=5)

h4. sem_timedwait

The Ceph context service thread uses sem_timedwait to implement the heartbeat interval, but sem_timedwait isn't a portable semaphore function.

* Currently we disable the heartbeat interval and fall back to plain sem_wait in all cases.
* Alternatives
** Simulate with sem_trywait, nanosleep, and a loop.
** Build some sort of counting semaphore combined with pthread_cond_timedwait
*** pthread_mutex is not signal safe, but sem_post is called from SIGHUP handling
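
The first alternative (sem_trywait, nanosleep, and a loop) might be sketched as follows. The function name ceph_sem_wait_ms and the 1 ms polling step are assumptions made for illustration. The timeout is approximate, because time spent outside nanosleep is not counted and an interrupted sleep is counted as a full step.

```c
#include <errno.h>
#include <semaphore.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: wait on sem for roughly timeout_ms milliseconds.
 * Returns 0 if the semaphore was acquired, or -1 with errno set
 * (ETIMEDOUT when the timeout expired). */
static int ceph_sem_wait_ms(sem_t *sem, long timeout_ms)
{
    const long step_ms = 1;                          /* polling granularity */
    const struct timespec step = { 0, step_ms * 1000000L };
    long waited_ms = 0;

    for (;;) {
        if (sem_trywait(sem) == 0)
            return 0;                                /* acquired */
        if (errno != EAGAIN && errno != EINTR)
            return -1;                               /* real error */
        if (waited_ms >= timeout_ms) {
            errno = ETIMEDOUT;
            return -1;
        }
        nanosleep(&step, NULL);                      /* back off, then retry */
        waited_ms += step_ms;
    }
}
```

The trade-off is wake-up latency versus CPU use: a poster is noticed up to one polling step late, and a shorter step means more wake-ups while idle.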

h3. Integer Types

Non-standard, Linux-specific integer types such as __u32 may be exported by reused kernel headers, as are the bitwise types for Sparse (e.g. __be32).

* How much are these being used for their intended semantics?
* Internal use
** Largely not a problem as we can define replacements
* Exported headers
** Examples
*** librados.hpp: __u8 for crush rule
*** buffer.h: __u32 for crc
** Could provide backup definitions...
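
Backup definitions could be as simple as the sketch below, which assumes the stdint.h fixed-width types are acceptable stand-ins. The choice of which types to cover is illustrative rather than a survey of what the exported headers actually use.

```c
#include <stdint.h>

#if defined(__linux__)
#include <linux/types.h>   /* provides __u8, __u32, __le32, __be32, ... */
#else
/* Backup definitions for platforms without the Linux kernel headers. */
typedef uint8_t  __u8;
typedef uint16_t __u16;
typedef uint32_t __u32;
typedef uint64_t __u64;

/* The endian-tagged types only carry meaning under Sparse, so without
 * it they collapse to plain integers of the same width. */
typedef uint16_t __le16;
typedef uint32_t __le32;
typedef uint64_t __le64;
typedef uint16_t __be16;
typedef uint32_t __be32;
typedef uint64_t __be64;
#endif
```

One caveat for exported C++ headers: the fallback types have the right width, but a 64-bit type may map to a different underlying integer type than on Linux, which can matter for overload resolution and printf format strings.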

h3. Errno Value Portability

There are two issues. First, some errno values aren't available outside Linux (e.g. high-numbered ones like EKEYEXPIRED). Second, the same errno macro may have a different value on different platforms. The first issue is solved by making sure we use a standard, common set, or by defining our own non-conflicting values for internal use.

The larger issue is dealing with errno values that leak out through an API from a remote system. A hypothetical example would be a Linux OSD somehow returning EAGAIN (11) over the network, through a FUSE client on OSX, where the user would have EAGAIN = 35.

* Create an internal errno.h that handles definitions of missing errno values
* It is not clear whether exported errno values should be controlled systematically or just handled on a case-by-case basis
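
One way to control exported values is to fix a set of wire-level codes and translate at the boundary. The sketch below is only an illustration of that idea: the CEPH_WIRE_* names and both functions are hypothetical, the wire values are pinned to the Linux numbers, and only a handful of codes are covered.

```c
#include <errno.h>

/* Hypothetical wire-level codes, pinned to the Linux values so that
 * they mean the same thing on every platform. */
enum {
    CEPH_WIRE_ENOENT = 2,
    CEPH_WIRE_EIO    = 5,
    CEPH_WIRE_EAGAIN = 11,
    CEPH_WIRE_EEXIST = 17,
    CEPH_WIRE_EINVAL = 22
};

/* Translate a positive wire code into the host's errno value.
 * Codes this sketch does not know about degrade to EIO. */
static int ceph_errno_from_wire(int wire)
{
    switch (wire) {
    case 0:                return 0;
    case CEPH_WIRE_ENOENT: return ENOENT;
    case CEPH_WIRE_EAGAIN: return EAGAIN;
    case CEPH_WIRE_EEXIST: return EEXIST;
    case CEPH_WIRE_EINVAL: return EINVAL;
    default:               return EIO;
    }
}

/* Translate a host errno value into its wire code before sending. */
static int ceph_errno_to_wire(int host)
{
    switch (host) {
    case 0:      return 0;
    case ENOENT: return CEPH_WIRE_ENOENT;
    case EAGAIN: return CEPH_WIRE_EAGAIN;
    case EEXIST: return CEPH_WIRE_EEXIST;
    case EINVAL: return CEPH_WIRE_EINVAL;
    default:     return CEPH_WIRE_EIO;
    }
}
```

With this in place, the EAGAIN example above would arrive as wire code 11 and be handed to the OSX user as the host's own EAGAIN.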

h3. FUSE Extended Attributes

On OSX an optional offset parameter allows partial xattr writing. The call signature is different, which is trivial to handle with ifdefs. Currently we return -EOPNOTSUPP whenever a non-zero offset is provided. From the OSX manpage:

<pre>
In the current implementation, only the resource fork extended attribute makes use of this argument. For all others, position is reserved and should be set to zero.
</pre>

A cursory investigation seems to indicate that the attribute itself is used frequently, but it is unclear how often non-zero positions are used.

h3. Networking and Endianness Oh My!

* Networking
** The rsockets blueprint may be relevant
*** [[msgr - implement infiniband support via rsockets]]
** MSG_NOSIGNAL / MSG_MORE
** struct differences
** htons/ntohs

h3. O_DIRECT, O_SYNC, O_RSYNC, O_DSYNC

h4. FileJournal

I'm less clear about the synchronization flags. There may need to be synchronization calls associated with writes (or our own write_sync that handles the platform-specific extra calls).

* Is there a well-defined contract for the backing file system?
* OSX has F_NOCACHE to replace O_DIRECT.

h4. Client

The O_SYNC flag and friends are accepted by libcephfs, but may not be defined on non-Linux platforms. Defining CEPH_O_SYNC, CEPH_O_DSYNC, etc. could solve this issue.

h3. FUSE IOCTL Flags

Ceph FUSE defines CEPH_IOC_GET_LAYOUT and friends in terms of the Linux IOCTL magic numbering macros, which are Linux-specific. -These do not actually need to take on the same value as the IOCTL numbers used in the kclient- It seems like these should be identical so that software works with either FUSE or kclient.

h3. Final Log Flushing

Prior to exiting, Log is flushed using the on_exit(func, context) feature. The portable equivalent is atexit(function), but it can't record the Log instance context. Building a small facility to keep track of the Log instances that should be flushed should be straightforward. If we aren't racing with ~Log, can we register an atexit handler for a Log instance method? How much potential log context might we lose by using an auto_ptr on Log? Even atexit won't run in situations like SIGKILL...
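
The small tracking facility could be sketched like this: a table of (callback, context) pairs with a single atexit handler that walks it. All names here are hypothetical, the fixed capacity is an arbitrary choice, and the sketch is not thread-safe, so it does not answer the question about racing with ~Log.

```c
#include <stddef.h>
#include <stdlib.h>

#define CEPH_MAX_EXIT_LOGS 8          /* arbitrary capacity for the sketch */

typedef void (*ceph_flush_fn)(void *log);

static struct {
    ceph_flush_fn fn;
    void *log;
} exit_logs[CEPH_MAX_EXIT_LOGS];
static size_t exit_log_count;
static int exit_hook_installed;

/* Flush every registered log, newest first, and empty the table.
 * Calling it again afterwards is a no-op. */
static void ceph_flush_exit_logs(void)
{
    while (exit_log_count > 0) {
        exit_log_count--;
        exit_logs[exit_log_count].fn(exit_logs[exit_log_count].log);
    }
}

/* Remember a log instance to flush at exit.
 * Returns 0 on success, -1 if the table is full or atexit fails. */
static int ceph_flush_log_at_exit(ceph_flush_fn fn, void *log)
{
    if (exit_log_count == CEPH_MAX_EXIT_LOGS)
        return -1;
    if (!exit_hook_installed) {
        if (atexit(ceph_flush_exit_logs) != 0)
            return -1;
        exit_hook_installed = 1;
    }
    exit_logs[exit_log_count].fn = fn;
    exit_logs[exit_log_count].log = log;
    exit_log_count++;
    return 0;
}

/* Illustration only: a callback that treats its context as a counter. */
static void example_flush(void *log)
{
    ++*(int *)log;
}
```

A real version would also need a way to unregister an instance from its destructor, so that the handler never touches a Log that has already been destroyed.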

h3. FileStore

Cross-reference the ZFS blueprint, which already appears to start addressing some of the abstraction of the underlying file system.
[[osd - ceph on zfs]]

h4. File Extent Mapping

The FIEMAP ioctl is Linux-specific.

* If FS_IOC_IOCTL is not defined, do_fiemap will unconditionally return -EOPNOTSUPP.

h4. File System

There are a number of #ifdef __linux__ guards protecting BTRFS features in places like FileStore::mkfs and FileStore::mount. These are presumably also the places where conditional checks for ZFS features will go. Ideally we can factor these features out, but for the time being the current guards work all right.

h3. Internal Optional Stuffs

Characterized by being non-user-facing, non-critical, and/or optimizations not affecting correctness.

* posix_fadvise
** Gives the OS data usage hints
** Generic strategy: don't do anything
* posix_fallocate
** Pre-allocates the journal to avoid fragmentation
** OSX: fcntl(F_PREALLOCATE)
** Generic strategy: write a few zeros to the end of each file block
* get_process_name
** Linux: prctl(PR_GET_NAME)
** OSX: http://stackoverflow.com/questions/3018054/retrieve-names-of-running-processes
** Generic backup: return "unknown" or override in each executable
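
The three posix_fallocate strategies listed above might be combined as in the following sketch. The function name ceph_preallocate and the HAVE_POSIX_FALLOCATE configure macro are assumptions for illustration. The OSX branch has not been tested here, and the generic branch writes one zero byte at the end of each block past the current end of file so that existing data is never overwritten.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make sure fd is backed by at least len bytes.
 * Returns 0 on success or a negative errno value. */
static int ceph_preallocate(int fd, off_t len)
{
#if defined(__APPLE__)
    /* OSX: reserve space past the physical EOF, then extend the size. */
    struct stat st;
    fstore_t fs = { F_ALLOCATEALL, F_PEOFPOSMODE, 0, len, 0 };
    if (fcntl(fd, F_PREALLOCATE, &fs) == -1)
        return -errno;
    if (fstat(fd, &st) == -1)
        return -errno;
    if (st.st_size < len && ftruncate(fd, len) == -1)
        return -errno;
    return 0;
#elif defined(HAVE_POSIX_FALLOCATE)
    /* posix_fallocate returns the error number instead of setting errno. */
    return -posix_fallocate(fd, 0, len);
#else
    /* Generic: touch the last byte of every block beyond the current end. */
    struct stat st;
    const char zero = 0;
    off_t blk, pos;

    if (fstat(fd, &st) == -1)
        return -errno;
    blk = st.st_blksize > 0 ? (off_t)st.st_blksize : 4096;
    /* Last byte of the block that contains the current end of file. */
    pos = st.st_size - st.st_size % blk + blk - 1;
    for (; pos < len; pos += blk)
        if (pwrite(fd, &zero, 1, pos) != 1)
            return -errno;
    /* Make sure the file size reaches exactly len. */
    if (st.st_size < len && pwrite(fd, &zero, 1, len - 1) != 1)
        return -errno;
    return 0;
#endif
}
```

Whether the generic path really avoids fragmentation depends on the file system's allocator, so it is best treated as a hint rather than a guarantee.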

h3. Types and std::hash

The types uint64_t and int64_t do not seem to be defined in the standard headers on OSX. A backup implementation is provided in src/include/types.h. A backup is also provided for hashing pthread_t.

h3. TEMP_FAILURE_RETRY

This is trivial to reproduce for a new platform. The only due diligence needed is to check the return value/errno semantics of each I/O routine being wrapped to make sure the loop conditional is correct.
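
A fallback definition could look like the sketch below. It relies on GNU statement expressions, so it assumes GCC or clang, and it only suits routines that report failure as -1 with errno set. A routine such as posix_fallocate, which returns the error number directly, must not be wrapped with it. The flaky_io helper exists purely to illustrate the retry behaviour.

```c
#include <errno.h>

/* Fallback for platforms whose libc does not provide the macro.
 * Uses a GNU statement expression (GCC and clang). */
#ifndef TEMP_FAILURE_RETRY
#define TEMP_FAILURE_RETRY(expr)                        \
    ({                                                  \
        long int rc_;                                   \
        do {                                            \
            rc_ = (long int)(expr);                     \
        } while (rc_ == -1L && errno == EINTR);         \
        rc_;                                            \
    })
#endif

/* Illustration only: fails with EINTR twice, then returns 7. */
static int flaky_calls;

static long flaky_io(void)
{
    if (flaky_calls++ < 2) {
        errno = EINTR;
        return -1;
    }
    return 7;
}
```

Wrapping the helper as TEMP_FAILURE_RETRY(flaky_io()) re-evaluates the call until it stops failing with EINTR.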

h3. Thread-local Storage

Despite some indication online that people have gotten __thread to work with clang on newer OSX, this still isn't working. There is a single instance of __thread in use, in rados_sync.cc.

* Switch use of __thread to pthread_[gs]etspecific
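
Replacing a __thread variable with pthread_[gs]etspecific could follow the pattern below. It is a generic sketch: the per-thread int and the name ceph_tls_int are made up for illustration and are not modelled on the variable in rados_sync.cc.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t  tls_key;
static pthread_once_t tls_once = PTHREAD_ONCE_INIT;

/* Create the key once; free() runs on each thread's value at thread exit. */
static void tls_make_key(void)
{
    pthread_key_create(&tls_key, free);
}

/* Hypothetical accessor: returns this thread's private int, allocating
 * it zero-initialized on first use. Returns NULL on failure. */
static int *ceph_tls_int(void)
{
    int *p;

    pthread_once(&tls_once, tls_make_key);
    p = pthread_getspecific(tls_key);
    if (p == NULL) {
        p = calloc(1, sizeof(*p));
        if (p != NULL && pthread_setspecific(tls_key, p) != 0) {
            free(p);
            p = NULL;
        }
    }
    return p;
}
```

Compared with __thread, every access costs a function call, which is acceptable for a single, infrequently used variable.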

h3. Building on Case-insensitive File Systems

Automake will produce intermediate files with identical names for Pipe.cc and pipe.c on a case-insensitive file system.

* Rename pipe.c to something like pipe_cloexec.c

The AUTOMAKE_OPTIONS setting subdir-objects will apparently be required in the future (it places object files in the directory of their respective source file). Since Pipe.cc and pipe.c are in different subdirectories, this issue would go away with this automake option.

h3. Work items

h4. Coding tasks

# Split wip-port into wip-port-upstream (reviewable) and wip-port (remaining parts)
# Get reviews for wip-port-upstream
# Continue to split up the leftovers in wip-port
# Repeat the feedback loop

h4. Build / release tasks

# Create some interesting gitbuilders :)
## FreeBSD, Solaris, OSx86 (shhh!)
## Big-endian versions with QEMU?
# Normal release process with portability changes
# Create an OSX Homebrew formula for some stable release
189
h4. Documentation tasks
190
191
# Task 1
192
# Task 2
193
# Task 3
194
195
h4. Deprecation tasks
196
197
# Task 1
198
# Task 2
199
# Task 3