h1. Tuning for All Flash Deployments

Ceph Tuning and Best Practices for All Flash Intel® Xeon® Servers
Last updated:  January 2017

h2. Table of Contents

# +*[[Tuning_for_All_Flash_Deployments#Introduction|Introduction]]*+
# +*[[Tuning_for_All_Flash_Deployments#Ceph-Storage-Hardware-Guidelines|Ceph Storage Hardware Guidelines]]*+
# +*[[Tuning_for_All_Flash_Deployments#Intel-Tuning-and-Optimization-Recommendations-for-Ceph|Intel Tuning and Optimization Recommendations for Ceph]]*+
## +[[Tuning_for_All_Flash_Deployments#Server-Tuning|Server Tuning]]+
### +[[Tuning_for_All_Flash_Deployments#Ceph-Client-Configuration|Ceph Client Configuration]]+
### +[[Tuning_for_All_Flash_Deployments#Ceph-Storage-Node-NUMA-Tuning|Ceph Storage Node NUMA Tuning]]+
## +[[Tuning_for_All_Flash_Deployments#Memory-Tuning|Memory Tuning]]+
## +[[Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning|NVMe SSD partitioning]]+
## +[[Tuning_for_All_Flash_Deployments#OS-Tuning|OS Tuning]]+ (must be done on all Ceph nodes)
### +[[Tuning_for_All_Flash_Deployments#Kernel-Tuning|Kernel Tuning]]+
### +[[Tuning_for_All_Flash_Deployments#Filesystem-considerations|Filesystem considerations]]+
### +[[Tuning_for_All_Flash_Deployments#Disk-read-ahead|Disk read ahead]]+
### +[[Tuning_for_All_Flash_Deployments#OSD-RADOS|OSD: RADOS]]+
## +[[Tuning_for_All_Flash_Deployments#RBD-Tuning|RBD Tuning]]+
## +[[Tuning_for_All_Flash_Deployments#RGW-Rados-Gateway-Tuning|RGW: Rados Gateway Tuning]]+
## +[[Tuning_for_All_Flash_Deployments#Erasure-Coding-Tuning|Erasure Coding Tuning]]+
# +*[[Tuning_for_All_Flash_Deployments#Appendix|Appendix]]*+
# +*[[Tuning_for_All_Flash_Deployments#Sample-Cephconf|Sample Ceph.conf]]*+
# +*[[Tuning_for_All_Flash_Deployments#Sample-sysctlconf|Sample sysctl.conf]]*+
# +*[[Tuning_for_All_Flash_Deployments#All-NVMe-Ceph-Cluster-Tuning-for-MySQL-workload|All-NVMe Ceph Cluster Tuning for MySQL workload]]*+
## +[[Tuning_for_All_Flash_Deployments#Cephconf|Ceph.conf]]+
## +[[Tuning_for_All_Flash_Deployments#CBT-YAML|CBT YAML]]+
## +[[Tuning_for_All_Flash_Deployments#MySQL-configuration-file-mycnf|MySQL configuration file (my.cnf)]]+
# +*[[Tuning_for_All_Flash_Deployments#Sample-Ceph-Vendor-Solutions|Sample Ceph Vendor Solutions]]*+

h3. Introduction

Ceph is a scalable, open source, software-defined storage offering that runs on commodity hardware. Ceph has been developed from the ground up to deliver object, block, and file system storage in a single software platform that is self-managing and self-healing and has no single point of failure. Because of its highly scalable, software-defined storage architecture, Ceph can be a powerful storage solution to consider.

This document covers Ceph tuning guidelines specifically for all-flash deployments, based on extensive testing by Intel with a variety of system, operating system, and Ceph optimizations to achieve the highest possible performance for servers with Intel® Xeon® processors and Intel® Solid State Drive Data Center (Intel® SSD DC) Series drives. Details of OEM system SKUs and Ceph reference architectures for targeted use cases can be found on the ceph.com website.

h3. Ceph Storage Hardware Guidelines

* *Standard* configuration is ideally suited for throughput-oriented workloads (e.g., analytics, DVR). The Intel® SSD Data Center P3700 series is recommended to achieve the best possible performance while balancing cost.

| CPU | Intel® Xeon® CPU E5-2650v4 or higher |
| Memory | Minimum of 64 GB |
| NIC | 10GbE |
| Disks | 1x 1.6TB P3700 + 12 x 4TB HDDs (1:12 ratio) / P3700 as journal and caching |
| Caching software | Intel Cache Acceleration Software for read caching, option: Intel® Rapid Storage Technology enterprise/MD4.3 |

* *TCO optimized* configuration provides the best possible performance for performance-centric workloads (e.g., database) while optimizing TCO with a mix of SATA and NVMe SSDs.

| CPU | Intel® Xeon® CPU E5-2690v4 or higher |
| Memory | 128 GB or higher |
| NIC | Dual 10GbE |
| Disks | 1x 800GB P3700 + 4x S3510 1.6TB |

* *IOPS optimized* configuration provides the best performance for workloads that demand low latency, using an all-NVMe SSD configuration.

| CPU | Intel® Xeon® CPU E5-2699v4 |
| Memory | 128 GB or higher |
| NIC | 1x 40GbE, 4x 10GbE |
| Disks | 4 x P3700 2TB |

h3. Intel Tuning and Optimization Recommendations for Ceph

h3. Server Tuning

h3. Ceph Client Configuration

In a balanced system configuration, both the client and storage node configurations need to be optimized to get the best possible cluster performance. Care needs to be taken to ensure that the Ceph client node has enough CPU bandwidth to achieve optimum performance. The graph below shows end-to-end performance for different client CPU configurations for a block workload.

!1-cpu-cores-client.png!
Figure 1: Client CPU cores and Ceph cluster impact

h3. Ceph Storage Node NUMA Tuning

To keep latency low, it is important to minimize inter-socket communication between NUMA nodes so that client IO is serviced as fast as possible. Based on an extensive set of experiments conducted at Intel, it is recommended to pin Ceph OSD processes to the same CPU socket that has the NVMe SSDs, HBAs, and NIC devices attached.

!2-numa-mode-config.png!
Figure 2: NUMA node configuration and OSD assignment

> *_Ceph startup scripts need to be changed with setaffinity="numactl --membind=0 --cpunodebind=0"_*

The performance data below shows the best possible cluster throughput and lower latency when Ceph OSDs are partitioned by CPU socket, so that each OSD manages the media connected to its local CPU socket and network IO does not cross the QPI link.

!3-numa-node-perf-vs-default-sys-config.png!
!3.1-table.png!
Figure 3: NUMA node performance compared to default system configuration

h3. Memory Tuning

Ceph default packages use tcmalloc. For flash-optimized configurations, we found that jemalloc provides the best possible performance without performance degradation over time. Ceph supports jemalloc for the Hammer release and later releases, but it needs to be built with the jemalloc option enabled.
The graph in Figure 4 shows how thread cache size impacts throughput. With a tuned thread cache size, performance is comparable between TCMalloc and JEMalloc. However, as shown in Figure 5 and Figure 6, TCMalloc performance degrades over time, unlike JEMalloc.

!4-thread-cache-size-impact-over-perf.png!
Figure 4: Thread cache size impact on performance

!5.0-tcmalloc-over-time.png!
!5.1-tcmalloc-over-time.png!
Figure 5: TCMalloc performance in a running cluster over time

!6.0-jemalloc-over-time.png!
!6.1-jemalloc-over-time.png!
Figure 6: JEMalloc performance in a running cluster over time

h3. NVMe SSD partitioning

It is not possible to take full advantage of NVMe SSD bandwidth with a single OSD. Four is the optimum number of partitions per SSD drive that gives the best possible performance.
!7.0-ceph-osd-latency-with-different-ssd-partitions.png!
Figure 7: Ceph OSD latency with different SSD partitions

!8-cpu-utilization-with-different-num-of-ssd-partitions.png!
Figure 8: CPU utilization with different # of SSD partitions

h3. OS Tuning

*(must be done on all Ceph nodes)*

h3. Kernel Tuning

1. Modify system control in /etc/sysctl.conf
<pre>
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# disable TIME_WAIT.. wait ..
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 0

# double amount of allowed conntrack
net.netfilter.nf_conntrack_max = 2621440
net.netfilter.nf_conntrack_tcp_timeout_established = 1800

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536

# Controls the default maximum size of a message queue
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
</pre>
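The settings above can then be applied without a reboot, for example:
<pre>sysctl -p /etc/sysctl.conf</pre>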
2. IP jumbo frames
If your switch supports jumbo frames, then the larger MTU size is helpful. Our tests showed that a 9000-byte MTU improves sequential read/write performance.
3. Set the Linux disk scheduler to cfq
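For example (the device name is an assumption; repeat for each Ceph data disk):
<pre>echo cfq > /sys/block/sda/queue/scheduler</pre>
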
h3. Filesystem considerations

Ceph is designed to be mostly filesystem agnostic; the only requirement is that the filesystem supports extended attributes (xattrs). Ceph OSDs depend on the extended attributes (XATTRs) of the underlying file system for internal object state, snapshot metadata, RGW access control lists, and so on. Currently XFS is the recommended file system. We recommend using a big inode size (the default inode size is 256 bytes) when creating the file system:

<pre>mkfs.xfs -i size=2048 /dev/sda1</pre>

Setting the inode size is important, as XFS stores xattr data in the inode. If the metadata is too large to fit in the inode, a new extent is created, which can cause quite a performance problem. Upping the inode size to 2048 bytes provides enough room to write the default metadata, plus a little headroom.
The following example mount options are recommended when using XFS:
<pre>mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sda1 /var/lib/ceph/osd/ceph-0</pre>
The following are specific recommendations for Intel SSDs and Ceph:

<pre>mkfs.xfs -f -K -i size=2048 -s size=4096 /dev/md0
/bin/mount -o noatime,nodiratime,nobarrier /dev/md0 /data/mysql</pre>

h3. Disk read ahead

Read ahead is the file prefetching technology used in the Linux operating system. It is a system call that loads a file's contents into the page cache. When a file is subsequently accessed, its contents are read from physical memory rather than from disk, which is much faster.

<pre>echo 2048 > /sys/block/${disk}/queue/read_ahead_kb   # default is 128</pre>

|. Per disk performance |. read_ahead_kb = 128 |. read_ahead_kb = 512 |. % |
| Sequential Read (MB/s) | 1232 | 3251 | +163% |

* 6-node Ceph cluster, each node with 20 OSDs (750 GB, 7200 RPM, 2.5'' HDDs)

h3. OSD: RADOS

Tuning has a significant performance impact on a Ceph storage system; there are hundreds of tuning knobs. Here we introduce some of the most important tuning settings.

# Large PG/PGP number (since Cuttlefish)
We find that using a large PG number per OSD (>200) improves performance. It also eases the data distribution imbalance issue.
<pre>(default to 8)
ceph osd pool create testpool 8192 8192</pre>
# omap data on separate disks (since Giant)
Mounting the omap directory on a separate SSD improves random write performance. In our testing we saw a ~20% performance improvement.
# objecter_inflight_ops/objecter_inflight_op_bytes (since Cuttlefish)
These throttles tell the objecter, which is responsible for sending requests to the OSDs, to limit outgoing ops according to its budget. We recommend tweaking these parameters to 10x their defaults.
<pre>(default to 1024/1024*1024*100)
objecter_inflight_ops = 10240
objecter_inflight_op_bytes = 1048576000</pre>
# ms_dispatch_throttle_bytes (since Cuttlefish)
ms_dispatch_throttle_bytes throttles the dispatch message size for the simple messenger. We recommend tweaking this parameter to 10x its default.
<pre>ms_dispatch_throttle_bytes = 1048576000</pre>
# journal_queue_max_bytes/journal_queue_max_ops (since Cuttlefish)
These throttles limit the inflight ops for the journal. If the journal does not get enough budget for the current op, it will block the OSD op thread. We recommend tweaking these parameters to 10x their defaults.
<pre>journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000</pre>
# filestore_queue_max_ops/filestore_queue_max_bytes (since Cuttlefish)
These throttles limit the inflight ops for the filestore. They are checked before ops are sent to the journal, so if the filestore does not get enough budget for the current op, the OSD op thread will be blocked. We recommend tweaking these parameters to 10x their defaults.
<pre>filestore_queue_max_ops=5000
filestore_queue_max_bytes = 1048576000</pre>
# filestore_op_threads controls the number of filesystem operation threads that execute in parallel
If the storage backend is fast enough and has enough queues to support parallel operations, it’s recommended to increase this parameter, given there is enough CPU head room.
<pre>filestore_op_threads=6</pre>
# journal_max_write_entries/journal_max_write_bytes (since Cuttlefish)
These throttles limit the ops and bytes for every journal write. Tweaking these two parameters may be helpful for small writes. We recommend tweaking them to 10x their defaults.
<pre>journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000</pre>
# osd_op_num_threads_per_shard/osd_op_num_shards (since Firefly)
osd_op_num_shards sets the number of queues used to cache requests, and osd_op_num_threads_per_shard is the number of threads for each queue; how to adjust these two parameters depends on the cluster. After several performance tests with different settings, we concluded that the default parameters provide the best performance.
# filestore_max_sync_interval (since Cuttlefish)
filestore_max_sync_interval controls the interval at which the sync thread flushes data from memory to disk. By default the filestore writes data to memory, and the sync thread is responsible for flushing it to disk; journal entries can then be trimmed. Note that a large filestore_max_sync_interval can cause performance spikes. We recommend setting this parameter to 10 seconds.
<pre>filestore_max_sync_interval = 10</pre>
# ms_crc_data/ms_crc_header (since Cuttlefish)
Disabling CRC computation for the simple messenger can reduce CPU utilization.
# filestore_fd_cache_shards/filestore_fd_cache_size (since Firefly)
The filestore FD cache maps object names to file descriptors; filestore_fd_cache_shards sets the number of LRU caches and filestore_fd_cache_size sets the cache size. Tweaking these two parameters may reduce FD lookup time.
# Set debug level to 0 (since Cuttlefish)
For an all-SSD Ceph cluster, setting the debug level of each subsystem to 0 will improve performance.
<pre>debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0</pre>

h3. RBD Tuning

To help achieve low latency at the RBD layer, we suggest the following, in addition to the CERN tuning referenced on ceph.com.
# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null
# Start each ceph-osd in a dedicated cgroup, with dedicated CPU cores that are free from any other load, even kernel load such as network interrupts (see the cpuset sketch after this list).
# Increase "filestore_omap_header_cache_size" / "filestore_fd_cache_size" for better caching (16 MB for each 500 GB of storage).
For the disk entry in libvirt, put the addresses of all three Ceph monitors.
h3. RGW: Rados Gateway Tuning

# Disable usage/access log (since Cuttlefish)
<pre>rgw enable ops log = false
rgw enable usage log = false
log file = /dev/null</pre>
We find that disabling the usage/access log improves performance.
# Using large cache size (since Cuttlefish)
<pre>rgw cache enabled = true
rgw cache lru size = 100000</pre>
Caching the hot objects improves the GET performance.
# Using larger PG split/merge values (since Firefly)
<pre>filestore_merge_threshold = 500
filestore_split_multiple = 100</pre>
We find that PG split/merge introduces a big overhead. Using a large value postpones the split/merge behavior. This helps the case where lots of small files are stored in the cluster.
# Using load balancer with multiple RGW instances (since Cuttlefish)
!9-load-balancer-with-multi-rgw-instances.png!
We've found that RGW has some scalability issues at present. With a single RGW instance the performance is poor. Running multiple RGW instances behind a load balancer (e.g., HAProxy) greatly improves throughput.
# Increase the number of Rados handlers (since Hammer)
Since Hammer it is possible to use multiple RADOS handlers per RGW instance. Increasing this value should improve performance.
# Using Civetweb frontend (since Giant)
Before Giant, Apache + libfastcgi was the recommended configuration. However, libfastcgi still uses the very old 'select' mode, which was not able to handle a large amount of concurrent IO in our testing. Using the Civetweb frontend helps improve stability.
<pre>rgw frontends = civetweb port=80</pre>
# Moving bucket index to SSD (since Giant)
Bucket index updates may become a bottleneck if there are millions of objects in a single bucket. We find that moving the bucket index to SSD storage improves performance.
# Bucket Index Sharding (since Hammer)
We find that bucket index sharding helps when there is a large number of objects inside one bucket. However, the index listing speed may be impacted.
h3. Erasure Coding Tuning

# Use larger stripe width
The default erasure code stripe width (4K) is not optimal. We find that using a bigger value (64K) reduces CPU utilization significantly (10%+).
<pre>osd_pool_erasure_code_stripe_width = 65536</pre>
# Use mid-sized K
For the erasure code algorithms, we find that a mid-sized K value brings balanced results between throughput and CPU utilization. We recommend using 10+4 or 8+2 mode.
h3. Appendix

h3. Sample Ceph.conf

<pre>[global]
fsid = 35b08d01-b688-4b9a-947b-bc2e25719370
mon_initial_members = gw2
mon_host = 10.10.10.105
filestore_xattr_use_omap = true
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[mon]
mon_pg_warn_max_per_osd=5000
mon_max_pool_pg_num=106496
[client]
rbd cache = false
[osd]
osd mkfs type = xfs
osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k,delaylog
osd mkfs options xfs = -f -i size=2048
filestore_queue_max_ops=5000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 8
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
public_network = 10.10.10.100/24
cluster_network = 10.10.10.100/24

[client.radosgw.gw2-1]
host = gw2
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw cache enabled = true
rgw cache lru size = 100000
rgw socket path = /var/run/ceph/ceph.client.radosgw.gw2-1.fastcgi.sock
rgw thread pool size = 256
rgw enable ops log = false
rgw enable usage log = false
log file = /dev/null
rgw frontends =civetweb port=80
rgw override bucket index max shards = 8</pre>

h3. Sample sysctl.conf

<pre>fs.file-max = 6553600
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 819200
net.ipv4.tcp_keepalive_time = 20
kernel.msgmni = 2878
kernel.sem = 256 32000 100 142
kernel.shmmni = 4096
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 1048576
net.core.wmem_max = 1048576
net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_tw_buckets = 10000</pre>

h3. All-NVMe Ceph Cluster Tuning for MySQL workload

h3. Ceph.conf

<pre>[global]
        enable experimental unrecoverable data corrupting features = bluestore rocksdb
        osd objectstore = bluestore
        ms_type = async
        rbd readahead disable after bytes = 0
        rbd readahead max bytes = 4194304
        bluestore default buffered read = true
        auth client required = none
        auth cluster required = none
        auth service required = none
        filestore xattr use omap = true
        cluster network = 192.168.142.0/24, 192.168.143.0/24
        private network = 192.168.144.0/24, 192.168.145.0/24
        log file = /var/log/ceph/$name.log
        log to syslog = false
        mon compact on trim = false
        osd pg bits = 8
        osd pgp bits = 8
        mon pg warn max object skew = 100000
        mon pg warn min per osd = 0
        mon pg warn max per osd = 32768
        debug_lockdep = 0/0
        debug_context = 0/0
        debug_crush = 0/0
        debug_buffer = 0/0
        debug_timer = 0/0
        debug_filer = 0/0
        debug_objecter = 0/0
        debug_rados = 0/0
        debug_rbd = 0/0
        debug_ms = 0/0
        debug_monc = 0/0
        debug_tp = 0/0
        debug_auth = 0/0
        debug_finisher = 0/0
        debug_heartbeatmap = 0/0
        debug_perfcounter = 0/0
        debug_asok = 0/0
        debug_throttle = 0/0
        debug_mon = 0/0
        debug_paxos = 0/0
        debug_rgw = 0/0
        perf = true
        mutex_perf_counter = true
        throttler_perf_counter = false
        rbd cache = false
[mon]
        mon data =/home/bmpa/tmp_cbt/ceph/mon.$id
        mon_max_pool_pg_num=166496
        mon_osd_max_split_count = 10000
        mon_pg_warn_max_per_osd = 10000
[mon.a]
        host = ft02
        mon addr = 192.168.142.202:6789
[osd]
        osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
        osd_mkfs_options_xfs = -f -i size=2048
        osd_op_threads = 32
        filestore_queue_max_ops=5000
        filestore_queue_committing_max_ops=5000
        journal_max_write_entries=1000
        journal_queue_max_ops=3000
        objecter_inflight_ops=102400
        filestore_wbthrottle_enable=false
        filestore_queue_max_bytes=1048576000
        filestore_queue_committing_max_bytes=1048576000
        journal_max_write_bytes=1048576000
        journal_queue_max_bytes=1048576000
        ms_dispatch_throttle_bytes=1048576000
        objecter_inflight_op_bytes=1048576000
        osd_mkfs_type = xfs
        filestore_max_sync_interval=10
        osd_client_message_size_cap = 0
        osd_client_message_cap = 0
        osd_enable_op_tracker = false
        filestore_fd_cache_size = 64
        filestore_fd_cache_shards = 32
        filestore_op_threads = 6</pre>

h3. CBT YAML

<pre>cluster:
  user: "bmpa"
  head: "ft01"
  clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
  osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
  mons:
   ft02:
     a: "192.168.142.202:6789"
  osds_per_node: 16
  fs: xfs
  mkfs_opts: '-f -i size=2048 -n size=64k'
  mount_opts: '-o inode64,noatime,logbsize=256k'
  conf_file: '/home/bmpa/cbt/ceph.conf'
  use_existing: False
  newstore_block: True
  rebuild_every_test: False
  clusterid: "ceph"
  iterations: 1
  tmp_dir: "/home/bmpa/tmp_cbt"
  pool_profiles:
    2rep:
      pg_size: 8192
      pgp_size: 8192
      replication: 2
benchmarks:
  librbdfio:
    time: 300
    ramp: 300
    vol_size: 10
    mode: ['randrw']
    rwmixread: [0,70,100]
    op_size: [4096]
    procs_per_volume: [1]
    volumes_per_client: [10]
    use_existing_volumes: False
    iodepth: [4,8,16,32,64,128]
    osd_ra: [4096]
    norandommap: True
    cmd_path: '/usr/local/bin/fio'
    pool_profile: '2rep'
    log_avg_msec: 250</pre>

h3. MySQL configuration file (my.cnf)

<pre>[client]
port            = 3306
socket          = /var/run/mysqld/mysqld.sock
[mysqld_safe]
socket          = /var/run/mysqld/mysqld.sock
nice            = 0
[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
datadir         = /data
basedir         = /usr
tmpdir          = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address            = 0.0.0.0
max_allowed_packet      = 16M
thread_stack            = 192K
thread_cache_size       = 8
query_cache_limit       = 1M
query_cache_size        = 16M
log_error = /var/log/mysql/error.log
expire_logs_days        = 10
max_binlog_size         = 100M
performance_schema=off
innodb_buffer_pool_size = 25G
innodb_flush_method = O_DIRECT
innodb_log_file_size=4G
thread_cache_size=16
innodb_file_per_table
innodb_checksums = 0
innodb_flush_log_at_trx_commit = 0
innodb_write_io_threads = 8
innodb_page_cleaners= 16
innodb_read_io_threads = 8
max_connections = 50000
[mysqldump]
quick
quote-names
max_allowed_packet      = 16M
[mysql]
!includedir /etc/mysql/conf.d/</pre>

h3. Sample Ceph Vendor Solutions

The following are pointers to Ceph solutions, but this list is not comprehensive:
* https://www.dell.com/learn/us/en/04/shared-content~data-sheets~en/documents~dell-red-hat-cloud-solutions.pdf 
* http://www.fujitsu.com/global/products/computing/storage/eternus-cd/
* http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-2799ENW.pdf http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-8638ENW.pdf 
* http://www.supermicro.com/solutions/storage_ceph.cfm 
* https://www.thomas-krenn.com/en/products/storage-systems/suse-enterprise-storage.html
* http://www.qct.io/Solution/Software-Defined-Infrastructure/Storage-Virtualization/QCT-and-Red-Hat-Ceph-Storage-p365c225c226c230

*Notices:*
Copyright © 2016 Intel Corporation. All rights reserved
Intel, the Intel logo, Intel Atom, Intel Core, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Intel® Hyper-Threading Technology available on select Intel® Core™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. 
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. 
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. 
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.