h1. Tuning for All Flash Deployments
Ceph Tuning and Best Practices for All Flash Intel® Xeon® Servers

Last updated: January 2017
h2. Table of Contents
+*[[Tuning_for_All_Flash_Deployments#Introduction|Introduction]]*+
+*[[Tuning_for_All_Flash_Deployments#Ceph-Storage-Hardware-Guidelines|Ceph Storage Hardware Guidelines]]*+
+*[[Tuning_for_All_Flash_Deployments#Intel-Tuning-and-Optimization-Recommendations-for-Ceph|Intel Tuning and Optimization Recommendations for Ceph]]*+
> +[[Tuning_for_All_Flash_Deployments#Server-Tuning|Server Tuning]]+
> > +[[Tuning_for_All_Flash_Deployments#Ceph-Client-Configuration|Ceph Client Configuration]]+
> > +[[Tuning_for_All_Flash_Deployments#Ceph-Storage-Node-NUMA-Tuning|Ceph Storage Node NUMA Tuning]]+
> +[[Tuning_for_All_Flash_Deployments#Memory-Tuning|Memory Tuning]]+
> +[[Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning|NVMe SSD partitioning]]+
> +[[Tuning_for_All_Flash_Deployments#OS-Tuning|OS Tuning]] (must be done on all Ceph nodes)+
> > +[[Tuning_for_All_Flash_Deployments#Kernel-Tuning|Kernel Tuning]]+
> > +[[Tuning_for_All_Flash_Deployments#Filesystem-considerations|Filesystem considerations]]+
> > +[[Tuning_for_All_Flash_Deployments#Disk-read-ahead|Disk read ahead]]+
> > +[[Tuning_for_All_Flash_Deployments#OSD-RADOS|OSD: RADOS]]+
> +[[Tuning_for_All_Flash_Deployments#RBD-Tuning|RBD Tuning]]+
> +[[Tuning_for_All_Flash_Deployments#RGW-Rados-Gateway-Tuning|RGW: Rados Gateway Tuning]]+
> +[[Tuning_for_All_Flash_Deployments#Erasure-Coding-Tuning|Erasure Coding Tuning]]+
+*[[Tuning_for_All_Flash_Deployments#Appendix|Appendix]]*+
+*[[Tuning_for_All_Flash_Deployments#Sample-Ceph-conf|Sample Ceph.conf]]*+
+*[[Tuning_for_All_Flash_Deployments#Sample-sysctl-conf|Sample sysctl.conf]]*+
+*[[Tuning_for_All_Flash_Deployments#All-NVMe-Ceph-Cluster-Tuning-for-MySQL-workload|All-NVMe Ceph Cluster Tuning for MySQL workload]]*+
> +[[Tuning_for_All_Flash_Deployments#Ceph-conf|Ceph.conf]]+
> +[[Tuning_for_All_Flash_Deployments#CBT-YAML|CBT YAML]]+
> +[[Tuning_for_All_Flash_Deployments#MySQL-configuration-file|MySQL configuration file]] (my.cnf)+
+*[[Tuning_for_All_Flash_Deployments#Sample-Ceph-Vendor-Solutions|Sample Ceph Vendor Solutions]]*+
h3. Introduction 
Ceph is a scalable, open source, software-defined storage offering that runs on commodity hardware. Ceph has been developed from the ground up to deliver object, block, and file system storage in a single software platform that is self-managing, self-healing, and has no single point of failure. Because of its highly scalable, software-defined storage architecture, Ceph can be a powerful storage solution to consider.

This document covers Ceph tuning guidelines specifically for all-flash deployments, based on extensive testing by Intel with a variety of system, operating system, and Ceph optimizations to achieve the highest possible performance for servers with Intel® Xeon® processors and Intel® Solid State Drive Data Center (Intel® SSD DC) Series. Details of OEM system SKUs and Ceph reference architectures for targeted use cases can be found on the ceph.com website.
h3. Ceph Storage Hardware Guidelines  
* *Standard* configuration is ideally suited for throughput-oriented workloads (e.g., analytics, DVR). The Intel® SSD Data Center P3700 series is recommended to achieve the best possible performance while balancing cost.

| CPU | Intel® Xeon® CPU E5-2650v4 or higher |
| Memory | Minimum of 64 GB |
| NIC | 10GbE |
| Disks | 1x 1.6TB P3700 + 12x 4TB HDDs (1:12 ratio) / P3700 as journal and caching |
| Caching software | Intel Cache Acceleration Software for read caching; option: Intel® Rapid Storage Technology enterprise/MD4.3 |

* *TCO optimized* configuration provides the best possible performance for performance-centric workloads (e.g., database) while achieving a better TCO with a mix of SATA SSDs and NVMe SSDs.

| CPU | Intel® Xeon® CPU E5-2690v4 or higher |
| Memory | 128 GB or higher |
| NIC | Dual 10GbE |
| Disks | 1x 800GB P3700 + 4x S3510 1.6TB |

* *IOPS optimized* configuration provides the best performance for workloads that demand low latency, using an all-NVMe SSD configuration.

| CPU | Intel® Xeon® CPU E5-2699v4 |
| Memory | 128 GB or higher |
| NIC | 1x 40GbE, 4x 10GbE |
| Disks | 4x P3700 2TB |
h1. Intel Tuning and Optimization Recommendations for Ceph

h2. Server Tuning

h3. Ceph Client Configuration
In a balanced system configuration, both the client and storage node configurations need to be optimized to get the best possible cluster performance. Care needs to be taken to ensure that the Ceph client node has enough CPU bandwidth to achieve optimum performance. The graph below shows end-to-end performance for different client CPU configurations with a block workload.
!1-cpu-cores-client.png!
Figure 1: Client CPU cores and Ceph cluster impact
h3. Ceph Storage Node NUMA Tuning
To service client IO as fast as possible and avoid latency penalties, it is important to minimize inter-socket communication between NUMA nodes. Based on an extensive set of experiments conducted at Intel, it is recommended to pin Ceph OSD processes to the same CPU socket that has the NVMe SSDs, HBAs, and NIC devices attached.
!2-numa-mode-config.png! 
Figure 2: NUMA node configuration and OSD assignment
> *_Ceph startup scripts need to be changed to set affinity, e.g. setaffinity="numactl --membind=0 --cpunodebind=0"_*
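As a minimal sketch, device NUMA locality can be checked through sysfs and an OSD can be started under numactl; the device names, OSD id, and node number below are placeholders:

# check which NUMA node an NVMe device and a NIC are attached to (placeholder device names)
cat /sys/block/nvme0n1/device/numa_node
cat /sys/class/net/eth0/device/numa_node
# start the OSD with its memory and CPUs bound to that node
numactl --membind=0 --cpunodebind=0 /usr/bin/ceph-osd -i 0 --cluster ceph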
The performance data below shows the best possible cluster throughput and lower latency when Ceph OSDs are partitioned by CPU socket to manage the media connected to the local CPU socket, with network IO not crossing the QPI link.
!3-numa-node-perf-vs-default-sys-config.png!
!3.1-table.png!
Figure 3: NUMA node performance compared to default system configuration
h3. Memory Tuning
Ceph default packages use tcmalloc. For flash-optimized configurations, we found that jemalloc provides the best possible performance without degradation over time. Ceph supports jemalloc for the Hammer release and later, but it must be built with the jemalloc option enabled.
The graph in Figure 4 shows how the thread cache size impacts throughput. By tuning the thread cache size, TCMalloc performance becomes comparable to JEMalloc. However, as shown in Figures 5 and 6, TCMalloc performance degrades over time, unlike JEMalloc.
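As a sketch, one way to experiment with the TCMalloc thread cache is the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable in the environment the Ceph daemons start from (the value and the /etc/sysconfig/ceph location are assumptions; adjust for your distribution):

# example only: 128 MB TCMalloc thread cache for Ceph daemons (assumed value)
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728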
!4-thread-cache-size-impact-over-perf.png!
Figure 4: Thread cache size impact over performance
!5.0-tcmalloc-over-time.png!
!5.1-tcmalloc-over-time.png!
Figure 5: TCMalloc performance in a running cluster over time
!6.0-jemalloc-over-time.png!
!6.1-jemalloc-over-time.png!
Figure 6: JEMalloc performance in a running cluster over time
h3. NVMe SSD partitioning
It is not possible to take full advantage of NVMe SSD bandwidth with a single OSD. Four is the optimum number of partitions per SSD drive that gives the best possible performance (see the partitioning sketch below).
Figure 7: Ceph OSD latency with different SSD partitions
Figure 8: CPU utilization with different number of SSD partitions
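A partitioning sketch for a single NVMe device, following the four-partition recommendation above; the device name and partition boundaries are placeholders:

# create a GPT label and four equal partitions on a placeholder device
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart osd-1 0% 25%
parted -s /dev/nvme0n1 mkpart osd-2 25% 50%
parted -s /dev/nvme0n1 mkpart osd-3 50% 75%
parted -s /dev/nvme0n1 mkpart osd-4 75% 100%
Each partition can then be used by its own OSD.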
h3. OS Tuning (must be done on all Ceph nodes)

h3. Kernel Tuning
1. Modify system control in /etc/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# disable TIME_WAIT.. wait ..
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 0

# double amount of allowed conntrack
net.netfilter.nf_conntrack_max = 2621440
net.netfilter.nf_conntrack_tcp_timeout_established = 1800

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536

# Controls the default maximum size of a message queue
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
2. IP jumbo frames
If your switch supports jumbo frames, a larger MTU size is helpful. Our tests showed that a 9000-byte MTU improves sequential read/write performance.
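For example, assuming an interface named eth0 (make the change persistent in your distribution's network configuration, and make sure the switch ports are configured for jumbo frames):

ip link set dev eth0 mtu 9000
ip link show eth0 | grep mtu    # verify the new MTU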
3. Set the Linux disk scheduler to cfq
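For example, assuming /dev/sda and a kernel where the cfq scheduler is available (use a udev rule or boot parameter to make it persistent):

echo cfq > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler    # the active scheduler is shown in brackets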
h3. Filesystem considerations
Ceph is designed to be mostly filesystem agnostic; the only requirement is that the filesystem supports extended attributes (xattrs). Ceph OSDs depend on the extended attributes (XATTRs) of the underlying file system for: a) internal object state, b) snapshot metadata, c) RGW access control lists, etc. Currently XFS is the recommended file system. We recommend using a big inode size (the default inode size is 256 bytes) when creating the file system:
mkfs.xfs -i size=2048 /dev/sda1
Setting the inode size is important, as XFS stores xattr data in the inode. If the metadata is too large to fit in the inode, a new extent is created, which can cause quite a performance problem. Upping the inode size to 2048 bytes provides enough room to write the default metadata, plus a little headroom.
The following example mount options are recommended when using XFS:
mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sda1 /var/lib/ceph/osd/ceph-0
The following are specific recommendations for Intel SSD and Ceph.
mkfs.xfs -f -K -i size=2048 -s size=4096 /dev/md0
/bin/mount -o noatime,nodiratime,nobarrier /dev/md0 /data/mysql
h3. Disk read ahead
Read_ahead is the file prefetching technology used in the Linux operating system. It is a system call that loads a file's contents into the page cache. When a file is subsequently accessed, its contents are read from physical memory rather than from disk, which is much faster.
echo 2048 > /sys/block/${disk}/queue/read_ahead_kb  (default 128)
Per-disk performance with different read_ahead_kb values:

| | read_ahead_kb = 128 (default) | read_ahead_kb = 512 | Improvement |
| Sequential Read (MB/s) | 1232 | 3251 | +163% |

* 6-node Ceph cluster, each node with 20 OSDs (750 GB, 7200 RPM, 2.5" HDD)
h3. OSD: RADOS

Tuning has a significant performance impact on a Ceph storage system; there are hundreds of tuning knobs in Ceph. We will introduce some of the most important tuning settings.

1. Large PG/PGP number (since Cuttlefish)
We find that using a large PG number per OSD (>200) improves performance. It also eases the data distribution imbalance issue.
(default: 8)
ceph osd pool create testpool 8192 8192
2. omap data on separate disks (since Giant)
Mounting the omap directory on a separate SSD improves random write performance. In our testing we saw a ~20% performance improvement.
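One possible sketch for a FileStore OSD, assuming a spare SSD partition /dev/nvme0n1p1 and OSD id 0 (paths and names are placeholders; do this with the OSD stopped):

mkfs.xfs /dev/nvme0n1p1
mkdir -p /var/lib/ceph/osd/ceph-0-omap
mount /dev/nvme0n1p1 /var/lib/ceph/osd/ceph-0-omap
# relocate the omap directory to the SSD and leave a symlink behind
mv /var/lib/ceph/osd/ceph-0/current/omap/* /var/lib/ceph/osd/ceph-0-omap/
rm -rf /var/lib/ceph/osd/ceph-0/current/omap
ln -s /var/lib/ceph/osd/ceph-0-omap /var/lib/ceph/osd/ceph-0/current/omap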
3. objecter_inflight_ops/objecter_inflight_op_bytes (since Cuttlefish)
objecter_inflight_ops/objecter_inflight_op_bytes tell the objecter to throttle outgoing ops according to its budget; the objecter is responsible for sending requests to the OSDs. By default, tweak these parameters to 10x.
(defaults: 1024 / 1024*1024*100)
objecter_inflight_ops = 10240
objecter_inflight_op_bytes = 1048576000

4. ms_dispatch_throttle_bytes (since Cuttlefish)
ms_dispatch_throttle_bytes throttles the dispatch message size for the simple messenger; by default tweak this parameter to 10x.
ms_dispatch_throttle_bytes = 1048576000

5. journal_queue_max_bytes/journal_queue_max_ops (since Cuttlefish)
journal_queue_max_bytes/journal_queue_max_ops throttle inflight ops for the journal. If the journal does not get enough budget for the current op, it will block the OSD op thread; by default tweak these parameters to 10x.
journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000

6. filestore_queue_max_ops/filestore_queue_max_bytes (since Cuttlefish)
filestore_queue_max_ops/filestore_queue_max_bytes throttle inflight ops for the filestore. These throttles are checked before sending ops to the journal, so if the filestore does not get enough budget for the current op, the OSD op thread will be blocked; by default tweak these parameters to 10x.
filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1048576000

7. filestore_op_threads controls the number of filesystem operation threads that execute in parallel
If the storage backend is fast enough and has enough queues to support parallel operations, it is recommended to increase this parameter, given there is enough CPU headroom.
filestore_op_threads = 6

8. journal_max_write_entries/journal_max_write_bytes (since Cuttlefish)
journal_max_write_entries/journal_max_write_bytes throttle the ops or bytes for every journal write. Tweaking these two parameters may be helpful for small writes; by default tweak these two parameters to 10x.
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000

9. osd_op_num_threads_per_shard/osd_op_num_shards (since Firefly)
osd_op_num_shards sets the number of queues used to cache requests, and osd_op_num_threads_per_shard is the number of threads for each queue; adjusting these two parameters depends on the cluster.
After several performance tests with different settings, we concluded that the default parameters provide the best performance.

10. filestore_max_sync_interval (since Cuttlefish)
filestore_max_sync_interval controls the interval at which the sync thread flushes data from memory to disk. By default the filestore writes data to memory and the sync thread is responsible for flushing it to disk; journal entries can then be trimmed. Note that a large filestore_max_sync_interval can cause performance spikes. By default tweak this parameter to 10 seconds.
filestore_max_sync_interval = 10
11. ms_crc_data/ms_crc_header (since Cuttlefish)
Disabling CRC computation for the simple messenger can reduce CPU utilization.
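A sketch of the corresponding ceph.conf settings (derived from the option names above):

ms_crc_data = false
ms_crc_header = false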
12. filestore_fd_cache_shards/filestore_fd_cache_size (since Firefly)
The filestore cache is a map from object name to file descriptor. filestore_fd_cache_shards sets the number of LRU caches and filestore_fd_cache_size is the cache size; tweaking these two parameters may reduce fd lookup time.
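For example, the values used in the all-NVMe sample configuration in the appendix of this document:

filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32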
13. Set debug level to 0 (since Cuttlefish)
For an all-SSD Ceph cluster, setting the debug level for each subsystem to 0 will improve performance.
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
h3. RBD Tuning

To help achieve low latency on the RBD layer, we suggest the following, in addition to the CERN tuning referenced on ceph.com.
1) echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor /dev/null
2) Start each ceph-osd in a dedicated cgroup with dedicated CPU cores (which should be free from any other load, even kernel load such as network interrupts).
3) Increase "filestore_omap_header_cache_size" and "filestore_fd_cache_size" for better caching (16 MB for each 500 GB of storage).
For the disk entry in libvirt, put the addresses of all three Ceph monitors, as in the sketch below.
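A sketch of such a libvirt disk entry (pool/image name, secret UUID, and monitor hostnames are placeholders):

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>  <!-- placeholder -->
  </auth>
  <source protocol='rbd' name='rbd/vm-disk-1'>  <!-- placeholder pool/image -->
    <host name='mon1.example.com' port='6789'/>
    <host name='mon2.example.com' port='6789'/>
    <host name='mon3.example.com' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>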
h3. RGW: Rados Gateway Tuning

1. Disable usage/access log (since Cuttlefish)
rgw enable ops log = false
rgw enable usage log = false
log file = /dev/null
We find that disabling the usage/access log improves performance.

2. Using large cache size (since Cuttlefish)
rgw cache enabled = true
rgw cache lru size = 100000
Caching the hot objects improves the GET performance.
3. Using larger PG split/merge values (since Firefly)
filestore_merge_threshold = 500
filestore_split_multiple = 100
We find that PG split/merge introduces a big overhead. Using large values postpones the split/merge behavior. This helps the case where lots of small files are stored in the cluster.

4. Using a load balancer with multiple RGW instances (since Cuttlefish)
We have found that RGW has some scalability issues at present. With a single RGW instance the performance is poor. Running multiple RGW instances behind a load balancer (e.g., HAProxy) greatly improves throughput.

5. Increase the number of RADOS handles (since Hammer)
Since Hammer it is possible to use multiple RADOS handles per RGW instance. Increasing this value should improve performance.
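As a hedged example (the exact option name and a sensible value depend on the RGW version in use):

rgw num rados handles = 8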
6. Using the Civetweb frontend (since Giant)
Before Giant, Apache + libfastcgi was the recommended configuration. However, libfastcgi still uses the very old 'select' mode, which was not able to handle a large amount of concurrent IO in our testing. Using the Civetweb frontend helps to improve stability.
rgw frontends = civetweb port=80

7. Moving the bucket index to SSD (since Giant)
Bucket index updates may become a bottleneck if there are millions of objects in a single bucket. We find that moving the bucket index to SSD storage improves performance.

8. Bucket Index Sharding (since Hammer)
We find that sharding the bucket index helps when there is a large number of objects inside one bucket. However, index listing speed may be impacted.
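For example, the sample ceph.conf in the appendix of this document sets:

rgw override bucket index max shards = 8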
h3. Erasure Coding Tuning

1. Use a larger stripe width
The default erasure code stripe size (4K) is not optimal. We find that using a bigger value (64K) reduces CPU utilization significantly (10%+).
osd_pool_erasure_code_stripe_width = 65536

2. Use a mid-sized K
For the erasure code algorithms, we find that a mid-sized K value gives balanced results between throughput and CPU utilization. We recommend using 10+4 or 8+2 mode, as sketched below.
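A sketch of creating such a profile and an erasure-coded pool (profile name, pool name, and PG counts are placeholders):

ceph osd erasure-code-profile set ec-10-4 k=10 m=4
ceph osd pool create ecpool 8192 8192 erasure ec-10-4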
h3. Appendix

h3. Sample Ceph.conf

[global]
fsid = 35b08d01-b688-4b9a-947b-bc2e25719370
mon_initial_members = gw2
mon_host = 10.10.10.105
filestore_xattr_use_omap = true
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[mon]
mon_pg_warn_max_per_osd=5000
mon_max_pool_pg_num=106496
[client]
rbd cache = false
[osd]
osd mkfs type = xfs
osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k,delaylog
osd mkfs options xfs = -f -i size=2048
filestore_queue_max_ops=5000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 8
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
public_network = 10.10.10.100/24
cluster_network = 10.10.10.100/24

[client.radosgw.gw2-1]
host = gw2
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw cache enabled = true
rgw cache lru size = 100000
rgw socket path = /var/run/ceph/ceph.client.radosgw.gw2-1.fastcgi.sock
rgw thread pool size = 256
rgw enable ops log = false
rgw enable usage log = false
log file = /dev/null
rgw frontends = civetweb port=80
rgw override bucket index max shards = 8
h3. Sample sysctl.conf

fs.file-max = 6553600
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 819200
net.ipv4.tcp_keepalive_time = 20
kernel.msgmni = 2878
kernel.sem = 256 32000 100 142
kernel.shmmni = 4096
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 1048576
net.core.wmem_max = 1048576
net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_tw_buckets = 10000
h3. All-NVMe Ceph Cluster Tuning for MySQL workload

h3. Ceph.conf

[global]
        enable experimental unrecoverable data corrupting features = bluestore rocksdb
        osd objectstore = bluestore
        ms_type = async
        rbd readahead disable after bytes = 0
        rbd readahead max bytes = 4194304
        bluestore default buffered read = true
        auth client required = none
        auth cluster required = none
        auth service required = none
        filestore xattr use omap = true
        cluster network = 192.168.142.0/24, 192.168.143.0/24
        private network = 192.168.144.0/24, 192.168.145.0/24
        log file = /var/log/ceph/$name.log
        log to syslog = false
        mon compact on trim = false
        osd pg bits = 8
        osd pgp bits = 8
        mon pg warn max object skew = 100000
        mon pg warn min per osd = 0
        mon pg warn max per osd = 32768
        debug_lockdep = 0/0
        debug_context = 0/0
        debug_crush = 0/0
        debug_buffer = 0/0
        debug_timer = 0/0
        debug_filer = 0/0
        debug_objecter = 0/0
        debug_rados = 0/0
        debug_rbd = 0/0
        debug_ms = 0/0
        debug_monc = 0/0
        debug_tp = 0/0
        debug_auth = 0/0
        debug_finisher = 0/0
        debug_heartbeatmap = 0/0
        debug_perfcounter = 0/0
        debug_asok = 0/0
        debug_throttle = 0/0
        debug_mon = 0/0
        debug_paxos = 0/0
        debug_rgw = 0/0
        perf = true
        mutex_perf_counter = true
        throttler_perf_counter = false
        rbd cache = false
[mon]
        mon data =/home/bmpa/tmp_cbt/ceph/mon.$id
        mon_max_pool_pg_num=166496
        mon_osd_max_split_count = 10000
        mon_pg_warn_max_per_osd = 10000
[mon.a]
        host = ft02
        mon addr = 192.168.142.202:6789
[osd]
        osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
        osd_mkfs_options_xfs = -f -i size=2048
        osd_op_threads = 32
        filestore_queue_max_ops=5000
        filestore_queue_committing_max_ops=5000
        journal_max_write_entries=1000
        journal_queue_max_ops=3000
        objecter_inflight_ops=102400
        filestore_wbthrottle_enable=false
        filestore_queue_max_bytes=1048576000
        filestore_queue_committing_max_bytes=1048576000
        journal_max_write_bytes=1048576000
        journal_queue_max_bytes=1048576000
        ms_dispatch_throttle_bytes=1048576000
        objecter_inflight_op_bytes=1048576000
        osd_mkfs_type = xfs
        filestore_max_sync_interval=10
        osd_client_message_size_cap = 0
        osd_client_message_cap = 0
        osd_enable_op_tracker = false
        filestore_fd_cache_size = 64
        filestore_fd_cache_shards = 32
        filestore_op_threads = 6
h3. CBT YAML

cluster:
  user: "bmpa"
  head: "ft01"
  clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
  osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
  mons:
   ft02:
     a: "192.168.142.202:6789"
  osds_per_node: 16
  fs: xfs
  mkfs_opts: '-f -i size=2048 -n size=64k'
  mount_opts: '-o inode64,noatime,logbsize=256k'
  conf_file: '/home/bmpa/cbt/ceph.conf'
  use_existing: False
  newstore_block: True
  rebuild_every_test: False
  clusterid: "ceph"
  iterations: 1
  tmp_dir: "/home/bmpa/tmp_cbt"
  pool_profiles:
    2rep:
      pg_size: 8192
      pgp_size: 8192
      replication: 2
benchmarks:
  librbdfio:
    time: 300
    ramp: 300
    vol_size: 10
    mode: ['randrw']
    rwmixread: [0,70,100]
    op_size: [4096]
    procs_per_volume: [1]
    volumes_per_client: [10]
    use_existing_volumes: False
    iodepth: [4,8,16,32,64,128]
    osd_ra: [4096]
    norandommap: True
    cmd_path: '/usr/local/bin/fio'
    pool_profile: '2rep'
    log_avg_msec: 250
h3. MySQL configuration file (my.cnf)

[client]
port            = 3306
socket          = /var/run/mysqld/mysqld.sock
[mysqld_safe]
socket          = /var/run/mysqld/mysqld.sock
nice            = 0
[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
datadir         = /data
basedir         = /usr
tmpdir          = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address            = 0.0.0.0
max_allowed_packet      = 16M
thread_stack            = 192K
thread_cache_size       = 8
query_cache_limit       = 1M
query_cache_size        = 16M
log_error = /var/log/mysql/error.log
expire_logs_days        = 10
max_binlog_size         = 100M
performance_schema=off
innodb_buffer_pool_size = 25G
innodb_flush_method = O_DIRECT
innodb_log_file_size=4G
thread_cache_size=16
innodb_file_per_table
innodb_checksums = 0
innodb_flush_log_at_trx_commit = 0
innodb_write_io_threads = 8
innodb_page_cleaners= 16
innodb_read_io_threads = 8
max_connections = 50000
[mysqldump]
quick
quote-names
max_allowed_packet      = 16M
[mysql]
!includedir /etc/mysql/conf.d/
h3. Sample Ceph Vendor Solutions

The following are pointers to Ceph solutions, but this list is not comprehensive:
https://www.dell.com/learn/us/en/04/shared-content~data-sheets~en/documents~dell-red-hat-cloud-solutions.pdf
http://www.fujitsu.com/global/products/computing/storage/eternus-cd/
http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-2799ENW.pdf
http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-8638ENW.pdf
http://www.supermicro.com/solutions/storage_ceph.cfm
https://www.thomas-krenn.com/en/products/storage-systems/suse-enterprise-storage.html
http://www.qct.io/Solution/Software-Defined-Infrastructure/Storage-Virtualization/QCT-and-Red-Hat-Ceph-Storage-p365c225c226c230
Notices:
Copyright © 2016 Intel Corporation. All rights reserved
Intel, the Intel logo, Intel Atom, Intel Core, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Intel® Hyper-Threading Technology available on select Intel® Core™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. 
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. 
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. 
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.