h1. Tuning for All Flash Deployments

Ceph Tuning and Best Practices for All Flash Intel® Xeon® Servers

Last updated: January 2017


h2. Table of Contents

# +*[[Tuning_for_All_Flash_Deployments#Introduction|Introduction]]*+
# +*[[Tuning_for_All_Flash_Deployments#Ceph-Storage-Hardware-Guidelines|Ceph Storage Hardware Guidelines]]*+
# +*[[Tuning_for_All_Flash_Deployments#Intel-Tuning-and-Optimization-Recommendations-for-Ceph|Intel Tuning and Optimization Recommendations for Ceph]]*+
## +[[Tuning_for_All_Flash_Deployments#Server-Tuning|Server Tuning]]+
### +[[Tuning_for_All_Flash_Deployments#Ceph-Client-Configuration|Ceph Client Configuration]]+
### +[[Tuning_for_All_Flash_Deployments#Ceph-Storage-Node-NUMA-Tuning|Ceph Storage Node NUMA Tuning]]+
## +[[Tuning_for_All_Flash_Deployments#Memory-Tuning|Memory Tuning]]+
## +[[Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning|NVMe SSD partitioning]]+
## +[[Tuning_for_All_Flash_Deployments#OS-Tuning|OS Tuning]] (must be done on all Ceph nodes)+
### +[[Tuning_for_All_Flash_Deployments#Kernel-Tuning|Kernel Tuning]]+
### +[[Tuning_for_All_Flash_Deployments#Filesystem-considerations|Filesystem considerations]]+
### +[[Tuning_for_All_Flash_Deployments#Disk-read-ahead|Disk read ahead]]+
### +[[Tuning_for_All_Flash_Deployments#OSD-RADOS|OSD: RADOS]]+
## +[[Tuning_for_All_Flash_Deployments#RBD-Tuning|RBD Tuning]]+
## +[[Tuning_for_All_Flash_Deployments#RGW-Rados-Gateway-Tuning|RGW: Rados Gateway Tuning]]+
## +[[Tuning_for_All_Flash_Deployments#Erasure-Coding-Tuning|Erasure Coding Tuning]]+
# +*[[Tuning_for_All_Flash_Deployments#Appendix|Appendix]]*+
# +*[[Tuning_for_All_Flash_Deployments#Sample-Ceph-conf|Sample Ceph.conf]]*+
# +*[[Tuning_for_All_Flash_Deployments#Sample-sysctl-conf|Sample sysctl.conf]]*+
# +*[[Tuning_for_All_Flash_Deployments#All-NVMe-Ceph-Cluster-Tuning-for-MySQL-workload|All-NVMe Ceph Cluster Tuning for MySQL workload]]*+
## +[[Tuning_for_All_Flash_Deployments#Ceph-conf|Ceph.conf]]+
## +[[Tuning_for_All_Flash_Deployments#CBT-YAML|CBT YAML]]+
## +[[Tuning_for_All_Flash_Deployments#MySQL-configuration-file|MySQL configuration file]] (my.cnf)+
# +*[[Tuning_for_All_Flash_Deployments#Sample-Ceph-Vendor-Solutions|Sample Ceph Vendor Solutions]]*+

h3. Introduction 
Ceph is a scalable, open source, software-defined storage offering that runs on commodity hardware. Ceph has been developed from the ground up to deliver object, block, and file system storage in a single software platform that is self-managing, self-healing, and has no single point of failure. Because of its highly scalable, software-defined storage architecture, it can be a powerful storage solution to consider.

This document covers Ceph tuning guidelines specifically for all-flash deployments, based on extensive testing by Intel with a variety of system, operating system, and Ceph optimizations to achieve the highest possible performance for servers with Intel® Xeon® processors and Intel® Solid State Drive Data Center (Intel® SSD DC) Series. Details of OEM system SKUs and Ceph reference architectures for targeted use cases can be found on the ceph.com website.

h3. Ceph Storage Hardware Guidelines  

* The *Standard* configuration is ideally suited for throughput-oriented workloads (e.g., analytics, DVR). The Intel® SSD Data Center P3700 Series is recommended to achieve the best possible performance while balancing cost.

| CPU | Intel® Xeon® CPU E5-2650v4 or higher |
| Memory | Minimum of 64 GB |
| NIC | 10GbE |
| Disks | 1x 1.6TB P3700 + 12x 4TB HDDs (1:12 ratio); P3700 as journal and caching |
| Caching software | Intel Cache Acceleration Software for read caching; option: Intel® Rapid Storage Technology enterprise/MD4.3 |

* The *TCO optimized* configuration provides the best possible performance for performance-centric workloads (e.g., database) while optimizing TCO with a mix of SATA SSDs and NVMe SSDs.

| CPU | Intel® Xeon® CPU E5-2690v4 or higher |
| Memory | 128 GB or higher |
| NIC | Dual 10GbE |
| Disks | 1x 800GB P3700 + 4x 1.6TB S3510 |

* The *IOPS optimized* configuration provides the best performance for workloads that demand low latency, using an all-NVMe SSD configuration.

| CPU | Intel® Xeon® CPU E5-2699v4 |
| Memory | 128 GB or higher |
| NIC | 1x 40GbE, 4x 10GbE |
| Disks | 4x 2TB P3700 |

h3. Intel Tuning and Optimization Recommendations for Ceph
h3. Server Tuning

h3. Ceph Client Configuration

In a balanced system configuration, both client and storage node configurations need to be optimized to get the best possible cluster performance. Care must be taken to ensure that the Ceph client nodes have enough CPU bandwidth to achieve optimum performance. The graph below shows end-to-end performance for different client CPU configurations for a block workload.

!1-cpu-cores-client.png!
Figure 1: Client CPU cores and Ceph cluster impact

h3. Ceph Storage Node NUMA Tuning

To service client IO as quickly as possible, it is important to minimize inter-socket communication between NUMA nodes and so avoid the associated latency penalty. Based on an extensive set of experiments conducted at Intel, it is recommended to pin Ceph OSD processes to the CPU socket to which their NVMe SSDs, HBAs, and NIC devices are attached.

!2-numa-mode-config.png!
Figure 2: NUMA node configuration and OSD assignment

> *_Ceph startup scripts need to be changed to set the affinity, e.g. setaffinity=" numactl --membind=0 --cpunodebind=0 "_*
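
For example, a minimal sketch of verifying the effect by hand before changing the startup scripts (the OSD id and NUMA node below are placeholders for your topology):

<pre>
# Run OSD 0 with CPU and memory pinned to NUMA node 0, assuming its NVMe
# device and NIC are attached to socket 0; adjust the node and OSD id as needed.
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0 -f
</pre>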

The performance data below shows higher cluster throughput and lower latency when Ceph OSDs are partitioned by CPU socket, so that each OSD manages media attached to its local socket and network IO does not cross the QPI link.

!3-numa-node-perf-vs-default-sys-config.png!
!3.1-table.png!
Figure 3: NUMA node performance compared to default system configuration

h3. Memory Tuning

Ceph default packages use tcmalloc. For flash-optimized configurations, we found that jemalloc provides the best possible performance without degradation over time. Ceph supports jemalloc from the Hammer release onward, but it must be built with the jemalloc option enabled.

The graph in Figure 4 shows how thread cache size impacts throughput. With a tuned thread cache size, performance is comparable between TCMalloc and JEMalloc. However, as shown in Figure 5 and Figure 6, TCMalloc performance degrades over time, unlike JEMalloc.
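
If tcmalloc is kept, the thread cache can typically be raised through an environment variable read by the Ceph sysconfig/default files; a minimal sketch (the path and value below are illustrative, not taken from this document):

<pre>
# /etc/sysconfig/ceph (on Debian-based systems: /etc/default/ceph)
# Raise the tcmalloc thread cache from the 32 MB default to 128 MB.
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
</pre>

Restart the OSDs after changing this value so they pick up the new environment.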

!4-thread-cache-size-impact-over-perf.png!
Figure 4: Thread cache size impact on performance

!5.0-tcmalloc-over-time.png!
!5.1-tcmalloc-over-time.png!
Figure 5: TCMalloc performance in a running cluster over time

!6.0-jemalloc-over-time.png!
!6.1-jemalloc-over-time.png!
Figure 6: JEMalloc performance in a running cluster over time

h3. NVMe SSD partitioning

A single OSD is not able to take full advantage of NVMe SSD bandwidth. In our testing, four partitions per SSD (and therefore four OSDs per drive) was the optimum number and gave the best possible performance.
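
A minimal sketch of carving one NVMe device into four equal partitions (the device name and partition labels are placeholders; each partition then backs its own OSD):

<pre>
# Illustrative only: create a GPT label and four equal partitions on one NVMe SSD.
parted -s /dev/nvme0n1 mklabel gpt \
  mkpart osd-1 0% 25% \
  mkpart osd-2 25% 50% \
  mkpart osd-3 50% 75% \
  mkpart osd-4 75% 100%
</pre>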

!7.0-ceph-osd-latency-with-different-ssd-partitions.png!
Figure 7: Ceph OSD latency with different SSD partitions

!8-cpu-utilization-with-different-num-of-ssd-partitions.png!
Figure 8: CPU utilization with different numbers of SSD partitions

h3. OS Tuning

*(must be done on all Ceph nodes)*

h3. Kernel Tuning

# Modify system control in /etc/sysctl.conf
<pre>
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# Reuse and recycle sockets in TIME_WAIT state
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 0

# Double the amount of allowed conntrack entries
net.netfilter.nf_conntrack_max = 2621440
net.netfilter.nf_conntrack_tcp_timeout_established = 1800

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536

# Controls the default maximum size of a message queue
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
</pre>

# IP jumbo frames

If your switch supports jumbo frames, then a larger MTU size is helpful. Our tests showed that a 9000-byte MTU improves sequential read/write performance.

# Set the Linux disk scheduler to cfq (an example of applying these OS settings follows below).
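
A minimal sketch of applying the settings above; the interface and device names are placeholders, not taken from this document:

<pre>
# Load the sysctl settings from /etc/sysctl.conf.
sysctl -p /etc/sysctl.conf

# Enable jumbo frames on the cluster-facing interface (assumes eth0).
ip link set dev eth0 mtu 9000

# Switch the disk scheduler to cfq for a given device (assumes sda).
echo cfq > /sys/block/sda/queue/scheduler
</pre>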
h3. Filesystem considerations

Ceph is designed to be mostly filesystem agnostic; the only requirement is that the filesystem supports extended attributes (xattrs). Ceph OSDs depend on the extended attributes (XATTRs) of the underlying file system for: a) internal object state, b) snapshot metadata, and c) RGW access control lists. Currently XFS is the recommended file system. We recommend using a big inode size (the default inode size is 256 bytes) when creating the file system:

<pre>
mkfs.xfs -i size=2048 /dev/sda1
</pre>

Setting the inode size is important, as XFS stores xattr data in the inode. If the metadata is too large to fit in the inode, a new extent is created, which can cause quite a performance problem. Upping the inode size to 2048 bytes provides enough room to write the default metadata, plus a little headroom.

The following example mount options are recommended when using XFS:

<pre>
mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sda1 /var/lib/ceph/osd/ceph-0
</pre>

The following are specific recommendations for Intel SSDs and Ceph:

<pre>
mkfs.xfs -f -K -i size=2048 -s size=4096 /dev/md0
/bin/mount -o noatime,nodiratime,nobarrier /dev/md0 /data/mysql
</pre>

h3. Disk read ahead

Read-ahead is the file prefetching technology used in the Linux operating system: a file's contents are loaded into the page cache ahead of time, so that subsequent accesses are served from physical memory rather than from disk, which is much faster.

<pre>
echo 2048 > /sys/block/${disk}/queue/read_ahead_kb  (default 128)
</pre>
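
The echo above does not survive a reboot; one way to make it persistent is a udev rule, sketched below (the rule file name and device match are placeholders):

<pre>
# /etc/udev/rules.d/99-read-ahead.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/read_ahead_kb}="2048"
</pre>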

| Per disk performance | read_ahead_kb = 128 (default) | read_ahead_kb = 512 | % |
| Sequential Read (MB/s) | 1232 MB/s | 3251 MB/s | +163% |

* 6-node Ceph cluster, each node with 20 OSDs (750 GB, 7200 RPM, 2.5" HDD)

h3. OSD: RADOS

Tuning has a significant performance impact on a Ceph storage system; there are hundreds of tuning knobs. We will introduce some of the most important tuning settings.

1. Large PG/PGP number (since Cuttlefish)

We find that using a large PG number per OSD (>200) improves performance. It also eases the data distribution imbalance issue (the default is 8).

<pre>
ceph osd pool create testpool 8192 8192
</pre>

2. omap data on separate disks (since Giant)

Mounting the omap directory on a separate SSD improves random write performance. In our testing we saw a ~20% performance improvement.

3. objecter_inflight_ops/objecter_inflight_op_bytes (since Cuttlefish)

These throttles tell the objecter (which is responsible for sending requests to OSDs) to limit outgoing ops according to its budget. We tweak these parameters to 10x their defaults (1024 and 1024*1024*100):

<pre>
objecter_inflight_ops = 10240
objecter_inflight_op_bytes = 1048576000
</pre>

4. ms_dispatch_throttle_bytes (since Cuttlefish)

ms_dispatch_throttle_bytes throttles the dispatch message size for the simple messenger. We tweak this parameter to 10x its default:

<pre>
ms_dispatch_throttle_bytes = 1048576000
</pre>

5. journal_queue_max_bytes/journal_queue_max_ops (since Cuttlefish)

These throttles limit in-flight ops for the journal. If the journal does not get enough budget for the current op, it blocks the OSD op thread. We tweak these parameters to 10x their defaults:

<pre>
journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000
</pre>

6. filestore_queue_max_ops/filestore_queue_max_bytes (since Cuttlefish)

These throttles limit in-flight ops for the filestore. They are checked before sending ops to the journal, so if the filestore does not get enough budget for the current op, the OSD op thread is blocked. We tweak these parameters to 10x their defaults:

<pre>
filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1048576000
</pre>

7. filestore_op_threads

filestore_op_threads controls the number of filesystem operation threads that execute in parallel. If the storage backend is fast enough and has enough queues to support parallel operations, it is recommended to increase this parameter, given there is enough CPU headroom:

<pre>
filestore_op_threads = 6
</pre>

8. journal_max_write_entries/journal_max_write_bytes (since Cuttlefish)

These throttles limit the number of ops and bytes for every journal write. Tweaking them may be helpful for small writes; we tweak them to 10x their defaults:

<pre>
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
</pre>

9. osd_op_num_threads_per_shard/osd_op_num_shards (since Firefly)

osd_op_num_shards sets the number of queues used to cache requests, and osd_op_num_threads_per_shard is the number of threads per queue; adjusting these two parameters depends on the cluster. After several performance tests with different settings, we concluded that the default parameters provide the best performance.

10. filestore_max_sync_interval (since Cuttlefish)

filestore_max_sync_interval controls the interval at which the sync thread flushes data from memory to disk. By default the filestore writes data to memory, and the sync thread is responsible for flushing it to disk; only then can journal entries be trimmed. Note that a large filestore_max_sync_interval can cause performance spikes. We tweak this parameter to 10 seconds:

<pre>
filestore_max_sync_interval = 10
</pre>

11. ms_crc_data/ms_crc_header (since Cuttlefish)

Disabling CRC computation for the simple messenger can reduce CPU utilization, for example as shown below.
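
A minimal ceph.conf sketch for this item (values assumed, not taken from the sample configurations later in this document):

<pre>
ms_crc_data = false
ms_crc_header = false
</pre>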

12. filestore_fd_cache_shards/filestore_fd_cache_size (since Firefly)

The filestore cache is a map from object name to file descriptor. filestore_fd_cache_shards sets the number of LRU caches and filestore_fd_cache_size is the cache size; tweaking these two parameters may reduce fd lookup time.

13. Set debug level to 0 (since Cuttlefish)

For an all-SSD Ceph cluster, setting the debug level of each subsystem to 0 improves performance.

<pre>
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
</pre>

h3. RBD Tuning

To help achieve low latency on the RBD layer, we suggest the following, in addition to the CERN tuning referenced on ceph.com:

1) echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null

2) Start each ceph-osd in a dedicated cgroup with dedicated CPU cores (which should be free from any other load, even kernel load such as network interrupts); a sketch follows after this list.

3) Increase "filestore_omap_header_cache_size" and "filestore_fd_cache_size" for better caching (16 MB for each 500 GB of storage).

For the disk entry in libvirt, put the addresses of all three Ceph monitors.
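
A minimal sketch of item 2 using the cgroup v1 tools from libcgroup (the cgroup name, core list, and OSD id are placeholders):

<pre>
# Create a cpuset cgroup reserved for OSD 0 and pin it to cores 0-3 on NUMA node 0.
cgcreate -g cpuset:/osd0
cgset -r cpuset.cpus=0-3 osd0
cgset -r cpuset.mems=0 osd0

# Launch the OSD inside that cgroup.
cgexec -g cpuset:/osd0 /usr/bin/ceph-osd -i 0 -f
</pre>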

h3. RGW: Rados Gateway Tuning

1. Disable usage/access log (since Cuttlefish)

<pre>
rgw enable ops log = false
rgw enable usage log = false
log file = /dev/null
</pre>

We find that disabling the usage/access log improves performance.

2. Using a large cache size (since Cuttlefish)

<pre>
rgw cache enabled = true
rgw cache lru size = 100000
</pre>

Caching the hot objects improves GET performance.

3. Using larger PG split/merge values (since Firefly)

<pre>
filestore_merge_threshold = 500
filestore_split_multiple = 100
</pre>

We find that PG split/merge introduces a big overhead. Using large values postpones the split/merge behavior, which helps when lots of small files are stored in the cluster.

4. Using a load balancer with multiple RGW instances (since Cuttlefish)

We have found that RGW has some scalability issues at present. With a single RGW instance the performance is poor. Running multiple RGW instances behind a load balancer (e.g., HAProxy) greatly improves throughput; a sample configuration follows below.
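
A minimal HAProxy sketch for item 4, balancing two hypothetical RGW instances (addresses, ports, and names are placeholders):

<pre>
# /etc/haproxy/haproxy.cfg (fragment)
frontend rgw_front
    bind *:80
    default_backend rgw_back

backend rgw_back
    balance roundrobin
    server rgw1 192.168.1.101:80 check
    server rgw2 192.168.1.102:80 check
</pre>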

5. Increase the number of Rados handlers (since Hammer)

Since Hammer it is possible to use multiple RADOS handlers per RGW instance. Increasing this value should improve performance (see the example below).

6. Using the Civetweb frontend (since Giant)

Before Giant, Apache + libfastcgi was the recommended setting. However, libfastcgi still uses the very old 'select' mode, which was not able to handle a large amount of concurrent IO in our testing. Using the Civetweb frontend helps to improve stability.

<pre>
rgw frontends = civetweb port=80
</pre>

7. Moving the bucket index to SSD (since Giant)

Bucket index updating may become a bottleneck if there are millions of objects in one single bucket. We find that moving the bucket index to SSD storage improves performance.

8. Bucket index sharding (since Hammer)

We find that the bucket index becomes a problem when there is a large number of objects inside one bucket, and sharding the index helps. However, index listing speed may be impacted.
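
A small ceph.conf sketch for item 5; the option name and value below are an assumption for Hammer-era releases, not taken from the sample configurations in this document:

<pre>
[client.radosgw.gateway]
rgw num rados handles = 8
</pre>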

h3. Erasure Coding Tuning

1. Use a larger stripe width

The default erasure code stripe size (4K) is not optimal. We find that using a bigger value (64K) reduces CPU utilization significantly (10%+).

<pre>
osd_pool_erasure_code_stripe_width = 65536
</pre>

2. Use a mid-sized K

For the erasure code algorithms, we find that a mid-sized K value brings a balanced result between throughput and CPU utilization. We recommend using 10+4 or 8+2 mode; see the sketch after this list.
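
A minimal sketch of creating an erasure-coded pool with k=10, m=4 (profile name, pool name, and PG count are placeholders):

<pre>
# Define a 10+4 erasure code profile and create a pool that uses it.
ceph osd erasure-code-profile set ec-10-4 k=10 m=4
ceph osd pool create ecpool 4096 4096 erasure ec-10-4
</pre>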

h3. Appendix

h3. Sample Ceph.conf

<pre>
[global]
fsid = 35b08d01-b688-4b9a-947b-bc2e25719370
mon_initial_members = gw2
mon_host = 10.10.10.105
filestore_xattr_use_omap = true
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[mon]
mon_pg_warn_max_per_osd = 5000
mon_max_pool_pg_num = 106496
[client]
rbd cache = false
[osd]
osd mkfs type = xfs
osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k,delaylog
osd mkfs options xfs = -f -i size=2048
filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 8
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
public_network = 10.10.10.100/24
cluster_network = 10.10.10.100/24

[client.radosgw.gw2-1]
host = gw2
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw cache enabled = true
rgw cache lru size = 100000
rgw socket path = /var/run/ceph/ceph.client.radosgw.gw2-1.fastcgi.sock
rgw thread pool size = 256
rgw enable ops log = false
rgw enable usage log = false
log file = /dev/null
rgw frontends = civetweb port=80
rgw override bucket index max shards = 8
</pre>

h3. Sample sysctl.conf

<pre>
fs.file-max = 6553600
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 819200
net.ipv4.tcp_keepalive_time = 20
kernel.msgmni = 2878
kernel.sem = 256 32000 100 142
kernel.shmmni = 4096
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 1048576
net.core.wmem_max = 1048576
net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_tw_buckets = 10000
</pre>

h3. All-NVMe Ceph Cluster Tuning for MySQL workload

h3. Ceph.conf

<pre>
[global]
        enable experimental unrecoverable data corrupting features = bluestore rocksdb
        osd objectstore = bluestore
        ms_type = async
        rbd readahead disable after bytes = 0
        rbd readahead max bytes = 4194304
        bluestore default buffered read = true
        auth client required = none
        auth cluster required = none
        auth service required = none
        filestore xattr use omap = true
        cluster network = 192.168.142.0/24, 192.168.143.0/24
        private network = 192.168.144.0/24, 192.168.145.0/24
        log file = /var/log/ceph/$name.log
        log to syslog = false
        mon compact on trim = false
        osd pg bits = 8
        osd pgp bits = 8
        mon pg warn max object skew = 100000
        mon pg warn min per osd = 0
        mon pg warn max per osd = 32768
        debug_lockdep = 0/0
        debug_context = 0/0
        debug_crush = 0/0
        debug_buffer = 0/0
        debug_timer = 0/0
        debug_filer = 0/0
        debug_objecter = 0/0
        debug_rados = 0/0
        debug_rbd = 0/0
        debug_ms = 0/0
        debug_monc = 0/0
        debug_tp = 0/0
        debug_auth = 0/0
        debug_finisher = 0/0
        debug_heartbeatmap = 0/0
        debug_perfcounter = 0/0
        debug_asok = 0/0
        debug_throttle = 0/0
        debug_mon = 0/0
        debug_paxos = 0/0
        debug_rgw = 0/0
        perf = true
        mutex_perf_counter = true
        throttler_perf_counter = false
        rbd cache = false
[mon]
        mon data = /home/bmpa/tmp_cbt/ceph/mon.$id
        mon_max_pool_pg_num = 166496
        mon_osd_max_split_count = 10000
        mon_pg_warn_max_per_osd = 10000
[mon.a]
        host = ft02
        mon addr = 192.168.142.202:6789
[osd]
        osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
        osd_mkfs_options_xfs = -f -i size=2048
        osd_op_threads = 32
        filestore_queue_max_ops = 5000
        filestore_queue_committing_max_ops = 5000
        journal_max_write_entries = 1000
        journal_queue_max_ops = 3000
        objecter_inflight_ops = 102400
        filestore_wbthrottle_enable = false
        filestore_queue_max_bytes = 1048576000
        filestore_queue_committing_max_bytes = 1048576000
        journal_max_write_bytes = 1048576000
        journal_queue_max_bytes = 1048576000
        ms_dispatch_throttle_bytes = 1048576000
        objecter_inflight_op_bytes = 1048576000
        osd_mkfs_type = xfs
        filestore_max_sync_interval = 10
        osd_client_message_size_cap = 0
        osd_client_message_cap = 0
        osd_enable_op_tracker = false
        filestore_fd_cache_size = 64
        filestore_fd_cache_shards = 32
        filestore_op_threads = 6
</pre>

h3. CBT YAML

<pre>
cluster:
  user: "bmpa"
  head: "ft01"
  clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
  osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
  mons:
    ft02:
      a: "192.168.142.202:6789"
  osds_per_node: 16
  fs: xfs
  mkfs_opts: '-f -i size=2048 -n size=64k'
  mount_opts: '-o inode64,noatime,logbsize=256k'
  conf_file: '/home/bmpa/cbt/ceph.conf'
  use_existing: False
  newstore_block: True
  rebuild_every_test: False
  clusterid: "ceph"
  iterations: 1
  tmp_dir: "/home/bmpa/tmp_cbt"
  pool_profiles:
    2rep:
      pg_size: 8192
      pgp_size: 8192
      replication: 2
benchmarks:
  librbdfio:
    time: 300
    ramp: 300
    vol_size: 10
    mode: ['randrw']
    rwmixread: [0,70,100]
    op_size: [4096]
    procs_per_volume: [1]
    volumes_per_client: [10]
    use_existing_volumes: False
    iodepth: [4,8,16,32,64,128]
    osd_ra: [4096]
    norandommap: True
    cmd_path: '/usr/local/bin/fio'
    pool_profile: '2rep'
    log_avg_msec: 250
</pre>

h3. MySQL configuration file (my.cnf)

<pre>
[client]
port            = 3306
socket          = /var/run/mysqld/mysqld.sock
[mysqld_safe]
socket          = /var/run/mysqld/mysqld.sock
nice            = 0
[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
datadir         = /data
basedir         = /usr
tmpdir          = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address            = 0.0.0.0
max_allowed_packet      = 16M
thread_stack            = 192K
thread_cache_size       = 8
query_cache_limit       = 1M
query_cache_size        = 16M
log_error = /var/log/mysql/error.log
expire_logs_days        = 10
max_binlog_size         = 100M
performance_schema=off
innodb_buffer_pool_size = 25G
innodb_flush_method = O_DIRECT
innodb_log_file_size=4G
thread_cache_size=16
innodb_file_per_table
innodb_checksums = 0
innodb_flush_log_at_trx_commit = 0
innodb_write_io_threads = 8
innodb_page_cleaners= 16
innodb_read_io_threads = 8
max_connections = 50000
[mysqldump]
quick
quote-names
max_allowed_packet      = 16M
[mysql]
!includedir /etc/mysql/conf.d/
</pre>

h3. Sample Ceph Vendor Solutions

The following are pointers to Ceph solutions, but this list is not comprehensive:

* https://www.dell.com/learn/us/en/04/shared-content~data-sheets~en/documents~dell-red-hat-cloud-solutions.pdf
* http://www.fujitsu.com/global/products/computing/storage/eternus-cd/
* http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-2799ENW.pdf
* http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-8638ENW.pdf
* http://www.supermicro.com/solutions/storage_ceph.cfm
* https://www.thomas-krenn.com/en/products/storage-systems/suse-enterprise-storage.html
* http://www.qct.io/Solution/Software-Defined-Infrastructure/Storage-Virtualization/QCT-and-Red-Hat-Ceph-Storage-p365c225c226c230

Notices:

Copyright © 2016 Intel Corporation. All rights reserved.

Intel, the Intel logo, Intel Atom, Intel Core, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

Intel® Hyper-Threading Technology available on select Intel® Core™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.