h1. Tuning for All Flash Deployments
Ceph Tuning and Best Practices for All Flash Intel® Xeon® Servers
Last updated: January 2017

h2. Table of Contents

[[Tuning_for_All_Flash_Deployments#Introduction|Introduction]]
Ceph Storage Hardware Guidelines
Intel Tuning and Optimization Recommendations for Ceph
Server Tuning
Ceph Client Configuration
Ceph Storage Node NUMA Tuning
Memory Tuning
NVMe SSD partitioning
OS Tuning (must be done on all Ceph nodes)
Kernel Tuning
Filesystem considerations
Disk read ahead
OSD: RADOS
RBD Tuning
RGW: Rados Gateway Tuning
Erasure Coding Tuning
Appendix
Sample Ceph.conf
Sample sysctl.conf
All-NVMe Ceph Cluster Tuning for MySQL workload
Ceph.conf
CBT YAML
MySQL configuration file (my.cnf)
Sample Ceph Vendor Solutions

h3. Introduction

Ceph is a scalable, open source, software-defined storage offering that runs on commodity hardware. Ceph has been developed from the ground up to deliver object, block, and file system storage in a single software platform that is self-managing, self-healing, and has no single point of failure. Because of its highly scalable, software-defined storage architecture, Ceph can be a powerful storage solution to consider.
This document covers Ceph tuning guidelines specifically for all-flash deployments, based on extensive testing by Intel with a variety of system, operating system, and Ceph optimizations to achieve the highest possible performance for servers with Intel® Xeon® processors and Intel® Solid State Drive Data Center (Intel® SSD DC) Series drives. Details of OEM system SKUs and Ceph reference architectures for targeted use cases can be found on the ceph.com website.
h3. Ceph Storage Hardware Guidelines

The standard configuration is ideally suited for throughput-oriented workloads (e.g., analytics, DVR). The Intel® SSD DC P3700 Series is recommended to achieve the best possible performance while balancing cost.

|CPU|Intel® Xeon® CPU E5-2650v4 or higher|
|Memory|Minimum of 64 GB|
|NIC|10GbE|
|Disks|1x 1.6TB P3700 + 12x 4TB HDDs (1:12 ratio); P3700 as journal and caching|
|Caching software|Intel Cache Acceleration Software for read caching; option: Intel® Rapid Storage Technology enterprise/MD4.3|

The TCO-optimized configuration provides the best possible performance for performance-centric workloads (e.g., database) while improving TCO with a mix of SATA SSDs and NVMe SSDs.

|CPU|Intel® Xeon® CPU E5-2690v4 or higher|
|Memory|128 GB or higher|
|NIC|Dual 10GbE|
|Disks|1x 800GB P3700 + 4x S3510 1.6TB|

The IOPS-optimized configuration provides the best performance for workloads that demand low latency, using an all-NVMe SSD configuration.

|CPU|Intel® Xeon® CPU E5-2699v4|
|Memory|128 GB or higher|
|NIC|1x 40GbE, 4x 10GbE|
|Disks|4x P3700 2TB|

h3. Intel Tuning and Optimization Recommendations for Ceph

h4. Server Tuning

h4. Ceph Client Configuration

In a balanced system configuration, both the client and the storage node configuration need to be optimized to get the best possible cluster performance. Care needs to be taken to ensure that the Ceph client node has enough CPU bandwidth to achieve optimum performance. The graph below (Figure 1) shows the end-to-end performance for different client CPU configurations for a block workload.

Figure 1: Client CPU cores and Ceph cluster impact

h4. Ceph Storage Node NUMA Tuning

To keep latency low, it is important to minimize inter-socket communication between NUMA nodes so that client IO is serviced as quickly as possible. Based on an extensive set of experiments conducted at Intel, it is recommended to pin Ceph OSD processes to the same CPU socket that has the NVMe SSDs, HBAs, and NIC devices attached.

Figure 2: NUMA node configuration and OSD assignment

Ceph startup scripts need to be changed to use setaffinity="numactl --membind=0 --cpunodebind=0".
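A minimal sketch of such pinning (NUMA node, OSD id, and binary path are examples, not the exact startup scripts used in the tests):

<pre>
# Pin an OSD daemon and its memory to NUMA node 0, where its NVMe, HBA and NIC are attached
numactl --membind=0 --cpunodebind=0 /usr/bin/ceph-osd -i 0 --cluster ceph

# Verify the placement of the running OSD
numastat -p $(pgrep -f 'ceph-osd -i 0')
</pre>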
The performance data below shows the best cluster throughput and lowest latency when Ceph OSDs are partitioned by CPU socket, so that each OSD manages media attached to its local socket and network IO does not cross the QPI link.

Figure 3: NUMA node performance compared to default system configuration

h4. Memory Tuning

Default Ceph packages use TCMalloc. For flash-optimized configurations, we found that jemalloc provides the best possible performance without degradation over time. Ceph supports jemalloc as of the Hammer release, but it must be built with the jemalloc option enabled.

The graph in Figure 4 shows how thread cache size impacts throughput. With a tuned thread cache size, TCMalloc and jemalloc performance is comparable. However, as shown in Figures 5 and 6, TCMalloc performance degrades over time, unlike jemalloc.

Figure 4: Thread cache size impact on performance

Figure 5: TCMalloc performance in a running cluster over time

Figure 6: jemalloc performance in a running cluster over time

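As a quick check and tuning aid, the allocator linked into the OSD binary can be verified, and the TCMalloc thread cache can be enlarged through an environment variable; a minimal sketch (binary path, file location, and the 128 MB value are examples):

<pre>
# Which allocator is the OSD actually using?
ldd /usr/bin/ceph-osd | grep -E 'tcmalloc|jemalloc'

# If staying on TCMalloc, enlarge its thread cache for the Ceph daemons,
# e.g. in /etc/sysconfig/ceph (value shown: 128 MB)
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
</pre>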
h4. NVMe SSD partitioning

A single OSD cannot take full advantage of NVMe SSD bandwidth. In our testing, four partitions per SSD (one OSD per partition) is the optimum number and gives the best possible performance.
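A minimal sketch of creating four equal partitions on one NVMe device (the device name is an example); each partition is then given to its own OSD so that four OSDs share the drive:

<pre>
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart osd-1 0% 25%
parted -s /dev/nvme0n1 mkpart osd-2 25% 50%
parted -s /dev/nvme0n1 mkpart osd-3 50% 75%
parted -s /dev/nvme0n1 mkpart osd-4 75% 100%
</pre>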
Figure 7: Ceph OSD latency with different SSD partitions

Figure 8: CPU utilization with different number of SSD partitions

h4. OS Tuning (must be done on all Ceph nodes)

h4. Kernel Tuning

1. Modify system control in /etc/sysctl.conf:

# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# Allow reuse/recycling of TIME_WAIT sockets
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 0

# Double the amount of allowed conntrack entries
net.netfilter.nf_conntrack_max = 2621440
net.netfilter.nf_conntrack_tcp_timeout_established = 1800

# Disable netfilter on bridges
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 65536

# Controls the maximum size of a single message, in bytes
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

2. IP jumbo frames

If your switch supports jumbo frames, then a larger MTU size is helpful. Our tests showed that a 9000-byte MTU improves sequential read/write performance, for example:
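A minimal example (the interface name is illustrative; the switch ports must be configured for jumbo frames as well, and the setting should be made persistent in the distribution's network configuration):

<pre>
ip link set dev eth0 mtu 9000
</pre>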
3. Set the Linux disk scheduler to cfq
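For example, per data device (the device name is illustrative):

<pre>
echo cfq > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler    # the active scheduler is shown in brackets
</pre>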
h4. Filesystem considerations

Ceph is designed to be mostly filesystem agnostic; the only requirement is that the filesystem supports extended attributes (xattrs). Ceph OSDs depend on the extended attributes (XATTRs) of the underlying file system for: a) internal object state, b) snapshot metadata, c) RGW access control lists, etc. Currently XFS is the recommended file system. We recommend using a large inode size (the default inode size is 256 bytes) when creating the file system:

mkfs.xfs -i size=2048 /dev/sda1

Setting the inode size is important, as XFS stores xattr data in the inode. If the metadata is too large to fit in the inode, a new extent is created, which can cause quite a performance problem. Increasing the inode size to 2048 bytes provides enough room to write the default metadata, plus a little headroom.

The following example mount options are recommended when using XFS:

mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sda1 /var/lib/ceph/osd/ceph-0

The following are specific recommendations for Intel SSDs with Ceph:

mkfs.xfs -f -K -i size=2048 -s size=4096 /dev/md0
/bin/mount -o noatime,nodiratime,nobarrier /dev/md0 /data/mysql

h4. Disk read ahead

Read-ahead is the file prefetching mechanism of the Linux kernel: it loads a file's contents into the page cache ahead of time, so that subsequent accesses are served from physical memory rather than from disk, which is much faster.

echo 2048 > /sys/block/${disk}/queue/read_ahead_kb  (default 128)

Per-disk sequential read performance with different read_ahead_kb values*:

|_. |_.read_ahead_kb = 128|_.read_ahead_kb = 512|_.Change|
|Sequential Read (MB/s)|1232|3251|+163%|

* 6-node Ceph cluster, each node with 20 OSDs (750 GB 7200 RPM 2.5" HDDs)

h4. OSD: RADOS

Tuning has a significant performance impact on a Ceph storage system; there are hundreds of tuning knobs available. We will introduce some of the most important tuning settings.

1. Large PG/PGP number (since Cuttlefish)

We find that using a large PG number per OSD (>200) improves performance, and it also eases data distribution imbalance. The default pool PG count is 8; for example:

ceph osd pool create testpool 8192 8192

2. omap data on separate disks (since Giant)

Mounting the omap directory on a separate SSD improves random write performance. In our testing we saw a ~20% performance improvement, as sketched below.
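A minimal sketch of relocating a FileStore OSD's omap directory onto an SSD (paths, mount point, and OSD id are examples; stop the OSD first):

<pre>
systemctl stop ceph-osd@0           # or the distribution's init script on older releases
mv /var/lib/ceph/osd/ceph-0/current/omap /mnt/ssd/osd-0-omap
ln -s /mnt/ssd/osd-0-omap /var/lib/ceph/osd/ceph-0/current/omap
chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/current/omap   # only where the OSD runs as the ceph user
systemctl start ceph-osd@0
</pre>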
3. objecter_inflight_ops / objecter_inflight_op_bytes (since Cuttlefish)

These throttles tell the objecter (the component responsible for sending requests to OSDs) to limit outgoing ops according to its budget. We tweak these parameters to 10x their defaults (defaults: 1024 and 1024*1024*100).

objecter_inflight_ops = 10240
objecter_inflight_op_bytes = 1048576000

4. ms_dispatch_throttle_bytes (since Cuttlefish)

ms_dispatch_throttle_bytes throttles the size of messages being dispatched by the simple messenger. We tweak this parameter to 10x its default.

ms_dispatch_throttle_bytes = 1048576000

5. journal_queue_max_bytes / journal_queue_max_ops (since Cuttlefish)

These throttles limit in-flight ops for the journal. If the journal does not get enough budget for the current op, it blocks the OSD op thread. We tweak these parameters to 10x their defaults.

journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000

6. filestore_queue_max_ops / filestore_queue_max_bytes (since Cuttlefish)

These throttles limit in-flight ops for the filestore. They are checked before ops are sent to the journal, so if the filestore does not get enough budget for the current op, the OSD op thread is blocked. We tweak these parameters to 10x their defaults.

filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1048576000

7. filestore_op_threads controls the number of filesystem operation threads that execute in parallel

If the storage backend is fast enough and has enough queues to support parallel operations, it is recommended to increase this parameter, given there is enough CPU headroom.

filestore_op_threads = 6

8. journal_max_write_entries / journal_max_write_bytes (since Cuttlefish)

These throttles limit the ops and bytes for every journal write. Tweaking them may be helpful for small writes; we tweak them to 10x their defaults.

journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000

9. osd_op_num_threads_per_shard / osd_op_num_shards (since Firefly)

osd_op_num_shards sets the number of queues used to shard incoming requests, and osd_op_num_threads_per_shard is the number of threads for each queue; the right values depend on the cluster. After several performance tests with different settings, we concluded that the default parameters provide the best performance.

10. filestore_max_sync_interval (since Cuttlefish)

filestore_max_sync_interval controls the interval at which the sync thread flushes data from memory to disk. By default the filestore writes data to memory, and the sync thread is responsible for flushing it to disk; journal entries can then be trimmed. Note that a large filestore_max_sync_interval can cause performance spikes. We set this parameter to 10 seconds.

filestore_max_sync_interval = 10

11. ms_crc_data / ms_crc_header (since Cuttlefish)

Disabling CRC computation for the simple messenger can reduce CPU utilization; see the example below.
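A hedged example of the corresponding ceph.conf settings (typically placed in the [global] section):

ms_crc_data = false
ms_crc_header = false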
12. filestore_fd_cache_shards / filestore_fd_cache_size (since Firefly)

The filestore FD cache maps object names to file descriptors. filestore_fd_cache_shards sets the number of LRU caches and filestore_fd_cache_size is the size of each cache; tweaking these two parameters may reduce fd lookup time.

13. Set debug level to 0 (since Cuttlefish)

For an all-SSD Ceph cluster, setting the debug level of every subsystem to 0 will improve performance.

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

h4. RBD Tuning

To help achieve low latency on the RBD layer, we suggest the following, in addition to the CERN tuning referenced on ceph.com.

1) echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null

2) Start each ceph-osd in a dedicated cgroup with dedicated CPU cores (which should be free from any other load, even kernel load such as network interrupts).

3) Increase filestore_omap_header_cache_size and filestore_fd_cache_size for better caching (16 MB for each 500 GB of storage).

For the disk entry in libvirt, put the addresses of all three Ceph monitors, as in the sketch below.
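A minimal sketch of such a libvirt disk entry (pool/image name, secret UUID, and monitor addresses are placeholders):

<pre>
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='REPLACE-WITH-SECRET-UUID'/>
  </auth>
  <source protocol='rbd' name='rbd/vm-disk-1'>
    <host name='10.10.10.101' port='6789'/>
    <host name='10.10.10.102' port='6789'/>
    <host name='10.10.10.103' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
</pre>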
h4. RGW: Rados Gateway Tuning

1. Disable usage/access log (since Cuttlefish)

rgw enable ops log = false
rgw enable usage log = false
log file = /dev/null

We find that disabling the usage/access log improves performance.

2. Use a large cache size (since Cuttlefish)

rgw cache enabled = true
rgw cache lru size = 100000

Caching hot objects improves GET performance.

3. Use larger PG split/merge values (since Firefly)

filestore_merge_threshold = 500
filestore_split_multiple = 100

We find that PG split/merge introduces significant overhead. Using large values postpones the split/merge behavior, which helps the case where lots of small files are stored in the cluster.

4. Use a load balancer with multiple RGW instances (since Cuttlefish)

We have found that RGW has some scalability issues at present; with a single RGW instance the performance is poor. Running multiple RGW instances behind a load balancer (e.g., HAProxy) greatly improves throughput.

5. Increase the number of RADOS handles (since Hammer)

Since Hammer it is possible to use multiple RADOS handles per RGW instance. Increasing this value should improve performance; see the example below.
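A hedged example of the corresponding setting in the RGW section of ceph.conf (the value is illustrative):

rgw num rados handles = 8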
6. Use the Civetweb frontend (since Giant)

Before Giant, Apache + libfastcgi was the recommended configuration. However, libfastcgi still uses the very old 'select' mode, which was not able to handle a large amount of concurrent IO in our testing. Using the Civetweb frontend helps improve stability.

rgw frontends = civetweb port=80

7. Move the bucket index to SSD (since Giant)

Bucket index updates may become a bottleneck if there are millions of objects in a single bucket. We find that moving the bucket index to SSD storage improves performance.

8. Bucket Index Sharding (since Hammer)

We find that bucket index sharding helps when there is a large number of objects inside one bucket. However, index listing speed may be impacted. An example of the related setting is shown below.
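The sample ceph.conf in the appendix uses the following override for new buckets (the value is illustrative):

rgw override bucket index max shards = 8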
h4. Erasure Coding Tuning

1. Use a larger stripe width

The default erasure code stripe width (4K) is not optimal. We find that using a bigger value (64K) reduces CPU utilization significantly (10%+).

osd_pool_erasure_code_stripe_width = 65536

2. Use a mid-sized K

For the erasure code algorithms, we find that a mid-sized K value gives a good balance between throughput and CPU utilization. We recommend using 10+4 or 8+2 mode; a sketch follows.
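A minimal sketch of creating an erasure-coded pool with such a profile (profile name, pool name, and PG count are examples):

<pre>
ceph osd erasure-code-profile set ec-10-4 k=10 m=4
ceph osd pool create ecpool 4096 4096 erasure ec-10-4
</pre>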
h3. Appendix

h4. Sample Ceph.conf

[global]
fsid = 35b08d01-b688-4b9a-947b-bc2e25719370
mon_initial_members = gw2
mon_host = 10.10.10.105
filestore_xattr_use_omap = true
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[mon]
mon_pg_warn_max_per_osd=5000
mon_max_pool_pg_num=106496
[client]
rbd cache = false
[osd]
osd mkfs type = xfs
osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k,delaylog
osd mkfs options xfs = -f -i size=2048
filestore_queue_max_ops=5000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 8
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
public_network = 10.10.10.100/24
cluster_network = 10.10.10.100/24

[client.radosgw.gw2-1]
host = gw2
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw cache enabled = true
rgw cache lru size = 100000
rgw socket path = /var/run/ceph/ceph.client.radosgw.gw2-1.fastcgi.sock
rgw thread pool size = 256
rgw enable ops log = false
rgw enable usage log = false
log file = /dev/null
rgw frontends = civetweb port=80
rgw override bucket index max shards = 8

h4. Sample sysctl.conf

fs.file-max = 6553600
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 819200
net.ipv4.tcp_keepalive_time = 20
kernel.msgmni = 2878
kernel.sem = 256 32000 100 142
kernel.shmmni = 4096
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 1048576
net.core.wmem_max = 1048576
net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_tw_buckets = 10000

h4. All-NVMe Ceph Cluster Tuning for MySQL workload

h5. Ceph.conf

[global]
        enable experimental unrecoverable data corrupting features = bluestore rocksdb
        osd objectstore = bluestore
        ms_type = async
        rbd readahead disable after bytes = 0
        rbd readahead max bytes = 4194304
        bluestore default buffered read = true
        auth client required = none
        auth cluster required = none
        auth service required = none
        filestore xattr use omap = true
        cluster network = 192.168.142.0/24, 192.168.143.0/24
        private network = 192.168.144.0/24, 192.168.145.0/24
        log file = /var/log/ceph/$name.log
        log to syslog = false
        mon compact on trim = false
        osd pg bits = 8
        osd pgp bits = 8
        mon pg warn max object skew = 100000
        mon pg warn min per osd = 0
        mon pg warn max per osd = 32768
        debug_lockdep = 0/0
        debug_context = 0/0
        debug_crush = 0/0
        debug_buffer = 0/0
        debug_timer = 0/0
        debug_filer = 0/0
        debug_objecter = 0/0
        debug_rados = 0/0
        debug_rbd = 0/0
        debug_ms = 0/0
        debug_monc = 0/0
        debug_tp = 0/0
        debug_auth = 0/0
        debug_finisher = 0/0
        debug_heartbeatmap = 0/0
        debug_perfcounter = 0/0
        debug_asok = 0/0
        debug_throttle = 0/0
        debug_mon = 0/0
        debug_paxos = 0/0
        debug_rgw = 0/0
        perf = true
        mutex_perf_counter = true
        throttler_perf_counter = false
        rbd cache = false
[mon]
        mon data = /home/bmpa/tmp_cbt/ceph/mon.$id
        mon_max_pool_pg_num=166496
        mon_osd_max_split_count = 10000
        mon_pg_warn_max_per_osd = 10000
[mon.a]
        host = ft02
        mon addr = 192.168.142.202:6789
[osd]
        osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
        osd_mkfs_options_xfs = -f -i size=2048
        osd_op_threads = 32
        filestore_queue_max_ops=5000
        filestore_queue_committing_max_ops=5000
        journal_max_write_entries=1000
        journal_queue_max_ops=3000
        objecter_inflight_ops=102400
        filestore_wbthrottle_enable=false
        filestore_queue_max_bytes=1048576000
        filestore_queue_committing_max_bytes=1048576000
        journal_max_write_bytes=1048576000
        journal_queue_max_bytes=1048576000
        ms_dispatch_throttle_bytes=1048576000
        objecter_inflight_op_bytes=1048576000
        osd_mkfs_type = xfs
        filestore_max_sync_interval=10
        osd_client_message_size_cap = 0
        osd_client_message_cap = 0
        osd_enable_op_tracker = false
        filestore_fd_cache_size = 64
        filestore_fd_cache_shards = 32
        filestore_op_threads = 6

h5. CBT YAML

cluster:
  user: "bmpa"
  head: "ft01"
  clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
  osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
  mons:
   ft02:
     a: "192.168.142.202:6789"
  osds_per_node: 16
  fs: xfs
  mkfs_opts: '-f -i size=2048 -n size=64k'
  mount_opts: '-o inode64,noatime,logbsize=256k'
  conf_file: '/home/bmpa/cbt/ceph.conf'
  use_existing: False
  newstore_block: True
  rebuild_every_test: False
  clusterid: "ceph"
  iterations: 1
  tmp_dir: "/home/bmpa/tmp_cbt"
  pool_profiles:
    2rep:
      pg_size: 8192
      pgp_size: 8192
      replication: 2
benchmarks:
  librbdfio:
    time: 300
    ramp: 300
    vol_size: 10
    mode: ['randrw']
    rwmixread: [0,70,100]
    op_size: [4096]
    procs_per_volume: [1]
    volumes_per_client: [10]
    use_existing_volumes: False
    iodepth: [4,8,16,32,64,128]
    osd_ra: [4096]
    norandommap: True
    cmd_path: '/usr/local/bin/fio'
    pool_profile: '2rep'
    log_avg_msec: 250

h5. MySQL configuration file (my.cnf)

[client]
port            = 3306
socket          = /var/run/mysqld/mysqld.sock
[mysqld_safe]
socket          = /var/run/mysqld/mysqld.sock
nice            = 0
[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
datadir         = /data
basedir         = /usr
tmpdir          = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address            = 0.0.0.0
max_allowed_packet      = 16M
thread_stack            = 192K
thread_cache_size       = 8
query_cache_limit       = 1M
query_cache_size        = 16M
log_error = /var/log/mysql/error.log
expire_logs_days        = 10
max_binlog_size         = 100M
performance_schema=off
innodb_buffer_pool_size = 25G
innodb_flush_method = O_DIRECT
innodb_log_file_size=4G
thread_cache_size=16
innodb_file_per_table
innodb_checksums = 0
innodb_flush_log_at_trx_commit = 0
innodb_write_io_threads = 8
innodb_page_cleaners= 16
innodb_read_io_threads = 8
max_connections = 50000
[mysqldump]
quick
quote-names
max_allowed_packet      = 16M
[mysql]
!includedir /etc/mysql/conf.d/

h3. Sample Ceph Vendor Solutions

The following are pointers to Ceph solutions, but this list is not comprehensive:

https://www.dell.com/learn/us/en/04/shared-content~data-sheets~en/documents~dell-red-hat-cloud-solutions.pdf
http://www.fujitsu.com/global/products/computing/storage/eternus-cd/
http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-2799ENW.pdf
http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-8638ENW.pdf
http://www.supermicro.com/solutions/storage_ceph.cfm
https://www.thomas-krenn.com/en/products/storage-systems/suse-enterprise-storage.html
http://www.qct.io/Solution/Software-Defined-Infrastructure/Storage-Virtualization/QCT-and-Red-Hat-Ceph-Storage-p365c225c226c230

Notices:

Copyright © 2016 Intel Corporation. All rights reserved.

Intel, the Intel logo, Intel Atom, Intel Core, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

Intel® Hyper-Threading Technology available on select Intel® Core™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.