Support #37750

Ceph unable to take full advantage of NVMe hardware capabilities

Added by yi li over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Tags: ceph
Reviewed: 12/24/2018
Affected Versions:
Pull request ID:

Description

Hello

My Ceph version is:
[root@ceph-master ~]# ceph -v
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)

My OS version is:
[root@ceph-master ~]# cat /proc/version
Linux version 4.17.4-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)) #1 SMP Tue Jul 3 09:40:42 EDT 2018

I built a Ceph cluster of three OSD data nodes backed by Intel Optane 900p devices (all three OSDs are located on one physical machine) to look for Ceph's performance bottleneck. The bottleneck appears to be mainly in the OSD layer, not in the client or the BlueStore storage layer. The test analysis is as follows:
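
The report does not include the setup commands; a minimal sketch of how a single-machine, three-OSD BlueStore cluster with a one-replica pool might be built with ceph-volume (the device paths /dev/nvme0n1 through /dev/nvme2n1 and the PG count are assumptions; only the pool name "test" comes from the rados bench command below):

ceph-volume lvm create --bluestore --data /dev/nvme0n1
ceph-volume lvm create --bluestore --data /dev/nvme1n1
ceph-volume lvm create --bluestore --data /dev/nvme2n1
ceph osd pool create test 128 128
ceph osd pool set test size 1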

Bare device performance benchmark:
The throughput limit of a single 900p NVMe is around 2GB/s.
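
How this baseline was measured is not stated; a common way is a direct sequential-write run with fio, sketched below. The device path, block size, and queue depth are assumptions, and a raw write test like this destroys any data on the device.

fio --name=baseline --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=write --bs=4M --iodepth=32 --runtime=60 --time_based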

Test 1:
With the replication count set to 1 and a single client (running on the same physical machine as the Ceph cluster): the test with all three OSDs serving shows each NVMe reaching about 700 MB/s with an IO queue length of about 6, while the test with only one OSD serving shows the NVMe reaching about 1.5 GB/s with an IO queue length of about 12 (throughput and IO queue length were obtained with iostat).

Test 2:
With the replication count set to 1 and only one OSD serving: under load from one client, the NVMe reaches 1.5 GB/s with an IO queue length of about 6; under load from three clients concurrently, the NVMe reaches 1.9 GB/s and the IO queue length can reach around 300.

Related test commands:
iostat -x 1
rados bench -p test 10 write --no-cleanup
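
For the multi-client run in test 2, the clients would presumably have been started concurrently against the same pool (each rados bench instance in its own shell) while iostat samples the devices; a hedged sketch, where -t 16 is just rados bench's default number of concurrent operations and the --run-name values are illustrative:

iostat -x 1
rados bench -p test 10 write -t 16 --no-cleanup --run-name client1
rados bench -p test 10 write -t 16 --no-cleanup --run-name client2
rados bench -p test 10 write -t 16 --no-cleanup --run-name client3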

Conclusion:

The results of test 1 show that the Ceph performance bottleneck is not in BlueStore: when a single OSD provides service, the throughput of BlueStore's underlying disk is basically at the NVMe hardware limit, so the bottleneck must be in the OSD layer or the client layer. Further, combining the results of tests 2 and 3: when one OSD provides service and the number of clients is increased, cluster performance improves; whereas when the number of clients is increased and three OSDs provide service, cluster performance even decreases slightly. This indicates that the OSD layer is the performance bottleneck.

My question is:

Why is Ceph unable to take full advantage of NVMe hardware capabilities when there are 3 OSDs, and where are the bottlenecks?

And are there any plans to improve this going forward?

Thank you very much!!!

#1

Updated by Greg Farnum over 5 years ago

  • Status changed from New to Closed

You can go to the mailing list or search for presentations on NVMe performance for tuning advice. :)
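
The tuning advice in those presentations typically covers OSD op-queue sharding, BlueStore cache sizing, and running more than one OSD per NVMe device; a purely illustrative ceph.conf sketch (the option values are made up and are not a recommendation):

[osd]
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2
bluestore_cache_size_ssd = 4294967296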
