Bug #9469

closed

RadosGW performance degrades with high concurrency workload.

Added by pushpesh sharma over 9 years ago. Updated over 9 years ago.

Status: Rejected
Priority: Normal
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Description

I am running COSBench as a performance benchmarking tool against a Ceph cluster (Swift API). Setup details are as follows:

Each node has:
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
20 CPU cores (40 with HT)
64 GB of RAM
40 Gb NIC

3 storage nodes (each running 10 OSDs; each OSD maps to an SSD, with the journal on the same disk)
3 monitor nodes
RadosGW installed on one of the monitors
1 client running COSBench (driver and controller on the same node)

My ceph.conf looks like this:

##################################################
[global]
fsid = 0d926618-e29a-4525-888a-c6758602258e
mon_initial_members = mon-1, mon-2, mon-3
mon_host = XX.XX.XX.XX, XX.XX.XX.XX, XX.XX.XX.XX
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
rgw_thread_pool_size = 500 #(started with default 100, but this value gives good throughput)
rgw_cache_enabled = false
debug_paxos = 0

[osd]
osd_disk_threads = 40
osd_op_threads = 40
osd_recovery_threads = 40
osd_max_backfills = 40
filestore_op_threads = 40
keyvaluestore_op_threads = 40
mon_max_pool_pg_num = 65536
osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_pool_default_size = 3
osd_pool_default_min_size = 2

[client.radosgw.gateway]
host = rack6-ramp-1
keyring = /etc/ceph/ceph.client.admin.keyring
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/radosgw/radosgw.rack6-ramp-1.log
rgw_dns_name = rack6-ramp-1
rgw_ops_log_rados = true    # temporarily enabled for debugging
rgw_enable_ops_log = true   # temporarily enabled for debugging
#############################################################
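
To confirm that overrides like rgw_thread_pool_size actually take effect in the running radosgw, I check them through the admin socket with a small script along these lines (the socket path below is an assumption and depends on your run dir and instance name):

import json
import subprocess

# Assumed admin-socket path for the gateway instance; adjust to your setup.
ASOK = "/var/run/ceph/ceph-client.radosgw.gateway.asok"

def show_config(keys):
    """Dump the running daemon's config via the admin socket and print selected keys."""
    out = subprocess.check_output(["ceph", "--admin-daemon", ASOK, "config", "show"])
    cfg = json.loads(out.decode())
    for key in keys:
        print("%s = %s" % (key, cfg.get(key)))

show_config(["rgw_thread_pool_size", "rgw_cache_enabled", "rgw_enable_ops_log"])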

Workload details are as follows (a rough reproduction sketch is included after the list):
1. Containers: 1000
2. Objects: 1000 per container
3. Object size: 4 KB
4. Operation ratios: 100% write followed by 100% read
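
The real runs are driven by COSBench, but the shape of the write phase is roughly equivalent to the sketch below (the endpoint, auth token, and container/object names are placeholders I made up; containers are assumed to already exist):

import concurrent.futures
import requests

# Placeholder values -- the actual test is driven by COSBench against the RGW Swift API.
SWIFT_URL = "http://rack6-ramp-1/swift/v1"   # assumed Swift endpoint exposed by radosgw
TOKEN = "AUTH_tk_example"                    # assumed auth token
HEADERS = {"X-Auth-Token": TOKEN}
PAYLOAD = b"x" * 4096                        # 4 KB objects
WORKERS = 60                                 # the knob varied in the tests (20-200)

def put_object(container, obj):
    """PUT one 4 KB object into the given container and return the HTTP status."""
    url = "%s/%s/%s" % (SWIFT_URL, container, obj)
    resp = requests.put(url, headers=HEADERS, data=PAYLOAD)
    return resp.status_code

def run():
    # 1000 containers x 1000 objects, written through a fixed-size worker pool.
    tasks = (("cont-%d" % c, "obj-%d" % o)
             for c in range(1000) for o in range(1000))
    ok = failed = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for status in pool.map(lambda t: put_object(*t), tasks):
            if 200 <= status < 300:
                ok += 1
            else:
                failed += 1
    print("success ratio: %.2f%%" % (100.0 * ok / (ok + failed)))

if __name__ == "__main__":
    run()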

I am not hitting any memory, CPU, or network limits while performing the test, so the hardware is not the limiting factor. I monitored the hardware resources continuously.

So my observations are twofold:

1. The success ratio, which was 100% with low COSBench worker-thread counts (20-40), degrades to 99.5-99.8% after raising the worker count to 60. For a worker count of 200, the success ratio for write requests was as low as 50-60%. The poor success ratio of write requests also dragged the related read-request success ratio down to 50-60%. So write requests are hit hard by higher concurrency.

2. Throughput at 60 worker threads was Write: 549.95 op/s, Read: 2219.74 op/s.
An increase of ~16% in worker threads (from 60 to 70) resulted in a ~13% drop in write performance and a ~36% drop in read performance, while the success ratio remained around 99.7% (so no change in the impact of failed write requests on read requests).

At 70 worker threads:
Write: 476.85 op/s, Read: 1412.15 op/s
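
The percentages above are computed directly from those numbers; for example:

# Relative changes between the 60-thread and 70-thread runs.
write_60, read_60 = 549.95, 2219.74
write_70, read_70 = 476.85, 1412.15

print("worker increase: %.1f%%" % (100 * (70 - 60) / 60.0))                   # ~16.7%
print("write drop:      %.1f%%" % (100 * (write_60 - write_70) / write_60))   # ~13.3%
print("read drop:       %.1f%%" % (100 * (read_60 - read_70) / read_60))      # ~36.4%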

I believe RadosGW has issues with higher concurrency; it is not able to utilize the hardware resources even when they are available.
It is possible that I am doing something wrong; if I am missing some configuration, please enlighten me.
I have not done any tuning on Apache or FastCGI; I am using all the defaults.
