Bug #9469


RadosGW performance degrades with high concurrency workload.

Added by pushpesh sharma over 9 years ago. Updated over 9 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: other
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I am running COSBench as a performance benchmarking tool on a Ceph cluster (Swift API). Setup details are as follows:

Each node has:
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
20 CPU cores (40 with HT)
64 GB of RAM
40 Gb NIC

3 storage nodes (each running 10 OSDs; each OSD maps to an SSD, with the journal on the same disk)
3 monitor nodes
RadosGW installed on one of the monitors
1 client running COSBench (driver and controller on the same node)

My ceph.conf looks like this:
##################################################
[global]
fsid = 0d926618-e29a-4525-888a-c6758602258e
mon_initial_members = mon-1, mon-2, mon-3
mon_host = XX.XX.XX.XX,XX.XX.XX.XX,XX.XX.XX.XX
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
rgw_thread_pool_size = 500  # started with the default of 100, but this value gives better throughput
rgw_cache_enabled = false
debug_paxos = 0

[osd]
osd_disk_threads = 40
osd_op_threads = 40
osd_recovery_threads = 40
osd_max_backfills = 40
filestore_op_threads = 40
keyvaluestore_op_threads = 40
mon_max_pool_pg_num = 65536
osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_pool_default_size = 3
osd_pool_default_min_size = 2

[client.radosgw.gateway]
host = rack6-ramp-1
keyring = /etc/ceph/ceph.client.admin.keyring
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/radosgw/radosgw.rack6-ramp-1.log
rgw_dns_name = rack6-ramp-1
rgw_ops_log_rados = true  # temporarily enabled for debugging
rgw_enable_ops_log = true  # temporarily enabled for debugging
#############################################################
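
As a sanity check when tuning values like rgw_thread_pool_size, it can help to confirm what the running gateway actually loaded. Below is a minimal sketch that reads the effective configuration over the radosgw admin socket; the socket path is an assumption for this setup and should be adjusted to whatever the gateway actually exposes.

# Print a few concurrency-related settings as the running radosgw sees them.
# The admin socket path below is an assumption; adjust it for your deployment.
import json
import subprocess

SOCKET = "/var/run/ceph/ceph-client.radosgw.gateway.asok"  # assumed path

out = subprocess.check_output(["ceph", "--admin-daemon", SOCKET, "config", "show"])
cfg = json.loads(out.decode("utf-8"))

for key in ("rgw_thread_pool_size", "rgw_cache_enabled", "rgw_enable_ops_log"):
    print("%s = %s" % (key, cfg.get(key)))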

Workload details are as follows:
1. Containers: 1000
2. Objects: 1000 per container
3. Object size: 4 KB
4. Operation ratios: 100% write followed by 100% read
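
For reference, the same write-then-read pattern can be approximated outside COSBench with a small script, which makes it easy to vary the worker count and watch where failures start. This is only a rough sketch using python-swiftclient; the auth endpoint, credentials, object counts, and naming below are placeholders rather than the actual test values.

# Rough reproduction of the 4 KB write-then-read workload at a configurable
# concurrency. Endpoint, credentials, and counts are placeholders.
from concurrent.futures import ThreadPoolExecutor

import swiftclient
from swiftclient.exceptions import ClientException

AUTH_URL = "http://rack6-ramp-1/auth/v1.0"   # placeholder Swift auth endpoint
USER, KEY = "test:tester", "testing"         # placeholder credentials
WORKERS = 60                                 # vary: 20, 40, 60, 70, 200
PAYLOAD = b"x" * 4096                        # 4 KB object

def one_op(i, write):
    conn = swiftclient.client.Connection(authurl=AUTH_URL, user=USER, key=KEY)
    container, obj = "mycontainers%d" % (i % 1000), "myobjects%d" % i
    try:
        if write:
            conn.put_container(container)
            conn.put_object(container, obj, contents=PAYLOAD)
        else:
            conn.get_object(container, obj)
        return True
    except ClientException:
        return False
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    writes = list(pool.map(lambda i: one_op(i, True), range(10000)))
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    reads = list(pool.map(lambda i: one_op(i, False), range(10000)))

print("write success ratio: %.2f%%" % (100.0 * sum(writes) / len(writes)))
print("read success ratio:  %.2f%%" % (100.0 * sum(reads) / len(reads)))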

I am not hitting any memory, CPU, or network limits while performing the test, so the hardware is not the bottleneck. I constantly monitored the hardware resources.

So my observations are two fold:

1. The success ratio, which was 100% with COSBench's low worker-thread counts (20-40), degrades to 99.5-99.8% after setting the worker-thread count to 60. For a worker count of 200, the success ratio for write requests was as low as 50-60%. The poor success ratio for writes also dragged the success ratio of the related read requests down to 50-60%. So write requests are hit hard by higher concurrency.

2. Throughput with 60 worker threads was Write: 549.95 op/s, Read: 2219.74 op/s.
An increase of ~16% in worker threads (from 60 to 70) resulted in a 13% drop in write performance and a 36% drop in read performance, while the success ratio remained around 99.7% (so no change in the impact of failed write requests on read requests).

With 70 worker threads:
Write: 476.85 op/s, Read: 1412.15 op/s
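
The quoted percentages follow directly from the two measurements; a quick worked check:

# Worked check of the quoted drops when going from 60 to 70 worker threads.
write_60, read_60 = 549.95, 2219.74   # op/s at 60 workers
write_70, read_70 = 476.85, 1412.15   # op/s at 70 workers

print("worker-thread increase: %.1f%%" % (100.0 * (70 - 60) / 60))                    # ~16.7%
print("write throughput drop:  %.1f%%" % (100.0 * (write_60 - write_70) / write_60))  # ~13.3%
print("read throughput drop:   %.1f%%" % (100.0 * (read_60 - read_70) / read_60))     # ~36.4%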

I believe RadosGW has issues with higher concurrency; it is not able to utilize the hardware resources even though they are available.
It is possible I am doing something wrong; if I am missing some configuration, please enlighten me.
I have not done any tuning on Apache and FastCGI; I am using all defaults.

Update #1

Updated by pushpesh sharma over 9 years ago

Debugging further, I was able to narrow down the root cause. I enabled debug logs for radosgw (20/20) and enabled access logs as well, then ran the same workload again. Here is how two GET requests traversed the gateway.
A successful request's transaction looks like this:
########################################
2014-09-19 15:05:55.255393 7fcc2eff5700 20 QUERY_STRING=page=swift&params=/v1/mycontainers199/myobjects504
2014-09-19 15:05:55.255397 7fcc2eff5700 20 REQUEST_URI=/swift/v1/mycontainers199/myobjects504
2014-09-19 15:05:55.255398 7fcc2eff5700 20 SCRIPT_NAME=/swift/v1/mycontainers199/myobjects504
2014-09-19 15:05:55.255399 7fcc2eff5700 20 SCRIPT_URI=http://rack6-ramp-1/swift/v1/mycontainers199/myobjects504
2014-09-19 15:05:55.255399 7fcc2eff5700 20 SCRIPT_URL=/swift/v1/mycontainers199/myobjects504
2014-09-19 15:05:55.255411 7fcc2eff5700 2 req 239364:0.000006::GET /swift/v1/mycontainers199/myobjects504::initializing
2014-09-19 15:05:55.255436 7fcc2eff5700 2 req 239364:0.000030:swift:GET /swift/v1/mycontainers199/myobjects504::getting op
2014-09-19 15:05:55.255438 7fcc2eff5700 2 req 239364:0.000033:swift:GET /swift/v1/mycontainers199/myobjects504:get_obj:authorizing
2014-09-19 15:05:55.255468 7fcc2eff5700 2 req 239364:0.000063:swift:GET /swift/v1/mycontainers199/myobjects504:get_obj:reading permissions
2014-09-19 15:04:24.716883 7fcbeaffd700 2 req 54129:0.008296:swift:GET /swift/v1/mycontainers199/myobjects504:get_obj:init op
2014-09-19 15:04:24.716887 7fcbeaffd700 2 req 54129:0.008300:swift:GET /swift/v1/mycontainers199/myobjects504:get_obj:verifying op mask
2014-09-19 15:04:24.716893 7fcbeaffd700 2 req 54129:0.008306:swift:GET /swift/v1/mycontainers199/myobjects504:get_obj:verifying op permissions
2014-09-19 15:04:24.717230 7fcbeaffd700 2 req 54129:0.008643:swift:GET /swift/v1/mycontainers199/myobjects504:get_obj:verifying op params
2014-09-19 15:04:24.717238 7fcbeaffd700 2 req 54129:0.008651:swift:GET /swift/v1/mycontainers199/myobjects504:get_obj:executing
2014-09-19 15:04:24.717250 7fcbeaffd700 20 get_obj_state: rctx=0x7fcbeaffc1d0 obj=mycontainers199:myobjects504 state=0x7fc9ec0098c8 s->prefetch_data=1
2014-09-19 15:04:24.717298 7fcbeaffd700 20 get_obj_state: rctx=0x7fcbeaffc1d0 obj=mycontainers199:myobjects504 state=0x7fc9ec0098c8 s->prefetch_data=1
2014-09-19 15:04:24.717308 7fcbeaffd700 20 get_obj_state: rctx=0x7fcbeaffc1d0 obj=mycontainers199:myobjects504 state=0x7fc9ec0098c8 s->prefetch_data=1
2014-09-19 15:04:24.717507 7fcbeaffd700 2 req 54129:0.008920:swift:GET /swift/v1/mycontainers199/myobjects504:get_obj:http status=200
A failed request's transaction looks like this:
2014-09-19 15:04:24.717387 7fcc257e2700 20 QUERY_STRING=page=swift&params=/v1/mycontainers186/myobjects801
2014-09-19 15:04:24.717412 7fcc257e2700 20 REQUEST_URI=/swift/v1/mycontainers186/myobjects801
2014-09-19 15:04:24.717428 7fcc257e2700 20 SCRIPT_NAME=/swift/v1/mycontainers186/myobjects801
2014-09-19 15:04:24.717430 7fcc257e2700 20 SCRIPT_URI=http://rack6-ramp-1/swift/v1/mycontainers186/myobjects801
2014-09-19 15:04:24.717431 7fcc257e2700 20 SCRIPT_URL=/swift/v1/mycontainers186/myobjects801
2014-09-19 15:04:24.717450 7fcc257e2700 2 req 54145:0.000009::GET /swift/v1/mycontainers186/myobjects801::initializing
2014-09-19 15:04:24.717476 7fcc257e2700 2 req 54145:0.000034:swift:GET /swift/v1/mycontainers186/myobjects801::getting op
2014-09-19 15:04:24.717480 7fcc257e2700 2 req 54145:0.000039:swift:GET /swift/v1/mycontainers186/myobjects801:get_obj:authorizing
2014-09-19 15:04:24.717527 7fcc257e2700 2 req 54145:0.000086:swift:GET /swift/v1/mycontainers186/myobjects801:get_obj:reading permissions

Point to note here: there is nothing relevant in the logs between this message and the next one. It means this request did not proceed any further because the read_permission() step timed out; some other thread got priority over the resource being read, i.e. over the '.users' or '.users.swift' pool.

2014-09-19 15:04:24.725050 7fcc257e2700 2 req 54145:0.007608:swift:GET /swift/v1/mycontainers186/myobjects801:get_obj:http status=404
############################################
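
Rather than eyeballing individual traces, the level-2 "req <id>:<elapsed>:...:<state>" lines can be parsed to flag every request that stalls between two consecutive states (such as the gap after "reading permissions" above). A minimal sketch, assuming the log format shown in these excerpts; the 5 ms threshold is an arbitrary choice.

# Flag radosgw requests that spend unusually long between two consecutive
# level-2 states, e.g. between "reading permissions" and the next step.
# The line format is assumed from the excerpts above.
import re
import sys
from collections import defaultdict

LINE = re.compile(r" req (\d+):([\d.]+):[^:]*:(?:GET|PUT|POST|HEAD|DELETE) (\S+):[^:]*:(.+)$")

states = defaultdict(list)              # request id -> [(elapsed, state), ...]
for line in open(sys.argv[1]):
    m = LINE.search(line)
    if m:
        req_id, elapsed, _url, state = m.groups()
        states[req_id].append((float(elapsed), state.strip()))

for req_id, steps in states.items():
    steps.sort()
    gaps = [(b[0] - a[0], a[1], b[1]) for a, b in zip(steps, steps[1:])]
    if not gaps:
        continue
    gap, frm, to = max(gaps)
    if gap > 0.005:                     # flag anything stalled for more than 5 ms
        print("req %s stalled %.3fs between '%s' and '%s'" % (req_id, gap, frm, to))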

Update #2

Updated by Sage Weil over 9 years ago

  • Status changed from New to Rejected

please send an email about this to ceph-devel; that is a better forum to discuss performance issues.
