Bug #9513


rbd_cache=true default setting is degrading librbd performance ~10X in Giant

Added by Somnath Roy over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
giant,firefly,dumpling
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are experiencing severe librbd performance degradation in Giant compared to the Firefly release. Here are the experiments we ran to isolate this as a librbd problem.

1. A single OSD is running the latest Giant and the client is running fio rbd on top of Firefly-based librbd/librados. For one client it gives ~11-12K IOPS (4K random read).
2. A single OSD is running Giant and the client is running fio rbd on top of Giant-based librbd/librados. For one client it gives ~1.9K IOPS (4K random read).
3. A single OSD is running the latest Giant and the client is running a Giant-based ceph_smalliobench on top of Giant librados. For one client it gives ~11-12K IOPS (4K random read).
4. Giant RGW on top of a Giant OSD is also scaling.

So, it is obvious from the above that the recent librbd has an issue.

After setting rbd_cache = false in the conf file, I get the old performance back.
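
For reference, a minimal sketch of that workaround as a client-side override (placing it under a [client] section is an assumption; it can also go in [global], as in the full conf further below):

[client]
rbd_cache = false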



Actions #1

Updated by Sage Weil over 9 years ago

  • Project changed from Ceph to rbd
  • Source changed from other to Community (dev)
Actions #2

Updated by Adam Crume over 9 years ago

  • Assignee set to Adam Crume
Actions #3

Updated by Adam Crume over 9 years ago

In my test cluster, I'm getting the same performance with "rbd cache = false" and "rbd cache = true". Could you post the ceph.conf and fio configuration? How big is the image you're reading from, and how sparse is it? Did you build ceph yourself? If so, what is the commit number?
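
If it helps, a quick way to report the exact build is to capture the version string from the binaries on both sides (assuming the Giant build is the one on the PATH):

$ ceph -v    # prints the version string, including the git commit hash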

Actions #4

Updated by Somnath Roy over 9 years ago

Please make sure you are following these steps:

1. Build the latest Giant packages on both the cluster and the client side. Do a 'make install' after that.

2. Be sure that the client is loading the Giant librbd/librados combination, using a tool like lsof. FYI, if you have old libraries (Firefly, or even post-Firefly up to some release that I have not identified), the client (librbd/librados) will still work seamlessly with a Giant cluster.

3. I have created and filled up a 110 GB image that is not very sparse (only ~10 GB is unwritten). Here are the details of the configuration:

Image details :
--------------
root@emsserver1:/var/lib/ceph/osd/ceph-0# rbd info rbd2/ceph_regression_test3
rbd image 'ceph_regression_test3':
size 110 GB in 28160 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.191f74b0dc51
format: 2
features: layering
root@emsserver1:/var/lib/ceph/osd/ceph-0# rbd ls rbd2
ceph_regression_test3
root@emsserver1:/var/lib/ceph/osd/ceph-0#

ceph df :
---------

GLOBAL:
SIZE AVAIL RAW USED %RAW USED
1489G 1121G 367G 24.69
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 8000M 0.52 1121G 2000
data 1 85444k 0 1121G 20000
metadata 2 85444k 0 1121G 20000
seq_test 3 10277M 0.67 1121G 2307
rbd1 5 191G 12.83 1121G 51205
rbd2 6 100G 6.71 1121G 25603
big_obj 7 1061M 0.07 1121G 1

See rbd2: it is filled up to 100 GB.

This is the result with setting rbd_cache = true:
------------------------------------------------

rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31187: Fri Oct 3 14:39:07 2014
read : io=260380KB, bw=4337.4KB/s, iops=1084, runt= 60032msec
slat (usec): min=10, max=1821, avg=34.82, stdev=26.61
clat (msec): min=1, max=68, avg=29.47, stdev= 9.73
lat (msec): min=1, max=68, avg=29.50, stdev= 9.71
clat percentiles (usec): | 1.00th=[ 3632], 5.00th=[ 3952], 10.00th=[ 4448], 20.00th=[31360], | 30.00th=[32128], 40.00th=[32384], 50.00th=[33024], 60.00th=[33024], | 70.00th=[33536], 80.00th=[34048], 90.00th=[34560], 95.00th=[35072], | 99.00th=[37632], 99.50th=[39168], 99.90th=[46848], 99.95th=[49920], | 99.99th=[58624]
bw (KB /s): min= 3608, max=30904, per=100.00%, avg=4345.21, stdev=3330.47
lat (msec) : 2=0.01%, 4=5.83%, 10=6.74%, 20=0.02%, 50=87.36%
lat (msec) : 100=0.05%
cpu : usr=5.89%, sys=3.03%, ctx=366714, majf=0, minf=21765
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=4.1%, 32=95.8%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=65095/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=260380KB, aggrb=4337KB/s, minb=4337KB/s, maxb=4337KB/s, mint=60032msec, maxt=60032msec

Disk stats (read/write):
sda: ios=0/78, merge=0/79, ticks=0/96, in_queue=96, util=0.11%

On the cluster side you can see that most of the traffic is going to the OSD, so there are not many cache hits on the client side:
-----------------------------------------------------------------------------------------------
root@emsclient:/var/lib/ceph/osd/ceph-0# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_WARN mon.a low disk space
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e624: 1 osds: 1 up, 1 in
pgmap v407796: 832 pgs, 7 pools, 310 GB data, 118 kobjects
367 GB used, 1121 GB / 1489 GB avail
832 active+clean
client io 3913 kB/s rd, 978 op/s

This is the result with setting rbd_cache = false:
------------------------------------------------

rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [r(1)] [100.0% done] [33108KB/0KB/0KB /s] [8277/0/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32323: Fri Oct 3 14:42:44 2014
read : io=2440.1MB, bw=41657KB/s, iops=10414, runt= 60003msec
slat (usec): min=10, max=610, avg=34.53, stdev=19.91
clat (usec): min=569, max=80639, avg=3020.97, stdev=666.91
lat (usec): min=583, max=80664, avg=3055.50, stdev=673.95
clat percentiles (usec): | 1.00th=[ 1992], 5.00th=[ 2224], 10.00th=[ 2352], 20.00th=[ 2512], | 30.00th=[ 2640], 40.00th=[ 2768], 50.00th=[ 2928], 60.00th=[ 3120], | 70.00th=[ 3312], 80.00th=[ 3504], 90.00th=[ 3760], 95.00th=[ 4048], | 99.00th=[ 4768], 99.50th=[ 5344], 99.90th=[ 5792], 99.95th=[ 5984], | 99.99th=[13120]
bw (KB /s): min=25456, max=52960, per=100.00%, avg=41697.41, stdev=6167.31
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=1.07%, 4=93.50%, 10=5.41%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=31.78%, sys=6.69%, ctx=464331, majf=0, minf=422
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.4%, 32=66.6%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=624890/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=2440.1MB, aggrb=41657KB/s, minb=41657KB/s, maxb=41657KB/s, mint=60003msec, maxt=60003msec

Disk stats (read/write):
sda: ios=0/67, merge=0/37, ticks=0/68, in_queue=68, util=0.02%

Again, most of the traffic is going to the cluster, so there are not many cache hits on the client side:
-----------------------------------------------------------------------------------------------

root@emsclient:/var/lib/ceph/osd/ceph-0# ceph -s
cluster 94991097-7638-4240-b922-f525300a9026
health HEALTH_WARN mon.a low disk space
monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
osdmap e624: 1 osds: 1 up, 1 in
pgmap v407801: 832 pgs, 7 pools, 310 GB data, 118 kobjects
367 GB used, 1121 GB / 1489 GB avail
832 active+clean
client io 40162 kB/s rd, 10040 op/s

So, you can see there is ~10x degradation between rbd_cache = true and rbd_cache = false.

Here is the fio_rbd config:
--------------------------

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=rbd2
rbdname=ceph_regression_test3
invalidate=0 # mandatory
rw=randread
bs=4k
direct=1
time_based
runtime=1m
size=100G
numjobs=1
[rbd_iodepth32]
iodepth=32
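
The job file above is then run with an rbd-enabled fio build, for example (the job file name is an assumption):

$ fio rbd_randread_4k.fio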

Ceph.conf client side:
----------------------

[global]

filestore_xattr_use_omap = true
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
osd_op_threads = 2
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 25
#osd_op_num_sharded_pool_thread = 25
filestore_op_threads = 4
ms_nocrc = true
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
cephx sign messages = false
cephx require signatures = false
ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false
#ms_tcp_nodelay = false
osd_pool_default_size = 1
osd_pool_default_min_size = 1
rbd_cache = false

[osd]

osd_journal_size = 150000
# Execute $ hostname to retrieve the name of your host,
# and replace {hostname} with the name of your host.
# For the monitor, replace {ip-address} with the IP
# address of your host.
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
#osd objectstore = fdfstore

Ignore these osd/cluster side settings; I am using the same conf file on both sides, except for rbd_cache = true/false.

Let me know if you need further information.

Actions #5

Updated by Adam Crume over 9 years ago

Thanks for the details. Unfortunately, I'm still unable to reproduce the issue. Was your cluster created in Firefly (or before) and then upgraded to Giant, or was it created in Giant? Could you change debug_rbd, debug_objectcacher, and debug_rados to 20 (note that it should be "cacher", not "catcher") and re-run with rbd_cache enabled and disabled and attach the client logs?
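
For reference, a sketch of how those debug levels could be set in the client's ceph.conf (the [client] section placement and the log file path are assumptions):

[client]
debug_rbd = 20
debug_objectcacher = 20
debug_rados = 20
log_file = /var/log/ceph/rbd-client.$pid.log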

It's interesting that the CPU usage is much lower when caching is enabled, even though the latency is much higher. That suggests to me that the latency might be coming from the OSD for some reason. I don't know why the OSD's latency would change based on the rbd_cache setting, though...

Actions #6

Updated by Somnath Roy over 9 years ago

Hopefully you made sure that the librbd/librados libraries fio rbd is loading are from Giant. As I said, after replacing the libraries with the Firefly ones I do not see the issue.
My single-node (single OSD) dev server is built from the latest Giant. I don't think there is any problem on the cluster side; the low CPU usage on the client side is probably due to some kind of serialization going on in the librbd code base with rbd_cache = true. Please find the debug files you requested attached here.

Actions #7

Updated by Adam Crume over 9 years ago

I double-checked, and I had multiple versions of librbd on my path. (I forgot about installing one of them.) I removed the incorrect version, and now I'm able to reproduce the issue. (Sorry about that.)
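
As an aside, a quick way to confirm which librbd a running fio process has actually mapped (a sketch; selecting the process by name is an assumption):

$ lsof -p "$(pgrep -n fio)" | grep librbd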

One thing that stands out is many entries like this:

2014-10-06 13:00:39.069946 7f1b58ef8700 10 objectcacher readx missed, waiting on cache to free 12288 bytes
2014-10-06 13:00:39.069949 7f1b58ef8700 20 objectcacher readx defer 0x7f1b5c60b2d0

that show up when caching is enabled in Giant but not in Firefly. I don't know yet if it's a cause or a symptom, but it's a lead.
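
A quick way to count how often this shows up in the client logs (the log file name is an assumption):

$ grep -c 'readx missed, waiting on cache to free' rbd-client.log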

Actions #8

Updated by Somnath Roy over 9 years ago

Good to know that you are able to reproduce this.
I think the log entries you mentioned are there in Firefly as well. The difference is that in Giant rbd_cache is enabled by default. I think you will end up with the same poor result in Firefly as well if you set rbd_cache = true there. Probably it's the caching implementation that is serializing the requests on the client side.

Actions #9

Updated by Adam Crume over 9 years ago

  • Status changed from New to In Progress

The regression was introduced in commit 4fc9fffc494abedac0a9b1ce44706343f18466f1 (according to git bisect). This is the commit that introduced the "readx missed, waiting on cache to free XXX bytes" log entries quoted above. The problem is that the cache thinks it's full (when it's not), so it defers the read.

I have a fix committed, but it needs further testing before being merged.
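
For context, the bisect described above could be driven roughly like this (a sketch; the good/bad refs are assumptions, and only the resulting commit hash comes from this report):

$ git bisect start
$ git bisect bad giant       # a build showing ~1.9K IOPS with rbd_cache = true
$ git bisect good v0.80      # a Firefly build showing ~11-12K IOPS
# build and install ceph, re-run the fio job, then mark each step:
$ git bisect good            # or: git bisect bad
# ...until git identifies 4fc9fffc494abedac0a9b1ce44706343f18466f1 as the first bad commit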

Actions #10

Updated by Adam Crume over 9 years ago

  • Status changed from In Progress to Fix Under Review
Actions #11

Updated by Josh Durgin over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #12

Updated by Josh Durgin over 9 years ago

commit:82175ec94acc89dc75da0154f86187fb2e4dbf5e

Actions #13

Updated by Josh Durgin over 9 years ago

  • Status changed from Pending Backport to Resolved

backported in commit:65be257e9295619b960b49f6aa80ecdf8ea4d16a

Actions #14

Updated by Josh Durgin over 9 years ago

  • Status changed from Resolved to Pending Backport

Reverted the backport for now, as fully fixing the ObjectCacher is too large a change this close to the Giant release.

Actions #15

Updated by Josh Durgin over 9 years ago

  • Backport changed from Giant to giant,firefly,dumpling
Actions #16

Updated by Jason Dillaman over 9 years ago

  • Status changed from Pending Backport to Resolved

Reverted in master commit:b808cdfaa8823f0747f78938f3ed9a7a75e9bed1
Reverted in Giant commit:3b1eafcabb6139133b5ff0bd79cb1cd35e28ceb7

Original ObjectCacher memory issue resolved under #9854

Actions #17

Updated by Ken Dreyer about 9 years ago

It sounds like this bug affected Firefly v0.80.8? I see "firefly" is listed in the Backport field, but I don't see a patch on the firefly git branch related to this. What am I missing?

Actions #18

Updated by Ken Dreyer about 9 years ago

To answer my question from above: this got resolved for firefly in #9854 (PR https://github.com/ceph/ceph/pull/3410)
