Bug #23853
Inefficient implementation - very long query time for "rbd ls -l" queries
Description
We are trying to integrate a storage repository into XenServer.
Summary:
The slowness is a real pain for us, because it prevents the XenServer storage repository from working efficiently.
Gathering information for Xen pools with hundreds of virtual machines (using "--format json") would be painful.
The high user-time consumption and the huge number of threads suggest that something in the "rbd" utility is really inefficient.
So what can I do to make "rbd ls -l" faster, or to obtain comparable snapshot hierarchy information in some other way?
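One possible workaround, until the tool itself is faster, might be to combine the fast plain listing with per-image parent lookups run in parallel. A minimal sketch, assuming "rbd info --format json" reports a "parent" object for clones (as its plain output suggests) and that jq is available:
<pre><code class="text">
#!/bin/sh
# Sketch: fast name listing via "rbd ls", then parallel per-image
# parent lookups via "rbd info --format json".
# Assumes jq is installed; the pool name is taken from this report.
export POOL="RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c"

rbd ls --pool "$POOL" | xargs -P 8 -I{} sh -c '
  rbd info --format json --pool "$POOL" --image "$1" |
    jq -r "[.name, (.parent.image // \"-\"), (.parent.snapshot // \"-\")] | @tsv"
' _ {}
</code></pre>
Each job prints a single short tab-separated line, so interleaved output from the parallel workers is unlikely to corrupt individual lines.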
We have a strange problem when listing the images of an SSD pool.
# time rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
NAME SIZE PARENT FMT PROT LOCK
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE 2
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE 2 yes
..
__srlock__ 0 2
real 0m8.726s
user 0m8.506s
sys 0m0.591s
===> This is incredibly slow for outputting 105 lines.
Without "-l" it is pretty fast (unfortunately I need the additional information):
# time rbd ls --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81
...
__srlock__
real 0m0.024s
user 0m0.015s
sys 0m0.008s
===> This is incredibly fast for outputting 71 lines.
The "@BASE" snapshots are created by the following procedure (sketched in full below):
rbd snap create
rbd snap protect
rbd clone
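For reference, a sketch of that procedure end-to-end, with hypothetical placeholder names ("mypool", "parent", "child" are not from this report):
<pre><code class="text">
# Sketch of the snapshot/clone procedure with placeholder names.
rbd snap create mypool/parent@BASE          # take the snapshot
rbd snap protect mypool/parent@BASE         # protect it so it can be cloned
rbd clone mypool/parent@BASE mypool/child   # create a copy-on-write clone
</code></pre>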
It seems that image lookups in RBD pools are performed via an object named "rbd_directory" which resides in the pool.
Querying this object, which holds 142 entries, takes only 0.024 seconds (a lighter key-count query is sketched after the output below):
# time rados -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c listomapvals rbd_directory
id_12da262ae8944a
value (14 bytes) :
00000000 0a 00 00 00 5f 5f 73 72 6c 6f 63 6b 5f 5f |....__srlock__|
0000000e
id_12f2943d1b58ba
value (44 bytes) :
00000000 28 00 00 00 52 42 44 2d 31 62 34 32 31 38 39 31 |(...RBD-1b421891|
00000010 2d 34 34 31 63 2d 34 35 33 30 2d 62 64 66 33 2d |-441c-4530-bdf3-|
00000020 61 64 32 62 31 31 34 61 36 33 66 63 |ad2b114a63fc|
0000002c
...
real 0m0.024s
user 0m0.023s
sys 0m0.000s
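If only the number of directory entries is of interest, listing the omap keys is an even lighter query; a sketch using the same pool (the id_* keys visible above suggest the directory holds two keys per image, an id_* and a name_* entry, which would be consistent with 142 entries for 71 images):
<pre><code class="text">
# Sketch: count the entries in the rbd_directory object (keys only, no values).
rados -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
  listomapkeys rbd_directory | wc -l
</code></pre>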
I also analyzed the state of the OSD holding this object:
# ceph osd map RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c rbd_directory
osdmap e7400 pool 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c' (14) object 'rbd_directory' -> pg 14.30a98c1c (14.1c) -> up ([36,0,38], p36) acting ([36,0,38], p36)
Repeating the queries resulted in only 3.6% CPU usage on that OSD, and the logs do not provide any useful information.
Analyzing the command with strace suggests that something is wrong with the "rbd" command implementation (a way to double-check the thread count is sketched after the summary below):
# strace -f -c rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c > /dev/null
strace: Process 50286 attached
strace: Process 50287 attached
strace: Process 50288 attached
strace: Process 50289 attached
strace: Process 50290 attached
strace: Process 50291 attached
strace: Process 50292 attached
....
!!! 2086 threads !!!
....
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.42 219.207328 2020 108534 20241 futex
1.47 3.265517 162 20099 epoll_wait
0.06 0.131473 3 46377 22894 read
0.02 0.053740 4 14627 sendmsg
0.01 0.017225 2 10020 write
0.00 0.008975 4 2001 munmap
0.00 0.007260 3 2170 epoll_ctl
0.00 0.007171 3 2088 madvise
0.00 0.003139 1 4166 rt_sigprocmask
0.00 0.002670 1 2086 prctl
0.00 0.002494 1 3381 mprotect
0.00 0.002315 1 2087 mmap
0.00 0.002120 1 2087 set_robust_list
0.00 0.002098 1 2084 gettid
0.00 0.001833 1 2086 clone
0.00 0.001152 8 136 87 connect
0.00 0.000739 7 102 close
0.00 0.000623 13 49 shutdown
0.00 0.000622 6 110 fcntl
0.00 0.000469 10 49 socket
0.00 0.000466 6 73 29 open
0.00 0.000456 9 49 setsockopt
0.00 0.000166 6 27 27 access
0.00 0.000164 4 37 fstat
0.00 0.000070 4 17 brk
0.00 0.000047 5 10 5 stat
0.00 0.000013 4 3 pipe
0.00 0.000012 4 3 epoll_create
0.00 0.000010 1 11 rt_sigaction
0.00 0.000007 7 1 lseek
0.00 0.000007 7 1 execve
0.00 0.000006 2 3 uname
0.00 0.000005 3 2 sysinfo
0.00 0.000005 5 1 arch_prctl
0.00 0.000005 5 1 membarrier
0.00 0.000003 3 1 1 ioctl
0.00 0.000003 3 1 getrlimit
0.00 0.000003 3 1 set_tid_address
0.00 0.000000 0 1 getuid
------ ----------- ----------- --------- --------- ----------------
100.00 222.720411 224582 43284 total
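To verify the thread count independently of the summary table, counting the clone() calls in a full trace works as a sketch (writing the trace to a file keeps it out of the rbd output):
<pre><code class="text">
# Sketch: count thread creations (clone() calls) during "rbd ls -l".
strace -f -e trace=clone -o /tmp/rbd-ls.strace \
  rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c > /dev/null
grep -c 'clone(' /tmp/rbd-ls.strace
</code></pre>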
Our System:
- Luminous/12.2.5
- Ubuntu 16.04
- 5 OSD nodes (24 × 8 TB HDD OSDs, 48 × 1 TB SSD OSDs, BlueStore, 6 GB cache size per OSD, 192 GB RAM, 56 HT CPUs)
- 3 mons (64 GB RAM, 200 GB SSD, 4 visible CPUs)
- 2 × 10 GBit, SFP+, bonded, xmit_hash_policy layer3+4
- the cluster is currently in a preproduction state; there is no significant load
History
#1 Updated by Marc Schöchlin almost 6 years ago
If I invoke this without parallel thread execution, I get the following result:
# time rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c --rbd_concurrent_management_ops=1
NAME SIZE PARENT
RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE 2
__srlock__ 0 2
real 0m18.562s
user 0m12.513s
sys 0m0.793s
I have attached the complete JSON output.
#2 Updated by Marc Schöchlin almost 6 years ago
According to the following, latency between the client and the OSD should not be the problem (and given the high amount of user time in the measurement above, network communication should not be the bottleneck anyway).
Finding the involved osd:
# ceph osd map RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c rbd_directory
osdmap e7570 pool 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c' (14) object 'rbd_directory' -> pg 14.30a98c1c (14.1c) -> up ([36,0,38], p36) acting ([36,0,38], p36)
# ceph osd find osd.36
{
"osd": 36,
"ip": "10.23.27.149:6826/7195",
"crush_location": {
"host": "ceph-ssd-s39",
"root": "default"
}
}
Checking latencies:
<pre><code class="text">
ssh ceph-ssd-s39
# nuttcp -w1m ceph-mon-s43
11186.3391 MB / 10.00 sec = 9381.8890 Mbps 12 %TX 32 %RX 0 retrans 0.15 msRTT
# time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c --rbd_concurrent_management_ops=1 --id xen_test
NAME SIZE PARENT FMT PROT LOCK
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE 2
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE 2 yes
...
RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE 2
__srlock__ 0 2
real 0m23.667s
user 0m15.949s
sys 0m1.276s
# time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c --rbd_concurrent_management_ops=1 --id xen_test
NAME SIZE PARENT FMT PROT LOCK
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE 2
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE 2 yes
...
RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE 2
....
__srlock__ 0 2
real 0m13.937s
user 0m14.404s
sys 0m1.089s
</code></pre>
#3 Updated by Jason Dillaman almost 6 years ago
- Priority changed from Normal to Low
#4 Updated by Marc Schöchlin almost 6 years ago
- File perf.data.gz added
- File perf.data.txt.gz added
#5 Updated by Marc Schöchlin almost 6 years ago
Added perf data, created by:
<pre><code class="text">
apt-get install linux-tools-4.13.0-39-generic linux-cloud-tools-4.13.0-39-generic linux-tools-generic linux-cloud-tools-generic
perf record -F 99 -g rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c --id xen_test
perf report -n --stdio > perf.data.txt
</code></pre>
#6 Updated by Jason Dillaman almost 6 years ago
- Status changed from New to Need More Info
@Marc: can you re-run this using the following steps (making sure you have the debuginfo packages installed as well)?
perf record -F 99 -g --call-graph dwarf rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c --id xen_test
perf script > out.script
Please attach the perf.data and out.script results.
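On Ubuntu, one quick way to verify that the debug symbol packages are actually installed could be (a sketch; the grep pattern covers the common Ceph package names, not an authoritative list):
<pre><code class="text">
# Sketch: list installed Ceph-related debug symbol packages on Ubuntu/Debian.
dpkg -l | grep -E '(ceph|librados|librbd|rbd).*-dbg'
</code></pre>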
#7 Updated by Marc Schöchlin almost 6 years ago
Sorry, the results exceed the maximum file upload limit, so I uploaded the files to my personal server:
https://www.256bit.org/nextcloud/index.php/s/npi7ppNpojbAj9k (will be removed after 25-07-2018)
Password: oChoo4Ey%
#8 Updated by Jason Dillaman almost 6 years ago
@Marc: can you double-check that you have the debuginfo packages installed? The perf graph (attached) shows that the vast majority of the time is spent invoking the scheduler. Perhaps repeat the test, but also append "--rbd-cache=false" to the command line.
#9 Updated by Jason Dillaman almost 6 years ago
@Marc: any update?
#10 Updated by Marc Schöchlin almost 6 years ago
- File perf.data.txt View added
- File perf.data added
Sorry, I missed that mail.
I am currently on vacation until 9.6.2018.
Nevertheless, I created an additional trace for you:
apt-get install linux-tools-4.13.0-39-generic linux-cloud-tools-4.13.0-39-generic linux-tools-generic linux-cloud-tools-generic ceph-common-dbg librbd1-dbg rbd-nbd-dbg librados2-dbg
perf record -F 99 -g rbd ls -l --pool RBD_XenStorage-TEST-SSD-07449252-bf96-4daa-b0a6-687b7f1c369c --rbd_concurrent_management_ops=1 --rbd-cache=false
perf report -n --stdio > perf.data.txt
#11 Updated by Jason Dillaman almost 6 years ago
@Marc: was it much faster with the RBD cache disabled? If so, this issue has been addressed in Mimic, since we now automatically disable the cache on "rbd ls" (so you shouldn't see thousands of thread creations).
Also, make sure you run this to collect the perf:
perf record -F 99 -g --call-graph dwarf rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c --id xen_test
perf script > out.script
#12 Updated by Marc Schöchlin almost 6 years ago
- File perf.data.gz added
- File out.script.gz added
No, enabling/disabling the cache makes no difference; the execution times are practically identical.
I added the perf results.
# for i in `seq 1 10`; do time rbd ls -l -p RBD_XenStorage-TEST-SSD-07449252-bf96-4daa-b0a6-687b7f1c369c --rbd-cache=false >/dev/null; done
real 0m3.827s
user 0m3.697s
sys 0m0.143s
real 0m3.844s
user 0m3.649s
sys 0m0.238s
real 0m3.824s
user 0m3.677s
sys 0m0.220s
real 0m3.910s
user 0m3.764s
sys 0m0.190s
real 0m3.818s
user 0m3.604s
sys 0m0.236s
real 0m3.800s
user 0m3.639s
sys 0m0.196s
real 0m3.802s
user 0m3.672s
sys 0m0.147s
real 0m3.822s
user 0m3.660s
sys 0m0.185s
real 0m3.875s
user 0m3.660s
sys 0m0.234s
real 0m3.856s
user 0m3.704s
sys 0m0.214s
# for i in `seq 1 10`; do time rbd ls -l -p RBD_XenStorage-TEST-SSD-07449252-bf96-4daa-b0a6-687b7f1c369c --rbd-cache=true >/dev/null; done
real 0m3.878s
user 0m3.722s
sys 0m0.239s
real 0m3.908s
user 0m3.766s
sys 0m0.256s
real 0m3.908s
user 0m3.716s
sys 0m0.263s
real 0m3.871s
user 0m3.719s
sys 0m0.246s
real 0m3.886s
user 0m3.658s
sys 0m0.294s
real 0m3.877s
user 0m3.697s
sys 0m0.266s
real 0m3.926s
user 0m3.720s
sys 0m0.303s
real 0m3.972s
user 0m3.691s
sys 0m0.272s
real 0m3.834s
user 0m3.703s
sys 0m0.239s
real 0m3.851s
user 0m3.727s
sys 0m0.229s
#13 Updated by Jason Dillaman almost 6 years ago
I don't see any long CPU hotspots in your perf results. I can only imagine you are seeing latency artifacts, since each image open requires numerous back-and-forth operations to the OSDs.
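One way to sanity-check that explanation would be to time a single-image query and extrapolate; a sketch using one image name taken from this report:
<pre><code class="text">
# Sketch: approximate the per-image cost of "rbd ls -l" by timing one
# single-image query and scaling by the ~105 listed entries
# (assuming the opens were fully serialized).
time rbd info -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
  RBD-0192938e-cb4b-4ee1-9988-b8145704ac81
</code></pre>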
#14 Updated by Jason Dillaman over 5 years ago
- Status changed from Need More Info to Resolved