Bug #17297
open
high cpu usage for ceph-fuse (>150%)
Description
Hi,
We noticed that our CephFS deployment is very slow. For instance, if we extract the kernel source onto a CephFS mounted with ceph-fuse, the ceph-fuse process uses almost two cores and the extraction takes ~5 minutes.
This is the output from sysdig, which shows that almost every fifth futex() call times out. Any suggestions on what we should look at first to debug this slowness?
# sysdig 'proc.name=ceph-fuse and evt.latency > 10000000' -p '%evt.type --> %evt.latency.human)'
futex --> 28ms)
futex --> 5.00s)
futex --> 14ms)
futex --> 15ms)
futex --> 11ms)
futex --> 11ms)
futex --> 5.00s)
futex --> 11ms)
futex --> 15ms)
futex --> 14ms)
futex --> 14ms)
futex --> 16ms)
futex --> 19ms)
futex --> 5.00s)
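The "every fifth futex()" estimate can be checked by tallying the output above. The following is a minimal Python sketch, not part of the original report. It assumes each line has the shape `<event> --> <latency><unit>)` produced by the format string in the sysdig command, and that the unit suffixes are limited to ns, us, ms and s. The function names and the 1-second threshold are illustrative choices.

```python
import re

# Matches lines shaped like the sysdig output in this report, e.g. "futex --> 28ms)".
LINE_RE = re.compile(r"^(\w+) --> ([\d.]+)(ns|us|ms|s)\)?$")

# Conversion factors to seconds for the unit suffixes handled here (an assumption).
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

# The 14 events quoted in the report, copied verbatim.
SAMPLE = """\
futex --> 28ms)
futex --> 5.00s)
futex --> 14ms)
futex --> 15ms)
futex --> 11ms)
futex --> 11ms)
futex --> 5.00s)
futex --> 11ms)
futex --> 15ms)
futex --> 14ms)
futex --> 14ms)
futex --> 16ms)
futex --> 19ms)
futex --> 5.00s)
"""


def parse_latencies(text):
    """Return a list of (event_type, latency_in_seconds) tuples."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, value, unit = match.groups()
            events.append((name, float(value) * UNIT_SECONDS[unit]))
    return events


def timeout_fraction(events, threshold_seconds=1.0):
    """Fraction of events whose latency is at or above the threshold."""
    if not events:
        return 0.0
    slow = sum(1 for _, latency in events if latency >= threshold_seconds)
    return slow / len(events)


events = parse_latencies(SAMPLE)
print(f"{len(events)} events, {timeout_fraction(events):.0%} at or above 1 s")
```

On the quoted sample, 3 of the 14 events are the 5.00s waits, which is roughly one in five. Note that the sysdig filter only keeps events slower than 10 ms, so this ratio describes the slow futex calls, not all of them.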
Thank you.
Updated by Greg Farnum over 7 years ago
What version of ceph-fuse are you currently running? What config options have you set?
Do you have any evidence that it's the futexes in particular which are taking up CPU time? Futexes are meant to be fast, so I wouldn't assume the time is being wasted there unless you have some evidence of it. :)
I think by default we are using kernel-enforced permissions; changing that may improve things (fuse_default_permissions = false) and is probably okay in Jewel (but isn't default, so your mileage may vary).
Updated by Donatas Abraitis over 7 years ago
ceph-fuse version:
# rpm -qa | grep ceph-fuse
ceph-fuse-10.2.2-0.el7.x86_64
ceph-fuse process:
root 6793 2.0 0.1 2017400 377908 ? Sl 14:58 5:18 ceph-fuse --name=client.cephfuse-client.xxx.io /home -o nonempty,rw
/etc/ceph/ceph.conf (client section):
[client]
fuse default permissions = 0
client acl type = posix_acl
0 == false?
Updated by Donatas Abraitis over 7 years ago
Greg Farnum, nothing related to "slow" is reported in the OSD logs and the cluster status is HEALTH_OK, but the slowness is still disappointing. What would you recommend looking at first?
Updated by Donatas Abraitis over 7 years ago
Just tried to disable quotas, but https://github.com/ceph/ceph/blob/a033dc6f5b4cef357db6f5951062d680e880ba0e/src/client/Client.cc#L12470 is still being hit on every read/write. Or maybe ceph-fuse ignores the [client] section of /etc/ceph/ceph.conf and needs run-time parameters?
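One way to check whether a running ceph-fuse picked up a [client] setting is to query its admin socket with `ceph --admin-daemon <socket> config get <option>`. The Python sketch below is only an illustration and was not run against this cluster. It assumes the client has an admin socket enabled, that the socket path is known (the path in the usage comment is a hypothetical placeholder), and that the command prints a JSON object keyed by the option name. The helper names are made up for this example.

```python
import json
import subprocess


def config_get_command(asok_path, option):
    """Build the argv for querying one option over a daemon's admin socket."""
    return ["ceph", "--admin-daemon", asok_path, "config", "get", option]


def parse_config_get(stdout_text, option):
    """Extract the option's value, assuming JSON output keyed by option name."""
    return json.loads(stdout_text)[option]


def read_running_option(asok_path, option):
    """Ask the running client for the value it is actually using."""
    result = subprocess.run(
        config_get_command(asok_path, option),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_config_get(result.stdout, option)


# Usage (hypothetical socket path; adjust to the actual client socket):
# read_running_option("/var/run/ceph/ceph-client.example.asok",
#                     "fuse_default_permissions")
```

If the returned value differs from what ceph.conf sets, the client is not reading that section, or the option was overridden at start-up.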
Updated by Greg Farnum over 7 years ago
Well, that function aborts if quota is disabled; it still gets called into.
Anyway I tried it locally with linux-4.0.5.tar.xz and it took me 8 minutes on a vstart instance. I think that's just how long that many metadata queries take right now.
Updated by Donatas Abraitis over 7 years ago
# dd if=/dev/zero of=/home/testas/1G bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 26.3153 s, 40.8 MB/s
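For reference, the rate dd reports uses decimal megabytes (10^6 bytes). A small Python sketch of the arithmetic, using only the byte count and elapsed time from the output above (the function name is illustrative):

```python
def throughput(bytes_copied, seconds):
    """Return (MB/s, MiB/s) for a transfer of bytes_copied in seconds."""
    bytes_per_second = bytes_copied / seconds
    return bytes_per_second / 1e6, bytes_per_second / 2**20


mb_s, mib_s = throughput(1073741824, 26.3153)
print(f"{mb_s:.1f} MB/s, {mib_s:.1f} MiB/s")  # prints "40.8 MB/s, 38.9 MiB/s"
```

This is a single 1 GiB direct write, so it measures sequential data throughput and says little about the metadata-heavy workload of extracting a kernel tarball.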