Bug #6040
closed
Significant slowdown of osds since v0.67 Dumpling
Added by Oliver Daudey over 10 years ago.
Updated over 10 years ago.
Description
I'm running a Ceph cluster with 3 nodes, each of which runs a mon, osd and mds. I'm using RBD on this cluster as storage for KVM; CephFS is unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100+MB/sec on simple linear writes to a file with `dd' inside a VM on this cluster under regular load, and the osds usually averaged 20-100% CPU utilisation in `top'. After the upgrade to Dumpling, CPU usage for the osds shot up to 100-400% in `top' (multi-core system) and the speed of my writes with `dd' inside a VM dropped to 20-40MB/sec. Users complained that disk access inside the VMs was significantly slower, and the backups of the RBD store that I was running also fell behind quickly.
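For reference, a linear-write test of the kind described could look like the following; the block size, count and target file are my assumptions for illustration, not the exact invocation from this report:

# Simple linear-write test inside a VM; bs/count/path are
# illustrative assumptions, not the reporter's exact command.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct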
After downgrading only the osds to v0.61.7 Cuttlefish and leaving the rest at 0.67 Dumpling, speed and load returned to normal. I have reproduced this performance hit on upgrade on a similar test cluster under no additional load at all. Although CPU usage for the osds wasn't as dramatic during these tests, because there was no base load from other VMs, I/O performance dropped significantly after upgrading in these tests as well, and returned to normal after downgrading the osds.
I'm not sure what to make of it. There are no visible errors in the logs and everything runs and reports good health; it's just a lot slower, with a lot more CPU usage.
Kernel: SMP Debian 3.2.46-1~bpo60+1 x86_64 GNU/Linux on Debian Squeeze.
QEMU/KVM: 1:1.1.2+dfsg-2~bpo60+1, recompiled with RBD-support.
Storage for osds on XFS, mount-opts: attr2,noquota,noatime,nodiratime
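As an example, an XFS data partition with those options might be mounted like this; the device and mountpoint are placeholders, not taken from the report:

# Placeholder device/mountpoint; the mount options match those above.
mount -t xfs -o attr2,noquota,noatime,nodiratime /dev/sdb1 /mnt/data1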
For completeness, here is the relevant part of "ceph.conf"; the rest just defines a standard 3-node cluster, with a mon, mds and osd on each node.
[global]
auth supported = cephx
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
[osd]
osd crush update on start = false
osd data = /mnt/data1/ceph/osd/$name
osd journal = /var/lib/ceph/osd/$name/journal
osd journal size = 10000
# uncomment the following line if you are mounting with ext4
# filestore xattr use omap = true
[mon]
mon data = /mnt/data1/ceph/mon/$cluster-$id
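As a sanity check, the effective values of such options on a running osd can be read back through the admin socket; this sketch assumes the default socket path and uses osd.0 as an example:

# Show the running config of osd.0 via its admin socket (default path
# assumed) and filter for the journal settings.
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep journal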
- Assignee set to Samuel Just
wip-dumpling-pglog-undirty may help with this.
- Priority changed from High to Urgent
Merged wip-dumpling-pglog-undirty (with the config set to false) into next and dumpling.
- Status changed from New to Resolved
From ceph-users:
Hey Samuel,
I picked up 0.67.1-10-g47c8949 from the GIT-builder and the osd from
that seems to work great so far. I'll have to let it run for a while
longer to really be sure it fixed the problem, but it looks promising,
not taking any more CPU than the Cuttlefish-osds. Thanks! I'll get
back to you.
Regards,
Oliver
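For anyone wanting to test the same kind of build: a rough sketch of pulling a gitbuilder package on Debian Squeeze and restarting only the local osds. The repository URL layout and branch are assumptions from memory, not taken from this report:

# Assumed gitbuilder repo layout for squeeze/x86_64; adjust the ref as needed.
echo "deb http://gitbuilder.ceph.com/ceph-deb-squeeze-x86_64-basic/ref/dumpling squeeze main" \
  > /etc/apt/sources.list.d/ceph-gitbuilder.list
# The gitbuilder repo was unsigned, hence --force-yes (squeeze-era apt).
apt-get update && apt-get install -y --force-yes ceph
# Restart only the local osds, leaving mon/mds untouched.
service ceph restart osd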
- Status changed from Resolved to Pending Backport
- Status changed from Pending Backport to Resolved