Bug #20054

librbd memory overhead when used with KVM

Added by Sebastian Nickel almost 7 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags:
Backport: luminous,mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
we are running a jewel Ceph cluster which serves RBD volumes for our KVM virtual machines. Recently we noticed that our KVM machines use a lot more memory on the physical host than they should. We collect the data with a Python script which basically executes 'virsh dommemstat <virtual machine name>'. We also verified the script's results against the memory stats from 'cat /proc/<kvm PID>/status' for each virtual machine, and the numbers match.
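
For reference, here is a minimal sketch of that kind of collection (hypothetical, not the actual script from this report; it assumes the 'actual' and 'rss' fields that `virsh dommemstat` prints in KiB):

```python
#!/usr/bin/env python3
# Hypothetical sketch of the collection described above (not the original
# script): derive the overhead from the 'actual' and 'rss' fields (in KiB)
# that `virsh dommemstat <domain>` prints.
import subprocess

def dommemstat(domain):
    out = subprocess.run(["virsh", "dommemstat", domain],
                         capture_output=True, text=True, check=True).stdout
    return {k: int(v) for k, v in
            (line.split() for line in out.splitlines() if len(line.split()) == 2)}

def overhead(domain):
    stats = dommemstat(domain)
    actual_kib, rss_kib = stats["actual"], stats["rss"]
    over_kib = max(rss_kib - actual_kib, 0)
    return over_kib, actual_kib, 100 * over_kib // actual_kib, rss_kib

if __name__ == "__main__":
    # "example-vm" is a placeholder domain name
    over, actual, pct, rss = overhead("example-vm")
    print(f"{over / 2**10:9.1f} MiB {actual / 2**10:9.1f} MiB {pct:4d} {rss / 2**10:9.1f} MiB")
```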

Here is an excerpt for one physical host where all virtual machines have been
running since the 26th of April (virtual machine names removed):

"""
overhead     actual       percent_overhead  rss
----------   --------   ------------------  ---------
1.4 GiB      1.0 GiB                   141  2.4 GiB
0.0 MiB      1.0 GiB                     0  950.5 MiB
1.4 GiB      1.0 GiB                   141  2.4 GiB
0.0 MiB      2.0 GiB                     0  1.8 GiB
986.9 MiB    1.0 GiB                    96  2.0 GiB
0.0 MiB      4.0 GiB                     0  2.9 GiB
723.7 MiB    2.0 GiB                    35  2.7 GiB
0.0 MiB      8.0 GiB                     0  4.8 GiB
947.3 MiB    1.0 GiB                    92  1.9 GiB
2.5 GiB      1.0 GiB                   250  3.5 GiB
3.0 GiB      2.0 GiB                   151  5.0 GiB
2.5 GiB      2.0 GiB                   123  4.5 GiB
54.5 MiB     2.0 GiB                     2  2.1 GiB
1.2 GiB      1.0 GiB                   124  2.2 GiB
2.9 GiB      1.0 GiB                   286  3.9 GiB
2.1 GiB      2.0 GiB                   104  4.1 GiB
2.2 GiB      1.0 GiB                   220  3.2 GiB
1.6 GiB      1.0 GiB                   163  2.6 GiB
1.9 GiB      2.0 GiB                    94  3.9 GiB
2.2 GiB      1.0 GiB                   223  3.2 GiB
1.5 GiB      1.0 GiB                   148  2.5 GiB
934.6 MiB    2.0 GiB                    45  2.9 GiB
1.3 GiB      1.0 GiB                   134  2.3 GiB
192.4 MiB    1.0 GiB                    18  1.2 GiB
939.4 MiB    2.0 GiB                    45  2.9 GiB
"""

We are using the RBD client cache for our virtual machines, but it is set to
only 128 MB per machine. There is also only one RBD volume per virtual machine.
We have seen more than 200% memory overhead per KVM machine on other physical
hosts as well. After a live migration of a virtual machine to another host the
overhead drops back to 0 and then slowly climbs back up to high values.

Here are our ceph.conf settings for the clients:
"""
[client]
rbd cache writethrough until flush = False
rbd cache max dirty = 100663296
rbd cache size = 134217728
rbd cache target dirty = 50331648
"""

We have noticed this behavior since we started using the jewel librbd libraries; we did
not encounter it with the infernalis librbd version. We also do not see this
issue when using local storage instead of Ceph (there the overhead is about 10%).

Some version information of the physical host which runs the KVM machines:
"""
OS: Ubuntu 16.04
kernel: 4.4.0-75-generic
librbd: 10.2.7-1xenial
"""

We did try to flush and invalidate the client cache via the ceph admin socket,
but this did not change any memory usage values.

I was also able to reproduce the issue with a test VM and the help of 'fio':

"""
# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
randrepeat=0
filename=/root/test.dat
# IOMeter defines the server loads as the following:
# iodepth=1     Linear
# iodepth=4     Very Light
# iodepth=8     Light
# iodepth=64    Moderate
# iodepth=256   Heavy
iodepth=8
size=80g
direct=0
ioengine=libaio

[iometer]
stonewall
bs=4M
rw=randrw

[iometer_just_write]
stonewall
bs=4M
rw=write

[iometer_just_read]
stonewall
bs=4M
rw=read
"""

Then I let it run for some time via:
$> while true; do fio stress.fio; rm /root/test.dat; done

I was able to get an overhead of 85% (2.5 GB for a 3GB VM).

Is this overhead unavoidable, or can it be minimized somehow?

20054-ceph-qemu-rbd-overhead.xlsx (11 KB) Christian Theune, 12/12/2017 11:34 AM

20054-ceph-qemu-rbd-overhead.pdf (29.2 KB) Christian Theune, 12/12/2017 11:34 AM

direct_1_cache_off_alloc_space.txt (5.94 KB) Li Yichao, 03/23/2018 01:38 PM

direct_1_cache_off_inuse_space.txt (6.32 KB) Li Yichao, 03/23/2018 01:38 PM

direct_0_cache_on_inuse_space.txt (6.84 KB) Li Yichao, 03/23/2018 01:38 PM

direct_0_cache_on_alloc_space.txt (5.13 KB) Li Yichao, 03/23/2018 01:38 PM

direct_1_cache_on_inuse_space.txt (6.77 KB) Li Yichao, 03/23/2018 01:38 PM

direct_1_cache_on_alloc_space.txt (5.62 KB) Li Yichao, 03/23/2018 01:38 PM

History

#1 Updated by Sebastian Nickel almost 7 years ago

Sorry, looks like I got some formatting issues there. Here again the overhead table:

overhead     actual       percent_overhead  rss
----------   --------   ------------------  ---------
1.4 GiB      1.0 GiB                   141  2.4 GiB
0.0 MiB      1.0 GiB                     0  950.5 MiB
1.4 GiB      1.0 GiB                   141  2.4 GiB
0.0 MiB      2.0 GiB                     0  1.8 GiB
986.9 MiB    1.0 GiB                    96  2.0 GiB
0.0 MiB      4.0 GiB                     0  2.9 GiB
723.7 MiB    2.0 GiB                    35  2.7 GiB
0.0 MiB      8.0 GiB                     0  4.8 GiB
947.3 MiB    1.0 GiB                    92  1.9 GiB
2.5 GiB      1.0 GiB                   250  3.5 GiB
3.0 GiB      2.0 GiB                   151  5.0 GiB
2.5 GiB      2.0 GiB                   123  4.5 GiB
54.5 MiB     2.0 GiB                     2  2.1 GiB
1.2 GiB      1.0 GiB                   124  2.2 GiB
2.9 GiB      1.0 GiB                   286  3.9 GiB
2.1 GiB      2.0 GiB                   104  4.1 GiB
2.2 GiB      1.0 GiB                   220  3.2 GiB
1.6 GiB      1.0 GiB                   163  2.6 GiB
1.9 GiB      2.0 GiB                    94  3.9 GiB
2.2 GiB      1.0 GiB                   223  3.2 GiB
1.5 GiB      1.0 GiB                   148  2.5 GiB
934.6 MiB    2.0 GiB                    45  2.9 GiB
1.3 GiB      1.0 GiB                   134  2.3 GiB
192.4 MiB    1.0 GiB                    18  1.2 GiB
939.4 MiB    2.0 GiB                    45  2.9 GiB

and the fio file used:

# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
randrepeat=0
filename=/root/test.dat
# IOMeter defines the server loads as the following:
# iodepth=1     Linear
# iodepth=4     Very Light
# iodepth=8     Light
# iodepth=64    Moderate
# iodepth=256   Heavy
iodepth=8
size=80g
direct=0
ioengine=libaio

[iometer]
stonewall
bs=4M
rw=randrw

[iometer_just_write]
stonewall
bs=4M
rw=write

[iometer_just_read]
stonewall
bs=4M
rw=read

#2 Updated by Brendan Moloney almost 7 years ago

I am seeing the same issue with my Jewel cluster. Plenty of VMs with over 100% memory overhead.

Can we get a priority bump on this issue? Using Ceph RBD for VMs seems like one of the most common use cases, and this is a major problem for those users.

#3 Updated by Jason Dillaman almost 7 years ago

  • Status changed from New to Need More Info

If anyone can provide an example job reproducing this with fio utilizing the direct rbd engine (i.e. take QEMU out-of-the-loop), it would be greatly appreciated. Include OS, librbd, and fio versions for reproducibility. Thanks.

#4 Updated by Sebastian Nickel almost 7 years ago

I tried to reproduce this issue with 'fio' (no qemu in the loop) over the weekend, but I was not able to get the same result as with qemu-kvm. fio was constantly using almost the same amount of memory.

I was using the following fio file:
```
# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
randrepeat=0
# IOMeter defines the server loads as the following:
# iodepth=1     Linear
# iodepth=4     Very Light
# iodepth=8     Light
# iodepth=64    Moderate
# iodepth=256   Heavy
iodepth=8
size=80g
direct=0
ioengine=rbd

clustername=ceph
rbdname=<rbd name>
pool=<pool name>
clientname=<client>

[iometer_randrw_1]
time_based
runtime=720000
bs=4M
rw=randrw

[iometer_randrw_2]
time_based
runtime=720000
bs=8M
rw=randrw

[iometer_just_read_1]
time_based
runtime=720000
bs=4M
rw=read

[iometer_just_read_2]
time_based
runtime=720000
bs=8M
rw=read
```

This fio config file uses threads to accomplish the tasks. I am not sure whether 'iodepth=8' is taken into account by fio, as I do not know if the 'rbd' engine supports it. The whole fio process used around 2.5 GB the entire time (sometimes spiking to 2.8 GB, but going back down as well). The more "tasks" I added to the fio file, the more memory was used, but I think this is normal as every thread might use its own client cache. I was again using the 128 MB client cache configured in ceph.conf.

This was tested with librbd on a Ubuntu Xenial Host (librbd package version 10.2.7-0ubuntu0.16.04.1).

Am I testing something wrong here or is the whole issue only related to qemu when using librbd?


#5 Updated by Sebastian Nickel almost 7 years ago

Sorry, again the formatting issue. Here the fio file again:

# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
randrepeat=0
# IOMeter defines the server loads as the following:
# iodepth=1     Linear
# iodepth=4     Very Light
# iodepth=8     Light
# iodepth=64    Moderate
# iodepth=256   Heavy
iodepth=8
size=80g
direct=0
ioengine=rbd

clustername=ceph
rbdname=<rbd name>
pool=<pool name>
clientname=<client>

[iometer_randrw_1]
time_based
runtime=720000
bs=4M
rw=randrw

[iometer_randrw_2]
time_based
runtime=720000
bs=8M
rw=randrw

[iometer_just_read_1]
time_based
runtime=720000
bs=4M
rw=read

[iometer_just_read_2]
time_based
runtime=720000
bs=8M
rw=read

#6 Updated by Jason Dillaman almost 7 years ago

@Sebastian: Yes, the rbd engine definitely uses the iodepth setting. If running multiple jobs against the same image, just make sure you have the exclusive-lock feature disabled since otherwise your IO will be painfully slow as it ping-pongs the lock around the clients.

#7 Updated by Sebastian Nickel over 6 years ago

@Jason: I did the test again with an image without exclusive-lock and object-map. I also changed iodepth to 16. I got higher throughput and my fio process now used 2.8 GB RSS the whole time (spiking to 3 GB sometimes). I then went on and disabled the librbd cache via /etc/ceph/ceph.conf. I ran the fio test again and now my fio process used around 1.8 GB the whole time (almost no spikes this time). That is a difference of 1 GB. I configured a 128 MB client cache in ceph.conf, so with 4 parallel jobs my fio process should use 512 MB for caches. This leaves me with 512 MB more usage than expected...

Still, it looks like the memory usage stays the same all the time, which is different from what I see when using qemu-kvm with rbd (where it grows steadily).

#8 Updated by Sebastian Nickel over 6 years ago

As I cannot reproduce the high memory usage when using fio directly, I created a bug report on the qemu side ([1]). I tested qemu 2.9 yesterday as well and noticed the same issue.

[1] https://bugs.launchpad.net/qemu/+bug/1701449

#9 Updated by Jason Dillaman over 6 years ago

@Sebastian: just to eliminate the guest OS as a possibility, would it be possible for you to re-run using qemu-tcmu or qemu-nbd to map an RBD image to a local block device and run the original libaio fio test?

#10 Updated by Sebastian Nickel over 6 years ago

@Jason: I did the same test with my RBD image mapped to the hypervisor via rbd-nbd (with ceph caching enabled). When executing the fio test in the VM, the RSS value of the VM stays (exactly) the same all the time: 1396.38 MB (the VM has 3 GB of memory). Even when I use all the memory in the VM (via 'stress -m <number>') and run fio in parallel, the RSS value of the VM (on the hypervisor) stays at 3144 MB and does not climb up. So that is how it should look, in my opinion :)

The rbd-nbd process varies between 600 MB and 800 MB RSS during the fio test. This is a lot more than the 128 MB it should use given the cache configuration, but the memory consumption appears to stay at that level and does not grow.

#11 Updated by Sebastian Nickel over 6 years ago

Please ignore the strikethrough in the previous comment...formatting got me again :-)

#12 Updated by Sebastian Nickel over 6 years ago

Any news on this or anything I can do?

#13 Updated by Jason Dillaman over 6 years ago

@Sebastian: I need a reproducer for librbd to ensure that we are not chasing a QEMU issue.

#14 Updated by Christian Theune over 6 years ago

I think we're affected by this as well. @jason: in what form would a reproduction be helpful at this point? The Qemu people also seem baffled by this.

#15 Updated by Jason Dillaman over 6 years ago

@Christian: I need a repeatable/reproducible way to generate the high RSS. If it can be reproduced via qemu-nbd and fio, that would be the best since it eliminates the guest OS / virtualization stack as a possibility and pin-points the issue to QEMU block or librbd.

#16 Updated by Christian Theune over 6 years ago

I wasn't able to reproduce this at all. However, here's something I just started doing with some help from a Qemu developer: I'm taking a core dump of one of the VMs that has insanely high memory usage and extracting the various memory regions shown in /proc/*/smaps. We're definitely seeing a lot of RSS/PSS memory allocated besides the main allocation for the guest memory.

Starting to poke around, I do see a lot of largish (tens of MiB) regions whose content tells me this is something Ceph-related: stuff that looks like internal structures, maybe some logging data, and a lot of disk content from the VM. Many of those allocations are also actually used (virtual size vs. RSS).

Could the content of those memory regions (I can't pass those on directly as they contain customer data) help you in any way to figure out what happens?

I'm going to count the Ceph-related regions and add them up later today or maybe tomorrow morning and get you some statistical data about that VM.

#17 Updated by Christian Theune over 6 years ago

@jason: Alright.

I extracted all memory regions from a running VM by taking a core dump and looking at the smaps segments; I saved each segment as an individual file together with its RSS metadata (compared to the VM size). I then took all files that contained either 'librbd' or 'rbd_data' and moved them to a separate directory. What is noticeable is that all remaining files/allocations (aside from the main memory allocation for the guest) have a really small RSS (i.e. < 1.2 MiB). The allocations that include Ceph data amount to about 750,648 KiB across ~60 allocations. Their RSS ranges from 12 KiB to 57 MiB each; the virtual allocation sizes are usually between 20 MiB and 60 MiB, with around 5 items having a VSS smaller than 10 MiB.

I think this is a relatively good hint that there is too much memory used by Ceph and I'd be happy to provide you with as much information as you need to help diagnose this further.
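
A rough sketch of that kind of bookkeeping (hypothetical, not the tooling actually used here): walk /proc/<pid>/smaps and report the large anonymous mappings and their RSS, which is where the Ceph-looking data showed up. Content inspection (grepping a core dump for 'librbd' / 'rbd_data') remains a manual step.

```python
#!/usr/bin/env python3
# Hypothetical sketch: list anonymous mappings of a QEMU process from
# /proc/<pid>/smaps together with their RSS, so large non-guest-RAM
# allocations stand out.
import re
import sys

HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) (\S+) \S+ \S+ \S+\s*(.*)$")

def anon_regions(pid, min_rss_kib=1024):
    regions, current = [], None
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            m = HEADER.match(line)
            if m:
                if current:
                    regions.append(current)
                start, end = int(m.group(1), 16), int(m.group(2), 16)
                current = {"size_kib": (end - start) // 1024,
                           "rss_kib": 0,
                           "name": m.group(4)}
            elif current and line.startswith("Rss:"):
                current["rss_kib"] = int(line.split()[1])
    if current:
        regions.append(current)
    # keep anonymous mappings (no file name) with a non-trivial RSS
    return [r for r in regions if not r["name"] and r["rss_kib"] >= min_rss_kib]

if __name__ == "__main__":
    pid = int(sys.argv[1])
    regions = sorted(anon_regions(pid), key=lambda r: -r["rss_kib"])
    for r in regions:
        print(f"rss {r['rss_kib'] / 1024:8.1f} MiB  vss {r['size_kib'] / 1024:8.1f} MiB")
    print(f"total rss of {len(regions)} regions: "
          f"{sum(r['rss_kib'] for r in regions) / 1024:.1f} MiB")
```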

#18 Updated by Jason Dillaman over 6 years ago

Christian: I really need a reproducer, since without the heap profiling I cannot determine where the allocations are occurring. Unless it's something trivial like thousands of threads (due to the simple messenger and its two threads per connection at ~2 MB per thread), I would have no idea where to look.

#19 Updated by Christian Theune over 6 years ago

Dang! :)

We did some calculations yesterday and came up with an average "leak" rate of 200 KiB per 10 minutes. That's averaged over VMs that have been running for around 127 days. I have tried a couple of things, from various fio workloads to making sure our guest agent instrumentation isn't doing something weird that trips things up.

I checked the thread count of the affected machine, which has around 127 threads in total in the Qemu process. The connection count for Ceph is around 32 connections, which means around 128 MiB of overhead. Interestingly, does that mean that larger Ceph clusters will have an increasing memory overhead for each VM as the number of OSDs (in a pool) grows?
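
As a sanity check of that estimate (simple messenger: two threads per OSD connection, roughly 2 MiB of thread stack each):

```python
# Back-of-the-envelope check of the overhead estimate above.
connections = 32          # OSD connections observed for this VM
threads_per_conn = 2      # simple messenger: reader + writer thread
stack_mib = 2             # rough per-thread stack cost
print(connections * threads_per_conn * stack_mib, "MiB")   # -> 128 MiB
```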

The last thing I can check right now is whether taking snapshots causes an issue.

Do you have any idea what else I could try to poke the machine with to trigger anything? Having to wait for 127 days makes for very slow debugging. ;)

Also, as I'm still running the (loudly) unsupported Hammer 0.94.10 (+ a patch to avoid snapshot inconsistency crashes), I did review all changes regarding memory leaks on the client side and in librbd.

Two that I can't eliminate that might play a role (honestly I don't think so, but hey) could be:

https://github.com/ceph/ceph/commit/4388805feb57f30c5785987e6c8951b54ce88d1b
https://github.com/ceph/ceph/commit/69bedcd692fd505f63d58a016585de7edbc396ad

Any chance I can eliminate those on my cluster?

#20 Updated by Christian Theune over 6 years ago

So I tried freeze/thaw/snapshot/delete-snapshot cycles while under heavy load from small IOs (4k) and large IOs (512k). Those caused temporary spikes in RSS, but the memory got released again, as I would expect from buffers filling and draining. Sigh.

#21 Updated by Jason Dillaman over 6 years ago

@Christian:

Interestingly does that mean that larger Ceph clusters will have an increasing memory overhead for each VM when the number of OSDs (in a pool) grows?

For old releases of Ceph, yes. In Luminous, the new async messenger is the default, which eliminates the thread-per-connection behavior and instead utilizes a fixed-size pool of threads.

Any chance I can eliminate those on my cluster?

Anything in "src/client" is for CephFS and the other patch is for "async" messenger which isn't used in your (EOLed) release of Ceph.

#22 Updated by Christian Theune over 6 years ago

Alright. I managed to reproduce this with a reasonably simple setup. I had to do this within Qemu, as fio on the host would mix in memory allocations for its own data structures; running it inside a VM let me a) vary multiple Qemu parameters and b) separate the guest memory (running fio) from whatever overhead came out on the host.

I'm attaching an Excel file (and PDF as I exported it from Numbers) that shows memory parameters as well as test runs within the VM.

I ran fio from within the VM, after booting it into a GRML live system. The VM has 4096 MiB of memory and I exhausted that with multiple runs of a simple Python program that allocated strings until it was killed by the OOM killer. I had a 10G raw image on the host and a 10G RBD image. The RBD cache was always disabled.

After that I ran either one or multiple tests with fio similar to this (depending on the parameters):

fio --filename=/dev/vda --name=test --rw=write --bs=<varying> --iodepth=<varying> --direct=1 --numjobs=1

fio identified itself and the test with:

test: (g=0): rw=write, bs=50M-50M/50M-50M/50M-50M, ioengine=psync, iodepth=64
fio-2.16

What catches my eye is:

  • with the qemu writeback cache enabled, RAM usage goes "through the roof" (and never goes down again)
  • this does not happen with writeback on a raw image
  • with writeback disabled, we still get a lot more overhead, but at a somewhat reasonable size. The cluster has 12 OSDs, so I would have expected the overhead to be much less: I assume 40 MiB for Qemu (which doesn't even show up for the raw image), 12*2*2 MiB = 48 MiB for the messenger, and no rbd cache, so ~100 MiB of overhead seems fine to me. The last 3 tests also show that memory usage keeps rising with consecutive runs.

So, here's a dump of all the setup I did to get this going.

Qemu (2.7), librbd 0.94.10:

$ cat >> sample.cfg <<__EOT__
[machine]
  type = "pc-i440fx-2.5" 
  accel = "kvm" 

[drive]
  index = "0" 
  media = "disk" 
  if = "virtio" 
  format = "rbd" 
  file = "rbd:rbd.hdd/sample:id=clancy" 
  aio = "native" 
  cache = "none" 

[drive]
  index = "1" 
  media = "disk" 
  if = "virtio" 
  format = "raw" 
  file = "/srv/test.raw" 
  aio = "native"" 
  cache = "none" 
__EOT__

$ rbd create --size 10000 rbd.hdd/fio_test
$ qemu-img create -f raw /srv/test.raw 10g
$ wget http://download.grml.org/grml64-full_2017.05.iso
$ qemu-system-x86_64   -m 4096 -smp 1 -machine pc-i440fx-2.5,accel=kvm -cdrom grml64-full_2017.05.iso  -readconfig sample.cfg -display vnc=172.20.3.61:1

I connected to the VNC and started grml with the default boot option and then pressed 'u' (for us keyboard) and 'q' for 'go to the shell'.

To touch as much guest memory inside the VM after booting I ran this:

$ python
x = []
while True:
    x.append("a" * 1024*1024)
<killed by oom>
x = []
while True:
    x.append("b" * 512*1024)
<killed by oom>

After that I took inventory of the Qemu process on the host by looking at /proc/<pid>/status, specifically the VmHWM and VmRSS fields. That's the value I listed in the "host baseline VmHWM" column.

After running the fio tests, I then listed the new VmHWM and VmRSS in the "after" columns.

Let me know if I can do anything else on this to help you reproduce. I'm also available for remote video conf sessions or similar if you want to poke my setup directly.

#24 Updated by Jason Dillaman over 6 years ago

@Christian: what does the "# of test in same VM" column represent? Also, FYI, when you configure QEMU in writeback mode, it will automatically enable the librbd in-memory cache [1].

[1] https://github.com/qemu/qemu/blob/398e6ad014df261d20d3fd6cff2cfbf940bac714/block/rbd.c#L642

#25 Updated by Christian Theune over 6 years ago

That # counts how often I ran multiple tests without killing the Qemu process or rebooting the guest, i.e. running the test on top of the previous test's environment.

That implicit enablement is interesting. It still causes way more memory use than expected. The default cache should be only 32 MiB, right?

#26 Updated by Jason Dillaman over 6 years ago

@Christian: The cache is zero-copy in that multiple (object size) extents can share a reference to the same backing memory allocation (50MB in this case). It's quite possible that individual 4MB (object size) sections are getting evicted from the cache, but as long as at least one extent of a given 50MB write remains, the 50MB allocation will remain in memory since the cache size accounting doesn't account for the raw buffer usage.
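
A toy model of that accounting gap (illustrative only, this is not librbd code): several object-sized extents share one large backing buffer, the cache accounts per extent, but the buffer is only freed once its last extent goes away.

```python
# Toy model of the accounting gap described above (not librbd code).
class BackingBuffer:
    """One large allocation backing a single 50 MiB write."""
    def __init__(self, size):
        self.size = size
        self.refs = 0

class Extent:
    """A 4 MiB (object-sized) cache extent referencing the shared buffer."""
    def __init__(self, backing, length):
        self.backing = backing
        self.length = length
        backing.refs += 1
    def evict(self):
        self.backing.refs -= 1

MiB = 2**20
buf = BackingBuffer(50 * MiB)
extents = [Extent(buf, 4 * MiB) for _ in range(12)]   # ~12 extents per 50 MiB write

# Evict all but one extent: the cache's accounted size drops to 4 MiB ...
for e in extents[:-1]:
    e.evict()
accounted = extents[-1].length

# ... but the 50 MiB backing allocation is still referenced, so it stays
# resident; the cache size limit never "sees" those extra ~46 MiB.
resident = buf.size if buf.refs > 0 else 0
print(f"accounted {accounted // MiB} MiB, resident {resident // MiB} MiB")
```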

#27 Updated by Christian Theune over 6 years ago

That's interesting. That does sound like an almost DoS-able exploit. Is the size of those allocations bounded in any way? Is there any way to "compact" this at runtime? Can we change the size of those 50 MiB chunks?

I'll check the actual allocation patterns after one of the writeback-enabled tests.

This leads me to the question: does this mean writeback isn't usually used and we accidentally ran into an exotic configuration?

#28 Updated by Florian Haas over 6 years ago

@Christian, I'll leave it to Jason to continue the conversation about memory allocation, but I can answer this one:

This leads me to the question: does this mean writeback isn't usually used and we accidentally ran into an exotic configuration?

It's not exotic at all. Most OpenStack cloud environments run with disk_cachemodes="network=writeback" in nova.conf, which means they enable RBD writeback for all Nova-managed libvirt/KVM instances.

#29 Updated by Christian Theune over 6 years ago

@Florian: So I guess extreme memory overhead (in the range of multiple GiBs) hasn't hit you on a noticeable scale?

#30 Updated by Florian Haas over 6 years ago

Christian Theune wrote:

@Florian: So I guess extreme memory overhead (in the range of multiple GiBs) hasn't hit you on a noticeable scale?

Haven't run into that, no. Doesn't mean it doesn't ever happen, of course, just that I haven't seen it. See any difference between jemalloc and tcmalloc, perhaps?

#31 Updated by Christian Theune about 6 years ago

Quick update from my side: we started disabling the writeback cache through the appropriate Qemu block device option and have found RAM consumption to be stable and within the expected boundaries now. My guess would be that the described 50 MiB allocation sizes plus fragmentation or incomplete freeing might have been the issue. For us this hasn't resulted in any noticeable drawbacks from a performance perspective.

#32 Updated by Li Yichao about 6 years ago

I've done 3 experiments and think the overhead is not due to rbd cache.

  • The experiments are based on the description of this issue: run fio with that config in a CentOS 7 VM with qemu 2.6 against a Luminous ceph cluster; VM memory is 3 GiB.
  • For each experiment, I ran the Google perftools heap profiler to see where memory is allocated: http://goog-perftools.sourceforge.net/doc/heap_profiler.html
  • For the heap profiles, --alloc_space is passed to pprof (alloc_space includes space that has already been released), because virsh dommemstat shows an almost ever-growing RSS; it seems that in some cases qemu does not return released memory to the host. I will attach both the --alloc_space and the --inuse_space pprof results.
  • Each experiment lasts only a short time (< 10 min), because the memory either goes up very quickly or stays the same for a while without changing. A long experiment is also running; see below.
  • Experiments:
    1. As in the description, direct=0 and rbd cache on. dommemstat quickly goes up to 3 GiB; in-use space is 380 MB and goes down to 110 MB at the last sampling point. pprof shows 1.5 GB allocated at qemu_try_memalign and 1.5 GB spent in librados.so.2.0.0. rbd_start_aio allocates 2252 MB, which includes the 1495 MB from qemu_try_blockalign and the copy in rbd_aio_write. So nearly no memory is spent in the rbd cache (only 1.19 MB).
    2. fio runs with direct=1 and rbd cache on. dommemstat stays near 1 GiB; in-use space is 372 MB and goes down to 88 MB at the last sampling point.
    3. fio runs with direct=1 and rbd cache off. The rbd cache is disabled by setting `rbd cache=false` in ceph.conf; the qemu cache mode is the default (writeback). dommemstat stays near 1 GiB; in-use space is 400 MB and goes down to 366 MB at the last sampling point.

Because the original profile.00xx.heap files are 1 MB and cannot be uploaded here, I will upload the output of pprof run against the original files. Since gperftools generates profile files periodically and it is tedious to run pprof on every heap file, I chose the file from when dommemstat first reaches its maximum value.

Two questions remain:
  • The value shown by `virsh dommemstat` or in `/proc/$pid/status` is high, even though `free -h` inside the VM and the heap profile of qemu both show small memory usage. Why?
  • The memory is held by qemu even after fio is stopped in the VM, the disks are synced, and the page cache is dropped (another small experiment), yet the heap dump of qemu shows small memory usage. So, where is the memory?

#33 Updated by Li Yichao about 6 years ago

The long experiment runs with rbd cache on and direct=1, to see whether memory goes up after a long time.

#34 Updated by Jason Dillaman over 5 years ago

  • Status changed from Need More Info to Fix Under Review
  • Backport set to luminous,mimic

#35 Updated by Jason Dillaman over 5 years ago

  • Status changed from Fix Under Review to Resolved
