Bug #3286
librbd, kvm, async io hang
Description
Fio hangs in a linux-2.6.32 vm on librbd when using direct and libaio, with ceph at c8721b956.
Libvirt configured with direct rbd access like:
<devices>
  <emulator>/usr/bin/kvm</emulator>
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw'/>
    <auth username='test1'>
      <secret type='ceph' uuid='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'/>
    </auth>
    <source protocol='rbd' name='rbd/test1:rbd_cache=1:rbd_cache_size=67108864'>
      <host name='10.200.1.1' port='6789'/>
    </source>
    <target dev='vda' bus='virtio'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
  </disk>
w/ an image of Debian Squeeze, kernel 2.6.32-5-amd64.
Using fio config:
# egrep -v '^(#|$)' /usr/share/doc/fio/examples/iometer-file-access-server
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=1
size=4g
ioengine=libaio
iodepth=64
...fio hangs consistently (tested 3 times with reboots between):
# fio /usr/share/doc/fio/examples/iometer-file-access-server
Jobs: 1 (f=1): [m] [6.6% done] [0K/0K /s] [0/0 iops] [eta 08m:05s]   ### hung here
...whereas this completes successfully:
fio <(sed -e's/direct=1/direct=0/' -e's/ioengine=libaio/ioengine=sync/' /usr/share/doc/fio/examples/iometer-file-access-server)
Associated revisions
ObjectCacher: fix off-by-one error in split
This error left a completion that should have been attached
to the right BufferHead on the left BufferHead, which would
result in the completion never being called unless the buffers
were merged before its original read completed. This would cause
a hang in any higher level waiting for a read to complete.
The existing loop went backwards (using a forward iterator),
but stopped when the iterator reached the beginning of the map,
or when a waiter belonged to the left BufferHead.
If the first list of waiters should have been moved to the right
BufferHead, it was skipped because at that point the iterator
was at the beginning of the map, which was the main condition
of the loop.
Restructure the waiters-moving loop to go forward in the map instead,
so it's harder to make an off-by-one error.
Possibly-fixes: #3286
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
ObjectCacher: fix off-by-one error in split
This error left a completion that should have been attached
to the right BufferHead on the left BufferHead, which would
result in the completion never being called unless the buffers
were merged before its original read completed. This would cause
a hang in any higher level waiting for a read to complete.
The existing loop went backwards (using a forward iterator),
but stopped when the iterator reached the beginning of the map,
or when a waiter belonged to the left BufferHead.
If the first list of waiters should have been moved to the right
BufferHead, it was skipped because at that point the iterator
was at the beginning of the map, which was the main condition
of the loop.
Restructure the waiters-moving loop to go forward in the map instead,
so it's harder to make an off-by-one error.
Possibly-fixes: #3286
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 2e862f4d183d8b57b43b0777737886f18f68bf00)
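For illustration only, here is a minimal C++ sketch of the pattern the commit message describes; it is not the actual ObjectCacher code, and the names (BufferHead, waitfor_read, split_waiters, Context) are simplified stand-ins. The buggy variant walked the waiter map backwards with a forward iterator and stopped when the iterator reached begin(), so waiters stored at the first key that belonged on the right-hand BufferHead could be skipped; walking forward from the split offset visits exactly the entries that need to move.

#include <cstdint>
#include <list>
#include <map>

// Simplified stand-ins for the real librbd/ObjectCacher types.
struct Context {
  virtual void finish(int r) = 0;
  virtual ~Context() {}
};

struct BufferHead {
  // offset within the object -> completions waiting for that range to be read
  std::map<uint64_t, std::list<Context*>> waitfor_read;
};

// Move every waiter at offset >= split_off from 'left' to 'right'.
// Iterating forward from lower_bound(split_off) covers all entries that
// belong to the right-hand BufferHead, including the first entry in the
// map, which the original backwards loop could leave behind.
void split_waiters(BufferHead& left, BufferHead& right, uint64_t split_off) {
  auto p = left.waitfor_read.lower_bound(split_off);
  while (p != left.waitfor_read.end()) {
    std::list<Context*>& dest = right.waitfor_read[p->first];
    dest.splice(dest.end(), p->second);  // hand the waiting completions over
    p = left.waitfor_read.erase(p);      // and remove them from the left
  }
}

A completion left behind on the left BufferHead is never called once its read finishes, which matches the symptom reported below: an I/O that never completes and a fio process stuck in D state.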
History
#1 Updated by Josh Durgin almost 11 years ago
Is your cluster entirely healthy?
If so, could you add :debug_rbd=20:debug_ms=1:log_file=/path/writeable/by/qemu.log to the libvirt rbd config and attach the resulting log file (you might need to explicitly allow qemu to write to that file in apparmor)?
#2 Updated by Chris Dunlop almost 11 years ago
# ceph health
HEALTH_OK
qemu.log attached, after this sequence:
# virsh dumpxml smhtest1 | grep rbd
<source protocol='rbd' name='rbd/smhtest1:rbd_cache=1:rbd_cache_size=67108864:debug_rbd=20:debug_ms=1:log_file=/tmp/qemu.log'>
# virsh start smhtest1
# ssh smhtest1
# cd /tmp
# time fio --output /tmp/fio1.out /usr/share/doc/fio/examples/iometer-file-access-server
Jobs: 1 (f=1): [m] [9.9% done] [0K/0K /s] [0/0 iops] [eta 05m:19s]   #### hung here
In another window:
# ssh smhtest1
# ps lw $(pidof fio)
F   UID   PID  PPID PRI  NI   VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0     0  1604  1588  20   0 28912 13048 -      D+   pts/0      0:35 fio --output /tmp/fio1.out /usr/share/doc/fio/examples/iometer-file-access-server
1     0  1605  1604  20   0 36544  5892 -      Ds   ?          0:00 fio --output /tmp/fio1.out /usr/share/doc/fio/examples/iometer-file-access-server
# date
Thu Oct 11 12:42:23 EST 2012
# uptime
 12:42:23 up 21 min,  2 users,  load average: 9.78, 8.21, 5.28
#3 Updated by Josh Durgin almost 11 years ago
All the IOs in the log completed successfully, so I'm not sure what's causing these hangs. Does this work with the same guest using a raw image file instead of rbd?
#4 Updated by Chris Dunlop almost 11 years ago
Yes, it works when using a raw image file.
#5 Updated by Josh Durgin almost 11 years ago
So we'll need to dig deeper to figure out exactly what is hanging. There are a couple of general possibilities: a deadlock in qemu/librbd, or some kind of hang in the guest. For the first, could you attach to the qemu process with gdb after the hang occurs and post the output of 'thread apply all bt'?
The second possibility is trickier to debug. Since the kernel page cache is being used, the actual block device requests are separate from the syscalls made by fio.
Could you use blktrace in the guest to track what the underlying I/Os to the rbd device are from the guest's perspective?
#6 Updated by Josh Durgin almost 11 years ago
Err, I read that second part backwards - the hanging case is skipping the page cache, so an strace of the fio in the guest might be helpful.
#7 Updated by Chris Dunlop almost 11 years ago
- File strace.1543 added
- File strace.1544 added
- File strace.1548 added
- File strace.1550 added
Reducing the amount of data fio is testing with produces different behaviour. The default is size=4g. Below about size=100m I haven't seen it fail. Around 100m the success/failure is inconsistent. Using:
strace -o /tmp/strace -ff -tt fio --output /tmp/fio.out <(sed -e 's/size=4g/size=100m/' /usr/share/doc/fio/examples/iometer-file-access-server)
...the attached strace.15{43,44} are from a successful run, and strace.15{48,50} from an unsuccessful run immediately afterwards. Unfortunately it looks like the interesting bits are stuck in strace's output buffer, as strace.1550 is truncated.
Hmmm, I just rebooted the virtual and did the same tests. Once again, with size=100m, the first run succeeded and the second run failed.
After another reboot, I ran with size=10m 50 times without failure, then without a reboot, size=50m 30 times ok. Then, with size=90m, it got stuck with the % done oscillating from 86% up to 88% and the eta increasing. At that point the fio process was unkillable. I rebooted after several minutes of the same behaviour (the test would normally complete in a matter of seconds).
After a reboot, running with size=80m succeeded the first time and hung the second.
Hmmm, rbd_cache_size is set to 64m. Perhaps something to do with that? Another reboot, and ran this script:
for ((s=60; ; s+=1))
do
  for ((i=1; i<=10; i+=1))
  do
    echo -e "\ns=$s i=$i"
    fio --output /tmp/fio.out <(sed -e "s/size=4g/size=${s}m/" /usr/share/doc/fio/examples/iometer-file-access-server)
  done
done
This hung at "s=75 i=4". Another reboot, and it hung at "s=79 i=5".
Aha! I reduced rbd_cache_size to 32m, and after a reboot the script hung immediately at "s=60 i=1".
Another reboot, and ran the script again except with the start size set to "s=30". This hung at "s=49 i=1".
Another reboot and start at "s=30", and hung at "s=48 i=1".
So it looks like some issue with the rbd_cache?
#8 Updated by Josh Durgin almost 11 years ago
Thanks for investigating so much. It certainly sounds like the issue is with the cache. Could you get a log with debug_rbd=20:debug_ms=1:debug_objectcacher=20?
Presumably setting rbd_cache=false would make the problem go away entirely.
#9 Updated by Chris Dunlop almost 11 years ago
- File qemu.log.bz2 added
Attached, output of:
# virsh dumpxml test1 | grep rbd
<source protocol='rbd' name='rbd/test1:rbd_cache=1:rbd_cache_size=33554433:debug_rbd=20:debug_ms=1:debug_objectcacher=20:log_file=/tmp/qemu.log'>
# virsh start test1
# ssh test1
# cd /tmp
# fio --output /tmp/fio.out <(sed -e "s/size=4g/size=60m/" /usr/share/doc/fio/examples/iometer-file-access-server)
Jobs: 1 (f=1): [m] [100.0% done] [0K/0K /s] [0/0 iops] [eta 00m:00s]   ### hung here
With rbd_cache=false it's survived a few size=200m iterations, and the full size=4g is still running... will let you know how it goes.
#10 Updated by Chris Dunlop almost 11 years ago
FYI, 8 rounds of the full size=4g test passed without a hitch.
#11 Updated by Dan Mick almost 11 years ago
- Assignee set to Dan Mick
#12 Updated by Josh Durgin almost 11 years ago
- Assignee deleted (Dan Mick)
Unfortunately I haven't been able to reproduce this myself or track down the cause yet. Would it be possible to get a blktrace of the fio in addition to the debug logs from rbd during the same run?
#13 Updated by Chris Dunlop almost 11 years ago
- File blktrace.dat.bz2 added
- File qemu.log.bz2 added
Josh Durgin wrote:
Unfortunately I haven't been able to reproduce this myself or track down the cause yet. Would it be possible to get a blktrace of the fio in addition to the debug logs from rbd during the same run?
Attached are the blktrace.dat and qemu.log from:
host.0# virsh dumpxml test1 | grep rbd
<source protocol='rbd' name='rbd/test1:rbd_cache=1:rbd_cache_size=67108864:debug_rbd=20:debug_ms=1:log_file=/tmp/qemu.log'>
host.0# virsh start test1
host.0# ssh test1 'blktrace -d /dev/vda1 -o -' > /tmp/blktrace.dat
host.1# ssh test1
host.2# ssh test1 'fio <(sed -e 's/size=4g/size=100m/' /usr/share/doc/fio/examples/iometer-file-access-server)'
...then when the fio hangs, I use the shell on test1 to kill -TERM the blktrace:
test1.1# kill -TERM $(pidof blktrace)
host.0# virsh destroy test1
#14 Updated by Josh Durgin almost 11 years ago
- Status changed from New to 7
- Assignee set to Josh Durgin
I fixed a few bugs that could cause this in the wip-oc-hang branch. This is now merged into the next and master branches. Could you try with the next branch on the client side? Updating librbd should be enough. Instructions for dev testing packages are here in case you don't want to build it yourself: http://ceph.com/docs/master/install/debian/#add-development-testing-packages
#15 Updated by Chris Dunlop almost 11 years ago
OK, with ceph next @ bc32fc42 (and rbd_cache_size=33554433) it completed the full fio test several times, in contrast to never completing previously.
I think that's got it!
#17 Updated by bo zhou over 9 years ago
I met the same problem.
Fio hangs on linux-3.15.0 when using direct and sync.
ceph version: 0.80.1 (release)