Bug #3286
librbd, kvm, async io hang
Description
Fio hangs in a linux-2.6.32 vm on librbd when using direct and libaio, with ceph at c8721b956.
Libvirt configured with direct rbd access like:
<devices>
  <emulator>/usr/bin/kvm</emulator>
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw'/>
    <auth username='test1'>
      <secret type='ceph' uuid='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'/>
    </auth>
    <source protocol='rbd' name='rbd/test1:rbd_cache=1:rbd_cache_size=67108864'>
      <host name='10.200.1.1' port='6789'/>
    </source>
    <target dev='vda' bus='virtio'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
  </disk>
w/ an image of Debian Squeeze, kernel 2.6.32-5-amd64.
Using fio config:
# egrep -v '^(#|$)' /usr/share/doc/fio/examples/iometer-file-access-server
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=1
size=4g
ioengine=libaio
iodepth=64
...fio hangs consistently (tested 3 times with reboots between):
# fio /usr/share/doc/fio/examples/iometer-file-access-server
Jobs: 1 (f=1): [m] [6.6% done] [0K/0K /s] [0/0 iops] [eta 08m:05s]   ### hung here
...whereas this completes successfully:
fio <(sed -e's/direct=1/direct=0/' -e's/ioengine=libaio/ioengine=sync/' /usr/share/doc/fio/examples/iometer-file-access-server)
Associated revisions
ObjectCacher: fix off-by-one error in split
This error left a completion that should have been attached
to the right BufferHead on the left BufferHead, which would
result in the completion never being called unless the buffers
were merged before its original read completed. This would cause
a hang in any higher level waiting for a read to complete.
The existing loop went backwards (using a forward iterator),
but stopped when the iterator reached the beginning of the map,
or when a waiter belonged to the left BufferHead.
If the first list of waiters should have been moved to the right
BufferHead, it was skipped because at that point the iterator
was at the beginning of the map, which was the main condition
of the loop.
Restructure the waiters-moving loop to go forward in the map instead,
so it's harder to make an off-by-one error.
Possibly-fixes: #3286
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
ObjectCacher: fix off-by-one error in split
This error left a completion that should have been attached
to the right BufferHead on the left BufferHead, which would
result in the completion never being called unless the buffers
were merged before its original read completed. This would cause
a hang in any higher level waiting for a read to complete.
The existing loop went backwards (using a forward iterator),
but stopped when the iterator reached the beginning of the map,
or when a waiter belonged to the left BufferHead.
If the first list of waiters should have been moved to the right
BufferHead, it was skipped because at that point the iterator
was at the beginning of the map, which was the main condition
of the loop.
Restructure the waiters-moving loop to go forward in the map instead,
so it's harder to make an off-by-one error.
Possibly-fixes: #3286
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 2e862f4d183d8b57b43b0777737886f18f68bf00)
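For illustration only, here is a minimal C++ sketch of the pattern the commit message describes; it is not the actual ObjectCacher code, and the names (BufferHead, waitfor_read, split_waiters, Context) are simplified stand-ins. The buggy variant walked the waiter map backwards with a forward iterator and stopped when the iterator reached begin(), so waiters stored at the first key that belonged on the right-hand BufferHead could be skipped; walking forward from the split offset visits exactly the entries that need to move.

#include <cstdint>
#include <list>
#include <map>

// Simplified stand-ins for the real librbd/ObjectCacher types.
struct Context {
  virtual void finish(int r) = 0;
  virtual ~Context() {}
};

struct BufferHead {
  // offset within the object -> completions waiting for that range to be read
  std::map<uint64_t, std::list<Context*>> waitfor_read;
};

// Move every waiter at offset >= split_off from 'left' to 'right'.
// Iterating forward from lower_bound(split_off) covers all entries that
// belong to the right-hand BufferHead, including the first entry in the
// map, which the original backwards loop could leave behind.
void split_waiters(BufferHead& left, BufferHead& right, uint64_t split_off) {
  auto p = left.waitfor_read.lower_bound(split_off);
  while (p != left.waitfor_read.end()) {
    std::list<Context*>& dest = right.waitfor_read[p->first];
    dest.splice(dest.end(), p->second);  // hand the waiting completions over
    p = left.waitfor_read.erase(p);      // and remove them from the left
  }
}

A completion left behind on the left BufferHead is never called once its read finishes, which matches the symptom reported below: an I/O that never completes and a fio process stuck in D state.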
History
#1 Updated by Josh Durgin almost 11 years ago
Is your cluster entirely healthy?
If so, could you add :debug_rbd=20:debug_ms=1:log_file=/path/writeable/by/qemu.log to the libvirt rbd config and attach the resulting log file (you might need to explicitly allow qemu to write to that file in apparmor)?
#2 Updated by Chris Dunlop almost 11 years ago
# ceph health
HEALTH_OK
qemu.log attached, after this sequence:
# virsh dumpxml smhtest1 | grep rbd
<source protocol='rbd' name='rbd/smhtest1:rbd_cache=1:rbd_cache_size=67108864:debug_rbd=20:debug_ms=1:log_file=/tmp/qemu.log'>
# virsh start smhtest1
# ssh smhtest1
# cd /tmp
# time fio --output /tmp/fio1.out /usr/share/doc/fio/examples/iometer-file-access-server
Jobs: 1 (f=1): [m] [9.9% done] [0K/0K /s] [0/0 iops] [eta 05m:19s]   #### hung here
In another window:
# ssh smhtest1
# ps lw $(pidof fio)
F   UID   PID  PPID PRI  NI   VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0     0  1604  1588  20   0 28912 13048 -      D+   pts/0      0:35 fio --output /tmp/fio1.out /usr/share/doc/fio/examples/iometer-file-access-server
1     0  1605  1604  20   0 36544  5892 -      Ds   ?          0:00 fio --output /tmp/fio1.out /usr/share/doc/fio/examples/iometer-file-access-server
# date
Thu Oct 11 12:42:23 EST 2012
# uptime
 12:42:23 up 21 min,  2 users,  load average: 9.78, 8.21, 5.28
#3 Updated by Josh Durgin almost 11 years ago
All the IOs in the log completed successfully, so I'm not sure what's causing these hangs. Does this work with the same guest using a raw image file instead of rbd?
#4 Updated by Chris Dunlop almost 11 years ago
Yes, it works when using a raw image file.
#5 Updated by Josh Durgin almost 11 years ago
So we'll need to dig deeper to figure out exactly what is hanging. There are a couple of general possibilities: a deadlock in qemu/librbd, or some kind of hang in the guest. For the first, could you attach to the qemu process with gdb after the hang occurs and post the output of 'thread apply all bt'?
The second possibility is trickier to debug. Since the kernel page cache is being used, the actual block device requests are separate from the syscalls made by fio.
Could you use blktrace in the guest to track what the underlying I/Os to the rbd device are from the guest's perspective?
#6 Updated by Josh Durgin almost 11 years ago
Err, I read that second part backwards - the hanging case is skipping the page cache, so an strace of the fio in the guest might be helpful.
#7 Updated by Chris Dunlop almost 11 years ago
- File strace.1543 added
- File strace.1544 added
- File strace.1548 added
- File strace.1550 added
Reducing the amount of data fio is testing with produces different behaviour. The default is size=4g. Below about size=100m I haven't seen it fail. Around 100m the success/failure is inconsistent. Using:
strace -o /tmp/strace -ff -tt fio --output /tmp/fio.out <(sed -e 's/size=4g/size=100m/' /usr/share/doc/fio/examples/iometer-file-access-server)
...the attached strace.15{43,44} are from a successful run, and strace.15{48,50} from an unsuccessful run immediately afterwards. Unfortunately it looks like the interesting bits are stuck in strace's output buffer, as strace.1550 is truncated.
Hmmm, I just rebooted the virtual and did the same tests. Once again, with size=100m, the first run succeeded and the second run failed.
After another reboot, I ran with size=10m 50 times without failure, then without a reboot, size=50m 30 times ok. Then, with size=90m, it got stuck with the % done oscillating from 86% up to 88% and the eta increasing. At that point the fio process was unkillable. I rebooted after several minutes of the same behaviour (the test would normally complete in a matter of seconds).
After a reboot, running with size=80m succeeded the first time and hung the second.
Hmmm, rbd_cache_size is set to 64m. Perhaps something to do with that? Another reboot, and ran this script:
for ((s=60; ; s+=1))
do
  for ((i=1; i<=10; i+=1))
  do
    echo -e "\ns=$s i=$i"
    fio --output /tmp/fio.out <(sed -e "s/size=4g/size=${s}m/" /usr/share/doc/fio/examples/iometer-file-access-server)
  done
done
This hung at "s=75 i=4". Another reboot, and it hung at "s=79 i=5".
Aha! I reduced rbd_cache_size to 32m, and after a reboot the script hung immediately at "s=60 i=1".
Another reboot, and ran the script again except with the start size set to "s=30". This hung at "s=49 i=1".
Another reboot and start at "s=30", and hung at "s=48 i=1".
So it looks like some issue with the rbd_cache?
#8 Updated by Josh Durgin almost 11 years ago
Thanks for investigating so much. It certainly sounds like the issue is with the cache. Could you get a log with debug_rbd=20:debug_ms=1:debug_objectcacher=20?
Presumably setting rbd_cache=false would make the problem go away entirely.
#9 Updated by Chris Dunlop almost 11 years ago
- File qemu.log.bz2 added
Attached, output of:
# virsh dumpxml test1 | grep rbd
<source protocol='rbd' name='rbd/test1:rbd_cache=1:rbd_cache_size=33554433:debug_rbd=20:debug_ms=1:debug_objectcacher=20:log_file=/tmp/qemu.log'>
# virsh start test1
# ssh test1
# cd /tmp
# fio --output /tmp/fio.out <(sed -e "s/size=4g/size=60m/" /usr/share/doc/fio/examples/iometer-file-access-server)
Jobs: 1 (f=1): [m] [100.0% done] [0K/0K /s] [0/0 iops] [eta 00m:00s]   ### hung here
With rbd_cache=false it's survived a few size=200m iterations, and the full size=4g is still running... will let you know how it goes.
#10 Updated by Chris Dunlop almost 11 years ago
FYI, 8 rounds of the full size=4g test passed without a hitch.
#11 Updated by Dan Mick almost 11 years ago
- Assignee set to Dan Mick
#12 Updated by Josh Durgin almost 11 years ago
- Assignee deleted (Dan Mick)
Unfortunately I haven't been able to reproduce this myself or track down the cause yet. Would it be possible to get a blktrace of the fio in addition to the debug logs from rbd during the same run?
#13 Updated by Chris Dunlop almost 11 years ago
- File blktrace.dat.bz2 added
- File qemu.log.bz2 added
Josh Durgin wrote:
Unfortunately I haven't been able to reproduce this myself or track down the cause yet. Would it be possible to get a blktrace of the fio in addition to the debug logs from rbd during the same run?
Attached are the blktrace.dat and qemu.log from:
host.0# virsh dumpxml test1 | grep rbd
<source protocol='rbd' name='rbd/test1:rbd_cache=1:rbd_cache_size=67108864:debug_rbd=20:debug_ms=1:log_file=/tmp/qemu.log'>
host.0# virsh start test1
host.0# ssh test1 'blktrace -d /dev/vda1 -o -' > /tmp/blktrace.dat
host.1# ssh test1
host.2# ssh test1 'fio <(sed -e 's/size=4g/size=100m/' /usr/share/doc/fio/examples/iometer-file-access-server)'
...then when the fio hangs, I use the shell on test1 to kill -TERM the blktrace:
test1.1# kill -TERM $(pidof blktrace)
host.0# virsh destroy test1
#14 Updated by Josh Durgin almost 11 years ago
- Status changed from New to 7
- Assignee set to Josh Durgin
I fixed a few bugs that could cause this in the wip-oc-hang branch. This is now merged into the next and master branches. Could you try with the next branch on the client side? Updating librbd should be enough. Instructions for dev testing packages are here in case you don't want to build it yourself: http://ceph.com/docs/master/install/debian/#add-development-testing-packages
#15 Updated by Chris Dunlop almost 11 years ago
OK, with ceph next @ bc32fc42 (and rbd_cache_size=33554433) it completed the full fio test several times, in contrast to never completing previously.
I think that's got it!
#17 Updated by bo zhou over 9 years ago
I met the same problem.
Fio hangs on linux-3.15.0 when using direct and sync.
ceph version: 0.80.1 (release)