Bug #14988: QEMU VM hangs talking to RBD via librbd - rbd - Ceph

Actions

Copy link

Bug #14988

closed

QEMU VM hangs talking to RBD via librbd

Added by Roy Keene about 8 years ago. Updated about 8 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Jason Dillaman

Target version:

% Done:

Source:

other

Tags:

librbd, qemu

Backport:

hammer, infernalis

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

Ceph - v0.94.5

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

I'm having a repeatable hang in QEMU (v2.4.1) talking to a Ceph (v0.95.4) RBD via librbd.

I've acquired a core file and have all the binaries compiled with -g3 -ggdb3-level of debugging information.

Attached is a backtrace for all threads.

The VM process is still awake, but QEMU will not accept() the monitor socket, no network or RBD I/O happens.

This occurs after a few hours of running on this VM. The VM guest is a Microsoft Windows 2008 R2 connected to RBD via IDE emulation in QEMU.

Files

gdb-log.txt (37 KB) gdb-log.txt

Roy Keene, 03/05/2016 08:39 PM

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Nathan Cutler about 8 years ago

@Roy: Can you reproduce this on 0.94.6? If not, it is probably http://tracker.ceph.com/issues/13726

Actions

Copy link

Updated by Roy Keene about 8 years ago

I can try with a newer Ceph, but #13726 seems to be dependant upon RBD cache being disabled but it should be enabled for this VM:

Libvirt log:


2016-03-05 03:11:20.297+0000: starting up libvirt version: 1.2.21, qemu version: 2.4.1
LC_ALL=C PATH=/bin:/usr/bin QEMU_AUDIO_DRV=none /bin/qemu-system-x86_64 -name one-80 -S -machine pc-i440fx-2.4,accel=kvm,usb=off -m 8192 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 04898cdc-6332-4688-be9f-c92ba4661790 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-80/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:rbd/one-27-80-0:auth_supported=none,if=none,id=drive-ide0-0-0,format=raw,cache=writeback -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,fd=14,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=02:00:b2:fb:54:89,bus=pci.0,addr=0x3 -vnc 0.0.0.0:80 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -sandbox off -msg timestamp=on
qemu: terminating on signal 15 from pid 7419
2016-03-05 03:32:19.756+0000: shutting down

ceph.conf:

[client]
rbd cache = true
rbd cache size = 67108864
rbd cache max dirty age = 60

Is the caching being disabled an actual requirement for this bug, or just a coincidence for the reporter ?

I will upgrade and test but it's somewhat difficult since it's just this one VM that triggers it and it's only periodically

Actions

Copy link

Updated by Nathan Cutler about 8 years ago

@Roy: Upgrading is recommended anyway, but before you take that step you can try setting "rbd_non_blocking_aio" to false. If that doesn't make the problem go away, then #13726 is not at play.

Actions

Copy link

Updated by Jason Dillaman about 8 years ago

Project changed from Ceph to rbd
Category deleted (~~qemu~~)

Thread 17 (LWP 30339):
#0  0x00007f3d978a6378 in futex_wait (private=0, expected=0, futex_word=0x55f2de63175c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=0, expected=0, futex_word=0x55f2de63175c) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_rwlock_wrlock_slow (rwlock=0x55f2de631750) at pthread_rwlock_wrlock.c:67
#3  0x00007f3d9cabd214 in RWLock::get_write (this=0x55f2de631748, lockdep=<optimized out>) at ./common/RWLock.h:96
#4  0x00007f3d9cab8745 in WLocker (lock=..., this=<synthetic pointer>) at ./common/RWLock.h:152
#5  librbd::ImageWatcher::reregister_watch (this=0x55f2de644240) at librbd/ImageWatcher.cc:1000
#6  0x00007f3d9ca98fda in operator() (a0=<optimized out>, this=<optimized out>)
    at /home/rkeene/devel/aurae/node/root/packages/ceph/workdir-129422584788648997066/.fakeroot/include/boost/function/function_template.hpp:767
#7  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at ./include/Context.h:461
#8  0x00007f3d9ca91919 in Context::complete (this=0x7f3b5002f6b0, r=<optimized out>) at ./include/Context.h:65
#9  0x00007f3d9ca91919 in Context::complete (this=0x7f3b5001c880, r=<optimized out>) at ./include/Context.h:65
#10 0x00007f3d9cb0eb08 in Finisher::finisher_thread_entry (this=0x55f2de6447e0) at common/Finisher.cc:59
#11 0x00007f3d978a1464 in start_thread (arg=0x7f3d80ef8700) at pthread_create.c:334
#12 0x00007f3d975df01d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 16 (LWP 30332):
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f3d9caa7bb3 in Wait (mutex=..., this=0x7f3d817f8cf0) at ./common/Cond.h:55
#2  librbd::ImageCtx::flush_cache (this=this@entry=0x55f2de6315b0) at librbd/ImageCtx.cc:675
#3  0x00007f3d9cac235e in librbd::_flush (ictx=ictx@entry=0x55f2de6315b0) at librbd/internal.cc:3276
#4  0x00007f3d9cac986e in librbd::ictx_refresh (ictx=ictx@entry=0x55f2de6315b0) at librbd/internal.cc:2121
#5  0x00007f3d9caca43e in librbd::ictx_check (ictx=ictx@entry=0x55f2de6315b0, owner_locked=owner_locked@entry=false) at librbd/internal.cc:1893
#6  0x00007f3d9cad405e in librbd::aio_write (ictx=0x55f2de6315b0, off=38234656768, len=2560,
    buf=0x7f3d5ff1a000 "33:46 fe80::990d:c01:6268:eb6d%14 POST /XXXXXXXX/XXXXXXXXXXX/XXXX/XXXXXXXXXXXXXXXX +REQID:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XX - XXXXXXXXXXXXXXXXXXXXXXXX%14 - 200 0 0 15\r\n2016-03-04 22:33:46 fe80::"..., c=0x7f3d5ff0dc40, op_flags=0) at librbd/internal.cc:3314
#7  0x00007f3d9ca91919 in Context::complete (this=0x7f3d5ff0ddf0, r=<optimized out>) at ./include/Context.h:65
#8  0x00007f3d9caae436 in _process (item=..., this=0x55f2de633a10) at ./common/WorkQueue.h:462
#9  _process (u=..., this=0x55f2de633a10) at ./common/WorkQueue.h:170
#10 ThreadPool::WorkQueueVal<std::pair<Context*, int>, std::pair<Context*, int> >::_void_process (this=0x55f2de633a10, handle=...) at ./common/WorkQueue.h:191
#11 0x00007f3d9cbb0327 in ThreadPool::worker (this=0x55f2de6334f0, wt=0x55f2de6337d0) at common/WorkQueue.cc:128
#12 0x00007f3d9cbb13e0 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:318
#13 0x00007f3d978a1464 in start_thread (arg=0x7f3d817fa700) at pthread_create.c:334
#14 0x00007f3d975df01d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Actions

Copy link

Updated by Jason Dillaman about 8 years ago

@Roy,

From the backtrace, Ceph is waiting for a cache flush to complete. The thread responsible for handling cache flushes does not appear to be blocked. Something did cause the image watcher to lose its connection to the OSD, so perhaps there are some network issues at play?

If this is repeatable, would it be possible to gather the rbd log from the affected process from a period shortly before the hang to a period shortly after? I would like to see "debug rbd = 20" level. If you have the ASOK file configured, you can use "ceph --admin-daemon </path/to/admin/socket> config set debug_rbd 20" to bump the log level of a specific process.

Actions

Copy link

Updated by Josh Durgin about 8 years ago

Status changed from New to Need More Info

Actions

Copy link

Updated by Jason Dillaman about 8 years ago

Status changed from Need More Info to 12
Priority changed from Normal to Urgent

If an AIO request results in an image refresh due to a new snapshot creation, the cache will be flushed. When this occurs, if there is a flushed writeback request which results in a CoW copy-up from the parent image, the read will be blocked on the thread pool by the original AIO request.

Setting 'rbd_non_blocking_aio' to false should act as a workaround for the issue since the AIO request will be processed in the QEMU thread context instead of the librbd thread. Another workaround would require freezing IO before taking the snapshot to allow the writeback to complete before the snapshot is created. An extreme workaround would involve flattening the cloned image to avoid the CoW.

Actions

Copy link