QEMU VM hangs talking to RBD via librbd
I'm having a repeatable hang in QEMU (v2.4.1) talking to a Ceph (v0.95.4) RBD via librbd.
I've acquired a core file and have all the binaries compiled with -g3 -ggdb3 levels of debugging information.
Attached is a backtrace for all threads.
The VM process is still awake, but QEMU will not accept() connections on the monitor socket, and no network or RBD I/O happens.
This occurs after a few hours of running this VM. The guest is Microsoft Windows 2008 R2, connected to the RBD via IDE emulation in QEMU.
#2 Updated by Roy Keene almost 4 years ago
I can try with a newer Ceph, but #13726 seems to be dependent upon the RBD cache being disabled, whereas it is enabled for this VM:
2016-03-05 03:11:20.297+0000: starting up libvirt version: 1.2.21, qemu version: 2.4.1
LC_ALL=C PATH=/bin:/usr/bin QEMU_AUDIO_DRV=none /bin/qemu-system-x86_64 -name one-80 -S -machine pc-i440fx-2.4,accel=kvm,usb=off -m 8192 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 04898cdc-6332-4688-be9f-c92ba4661790 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-80/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:rbd/one-27-80-0:auth_supported=none,if=none,id=drive-ide0-0-0,format=raw,cache=writeback -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,fd=14,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=02:00:b2:fb:54:89,bus=pci.0,addr=0x3 -vnc 0.0.0.0:80 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -sandbox off -msg timestamp=on
qemu: terminating on signal 15 from pid 7419
2016-03-05 03:32:19.756+0000: shutting down
[client]
    rbd cache = true
    rbd cache size = 67108864
    rbd cache max dirty age = 60
Is caching being disabled an actual requirement for this bug, or just a coincidence for that reporter?
I will upgrade and test, but it's somewhat difficult since it's just this one VM that triggers it, and only periodically.
#4 Updated by Jason Dillaman almost 4 years ago
- Project changed from Ceph to rbd
- Category deleted
Thread 17 (LWP 30339):
#0  0x00007f3d978a6378 in futex_wait (private=0, expected=0, futex_word=0x55f2de63175c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=0, expected=0, futex_word=0x55f2de63175c) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_rwlock_wrlock_slow (rwlock=0x55f2de631750) at pthread_rwlock_wrlock.c:67
#3  0x00007f3d9cabd214 in RWLock::get_write (this=0x55f2de631748, lockdep=<optimized out>) at ./common/RWLock.h:96
#4  0x00007f3d9cab8745 in WLocker (lock=..., this=<synthetic pointer>) at ./common/RWLock.h:152
#5  librbd::ImageWatcher::reregister_watch (this=0x55f2de644240) at librbd/ImageWatcher.cc:1000
#6  0x00007f3d9ca98fda in operator() (a0=<optimized out>, this=<optimized out>) at /home/rkeene/devel/aurae/node/root/packages/ceph/workdir-129422584788648997066/.fakeroot/include/boost/function/function_template.hpp:767
#7  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at ./include/Context.h:461
#8  0x00007f3d9ca91919 in Context::complete (this=0x7f3b5002f6b0, r=<optimized out>) at ./include/Context.h:65
#9  0x00007f3d9ca91919 in Context::complete (this=0x7f3b5001c880, r=<optimized out>) at ./include/Context.h:65
#10 0x00007f3d9cb0eb08 in Finisher::finisher_thread_entry (this=0x55f2de6447e0) at common/Finisher.cc:59
#11 0x00007f3d978a1464 in start_thread (arg=0x7f3d80ef8700) at pthread_create.c:334
#12 0x00007f3d975df01d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 16 (LWP 30332):
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f3d9caa7bb3 in Wait (mutex=..., this=0x7f3d817f8cf0) at ./common/Cond.h:55
#2  librbd::ImageCtx::flush_cache (this=this@entry=0x55f2de6315b0) at librbd/ImageCtx.cc:675
#3  0x00007f3d9cac235e in librbd::_flush (ictx=ictx@entry=0x55f2de6315b0) at librbd/internal.cc:3276
#4  0x00007f3d9cac986e in librbd::ictx_refresh (ictx=ictx@entry=0x55f2de6315b0) at librbd/internal.cc:2121
#5  0x00007f3d9caca43e in librbd::ictx_check (ictx=ictx@entry=0x55f2de6315b0, owner_locked=owner_locked@entry=false) at librbd/internal.cc:1893
#6  0x00007f3d9cad405e in librbd::aio_write (ictx=0x55f2de6315b0, off=38234656768, len=2560, buf=0x7f3d5ff1a000 "33:46 fe80::990d:c01:6268:eb6d%14 POST /XXXXXXXX/XXXXXXXXXXX/XXXX/XXXXXXXXXXXXXXXX +REQID:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XX - XXXXXXXXXXXXXXXXXXXXXXXX%14 - 200 0 0 15\r\n2016-03-04 22:33:46 fe80::"..., c=0x7f3d5ff0dc40, op_flags=0) at librbd/internal.cc:3314
#7  0x00007f3d9ca91919 in Context::complete (this=0x7f3d5ff0ddf0, r=<optimized out>) at ./include/Context.h:65
#8  0x00007f3d9caae436 in _process (item=..., this=0x55f2de633a10) at ./common/WorkQueue.h:462
#9  _process (u=..., this=0x55f2de633a10) at ./common/WorkQueue.h:170
#10 ThreadPool::WorkQueueVal<std::pair<Context*, int>, std::pair<Context*, int> >::_void_process (this=0x55f2de633a10, handle=...) at ./common/WorkQueue.h:191
#11 0x00007f3d9cbb0327 in ThreadPool::worker (this=0x55f2de6334f0, wt=0x55f2de6337d0) at common/WorkQueue.cc:128
#12 0x00007f3d9cbb13e0 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:318
#13 0x00007f3d978a1464 in start_thread (arg=0x7f3d817fa700) at pthread_create.c:334
#14 0x00007f3d975df01d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
#5 Updated by Jason Dillaman almost 4 years ago
From the backtrace, Ceph is waiting for a cache flush to complete. The thread responsible for handling cache flushes does not appear to be blocked. Something did cause the image watcher to lose its connection to the OSD, so perhaps there are some network issues at play?
If this is repeatable, would it be possible to gather the rbd log from the affected process, from shortly before the hang until shortly after? I would like to see logs at the "debug rbd = 20" level. If you have the ASOK (admin socket) file configured, you can use "ceph --admin-daemon </path/to/admin/socket> config set debug_rbd 20" to bump the log level of a specific process.
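For reference, a minimal sketch of both approaches (the socket and log paths below are hypothetical examples and depend on how the hypervisor is configured):

    # Persistent, in ceph.conf on the hypervisor:
    [client]
        debug rbd = 20
        log file = /var/log/ceph/qemu-rbd.log          # example path
        admin socket = /var/run/ceph/$cluster-$name.asok

    # At runtime, against the already-running QEMU process (example socket path):
    ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config set debug_rbd 20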
#7 Updated by Jason Dillaman almost 4 years ago
- Status changed from Need More Info to 12
- Priority changed from Normal to Urgent
If an AIO request results in an image refresh due to a new snapshot creation, the cache will be flushed. When this occurs, if a flushed writeback request results in a CoW copy-up from the parent image, the copy-up read will be blocked on the thread pool by the original AIO request, which is itself still waiting for the flush.
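A rough sketch of the suspected sequence, reconstructed from the backtrace above (the thread roles are my reading of it, not verified):

    # tp_librbd worker: aio_write -> ictx_check -> ictx_refresh -> flush_cache
    #                   (blocks on a cond var until the cache flush completes)
    # writeback flush:  a dirty extent hits an object not yet written in the clone
    #                   -> a CoW copy-up read from the parent is issued
    #                   -> that read is queued onto the same tp_librbd work queue
    # deadlock:         the copy-up read can never run, because the worker that
    #                   would run it is the one blocked in flush_cache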
Setting 'rbd_non_blocking_aio' to false should act as a workaround for the issue, since the AIO request will be processed in the QEMU thread context instead of the librbd thread. Another workaround would be freezing I/O before taking the snapshot, to allow the writeback to complete before the snapshot is created. An extreme workaround would involve flattening the cloned image to avoid the CoW.
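For anyone hitting this, sketches of those three workarounds (untested here; the domain and image names are taken from the QEMU command line above, the snapshot name is made up, and the freeze/thaw commands assume a qemu-guest-agent channel, which that command line does not show):

    # 1. Process AIO in the caller's (QEMU) thread instead of the librbd thread pool,
    #    in ceph.conf on the hypervisor:
    [client]
        rbd non blocking aio = false

    # 2. Quiesce guest I/O before snapshotting so writeback drains first:
    virsh domfsfreeze one-80
    rbd snap create rbd/one-27-80-0@hypothetical-snap
    virsh domfsthaw one-80

    # 3. Flatten the clone so no CoW copy-up from a parent can occur:
    rbd flatten rbd/one-27-80-0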