Bug #868

librbd::AioCompletion::add_block_completion causes Qemu to crash

Added by Wido den Hollander over 8 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: librbd
Target version:
Start date: 03/05/2011
Due date:
% Done: 0%
Spent time:
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I've updated my Qemu to the latest code and am experiencing crashes on my system.

ceph version 0.25~rc (commit:46b01f4a78725642366eefe0658b368f959f45c8)
 1: (()+0x13aa0c) [0x7f086da9ca0c]
 2: (()+0xf8f0) [0x7f086e12d8f0]
 3: (gsignal()+0x35) [0x7f086a8fda75]
 4: (abort()+0x180) [0x7f086a9015c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f0869dc08e5]
 6: (()+0xcad16) [0x7f0869dbed16]
 7: (()+0xcad43) [0x7f0869dbed43]
 8: (()+0xcae3e) [0x7f0869dbee3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x397) [0x7f086da78257]
 10: (librbd::AioCompletion::add_block_completion(librbd::AioBlockCompletion*)+0x236) [0x7f086dd14de6]
 11: (librbd::aio_write(librbd::ImageCtx*, long, unsigned long, char const*, librbd::AioCompletion*)+0x20d) [0x7f086dd0cefd]
 12: /usr/bin/qemu-system-x86_64() [0x4596a2]
 13: /usr/bin/qemu-system-x86_64() [0x4596d0]
 14: /usr/bin/qemu-system-x86_64() [0x43ab73]
 15: /usr/bin/qemu-system-x86_64() [0x43ba1c]
 16: /usr/bin/qemu-system-x86_64() [0x41edd5]
 17: /usr/bin/qemu-system-x86_64() [0x41f44c]
 18: /usr/bin/qemu-system-x86_64() [0x42bf1d]
 19: /usr/bin/qemu-system-x86_64() [0x42c049]
 20: /usr/bin/qemu-system-x86_64() [0x42d441]
 21: (()+0x69ca) [0x7f086e1249ca]
 22: (clone()+0x6d) [0x7f086a9b070d]

I was just resizing an ext4 filesystem inside the VM, but I also saw this when simply updating the kernel via apt.

My librbd is from the "next" branch ( 46b01f4a78725642366eefe0658b368f959f45c8 ).

History

#1 Updated by Wido den Hollander over 8 years ago

I've attached gdb to get a backtrace of what is going on:

(gdb) bt
#0  0x00007f77dae47a75 in raise () from /lib/libc.so.6
#1  0x00007f77dae4b5c0 in abort () from /lib/libc.so.6
#2  0x00007f77da30a8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007f77da308d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007f77da308d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007f77da308e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007f77ddfc2257 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, 
    line=<value optimized out>, func=<value optimized out>) at common/assert.cc:86
#7  0x00007f77de25ede6 in Mutex::Lock (this=0x7f77cd4c69f0, aio_completion=<value optimized out>) at ./common/Mutex.h:118
#8  librbd::AioCompletion::get (this=0x7f77cd4c69f0, aio_completion=<value optimized out>) at librbd.cc:182
#9  librbd::AioCompletion::add_block_completion (this=0x7f77cd4c69f0, aio_completion=<value optimized out>) at librbd.cc:164
#10 0x00007f77de256efd in librbd::aio_write (ictx=0x2e77050, off=283832725504, len=1552384, 
    buf=0x7f7778c61200 "\300;9\230", c=0x7f77cd4c69f0) at librbd.cc:1237
#11 0x00000000004596a2 in ?? ()
#12 0x00000000004596d0 in ?? ()
#13 0x000000000043ab73 in ?? ()
#14 0x000000000043ba1c in ?? ()
#15 0x000000000041edd5 in ?? ()
#16 0x000000000041f44c in ?? ()
#17 0x000000000042bf1d in ?? ()
#18 0x000000000042c049 in ?? ()
#19 0x000000000042d441 in ?? ()
#20 0x00007f77de66e9ca in start_thread () from /lib/libpthread.so.0
#21 0x00007f77daefa70d in clone () from /lib/libc.so.6
#22 0x0000000000000000 in ?? ()
(gdb)

#2 Updated by Sage Weil over 8 years ago

  • Assignee set to Josh Durgin

#3 Updated by Josh Durgin over 8 years ago

Finally managed to reproduce this using virtio and cache=writeback. I'm not sure whether it occurs in other configurations.
It happened while writing random data to files on a loopback-mounted ext4 filesystem in a VM.

Full backtrace:

#0  0x00007faeabb1fe2b in raise () from /lib/libpthread.so.0
#1  0x00007faeab48d9e8 in handle_fatal_signal (signum=6) at common/signal.cc:78
#2  <signal handler called>
#3  0x00007faeaa7e0165 in raise () from /lib/libc.so.6
#4  0x00007faeaa7e2f70 in abort () from /lib/libc.so.6
#5  0x00007faea9ca1dc5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#6  0x00007faea9ca0166 in ?? () from /usr/lib/libstdc++.so.6
#7  0x00007faea9ca0193 in std::terminate() () from /usr/lib/libstdc++.so.6
#8  0x00007faea9ca028e in __cxa_throw () from /usr/lib/libstdc++.so.6
#9  0x00007faeab4694cf in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at common/assert.cc:86
#10 0x00007faeab708816 in Mutex::Lock (this=0x7fada4ac4320, aio_completion=<value optimized out>) at ./common/Mutex.h:118
#11 librbd::AioCompletion::get (this=0x7fada4ac4320, aio_completion=<value optimized out>) at librbd.cc:182
#12 librbd::AioCompletion::add_block_completion (this=0x7fada4ac4320, aio_completion=<value optimized out>) at librbd.cc:164
#13 0x00007faeab700d2d in librbd::aio_write (ictx=0x25baf60, off=901758976, len=1036288, 
    buf=0x7fada4895200 "\261\r\365@\311\333\065\f\302\205?\340/Y0**R?^\271\066\203z\216`\307\006\301\337\004\bw\343*+", <incomplete sequence \302>, c=0x7fada4ac4320) at librbd.cc:1237
#14 0x0000000000468df0 in rbd_aio_rw_vector (bs=0x25aa830, sector_num=1761248, qiov=0x7fadcd8115c0, nb_sectors=2024, cb=0x43674a <multiwrite_cb>, opaque=0x7fadcd811ca0, write=1) at block/rbd.c:543
#15 0x0000000000468f29 in qemu_rbd_aio_writev (bs=0x25aa830, sector_num=1761248, qiov=0x7fadcd8115c0, nb_sectors=2024, cb=0x43674a <multiwrite_cb>, opaque=0x7fadcd811ca0) at block/rbd.c:569
#16 0x0000000000436594 in bdrv_aio_writev (bs=0x25aa830, sector_num=1761248, qiov=0x7fadcd8115c0, nb_sectors=2024, cb=0x43674a <multiwrite_cb>, opaque=0x7fadcd811ca0) at block.c:2125
#17 0x0000000000436ec4 in bdrv_aio_multiwrite (bs=0x25aa830, reqs=0x7fffc52f19c0, num_reqs=2) at block.c:2339
#18 0x0000000000418659 in virtio_submit_multiwrite (bs=0x25aa830, mrb=0x7fffc52f19c0) at qemu-kvm/hw/virtio-blk.c:253
#19 0x0000000000418ccf in virtio_blk_handle_output (vdev=0x7fae9046d4b0, vq=0x7fae9046d590) at qemu-kvm/hw/virtio-blk.c:383
#20 0x0000000000572793 in virtio_queue_notify_vq (vq=0x7fae9046d590) at qemu-kvm/hw/virtio.c:583
#21 0x00000000005727f5 in virtio_queue_notify (vdev=0x7fae9046d4b0, n=0) at qemu-kvm/hw/virtio.c:590
#22 0x00000000004204a0 in virtio_ioport_write (opaque=0x7fae9046cb90, addr=16, val=0) at qemu-kvm/hw/virtio-pci.c:358
#23 0x0000000000420a94 in virtio_pci_config_writew (opaque=0x7fae9046cb90, addr=16, val=0) at qemu-kvm/hw/virtio-pci.c:502
#24 0x0000000000482d21 in ioport_write (index=1, address=49296, data=0) at ioport.c:81
#25 0x00000000004836b7 in cpu_outw (addr=49296, val=0) at ioport.c:271
#26 0x0000000000428742 in kvm_handle_io (port=49296, data=0x7fae97b79000, direction=1, size=2, count=1) at qemu-kvm/kvm-all.c:802
#27 0x0000000000428b5b in kvm_cpu_exec (env=0x7fae900059d0) at qemu-kvm/kvm-all.c:943
#28 0x0000000000502cd1 in cpu_x86_exec (env1=0x7fae900059d0) at qemu-kvm/cpu-exec.c:344
#29 0x0000000000407da7 in qemu_cpu_exec (env=0x7fae900059d0) at qemu-kvm/cpus.c:896
#30 0x0000000000407eae in cpu_exec_all () at qemu-kvm/cpus.c:924
#31 0x0000000000568ff2 in main_loop () at qemu-kvm/vl.c:1419
#32 0x000000000056d00d in main (argc=48, argv=0x7fffc52f2698, envp=0x7fffc52f2820) at qemu-kvm/vl.c:3136

I'm still looking into the root cause.

#4 Updated by Yehuda Sadeh over 8 years ago

Hmm... looking at librbd::aio_write(), we loop through the blocks, creating a completion and triggering a write for each. My guess is that the librbd::AioCompletion is being released prematurely due to a bug: we don't hold an extra reference on it during that loop, so if responses come back for all the requests sent so far (even though we haven't sent everything yet), the refcount drops to zero and the completion is freed. A simple c->get() before the loop and c->put() after it might fix that.

#5 Updated by Josh Durgin over 8 years ago

I noticed that prior to crashing like this, qemu would consume lots of memory. I fixed a bunch of memory leaks in the aio handling, which seems to have fixed the issue (the last relevant commit was 52542be8e038c3d88a3e850f4e141e20fb31987d). I'm running I/O in a VM overnight in case any slow leaks remain.

#6 Updated by Wido den Hollander over 8 years ago

Seems to be working here too. I didn't notice the memory usage you describe, but I could have missed it.

I just wrote 20GB inside a VM with dd; no crashes whatsoever, so this seems to be resolved.

#7 Updated by Wido den Hollander over 8 years ago

Let me take that back. While writing 20GB went fine, 100GB did not.

In my logs I found:

common/Mutex.h: In function 'void Mutex::Lock(bool)', in thread '0x7fca6e9d9700'
common/Mutex.h: 118: FAILED assert(r == 0)
common/Mutex.h: In function 'void Mutex::Lock(bool)', in thread '0x7fca6e9d9700'
common/Mutex.h: 118: FAILED assert(r == 0)
 ceph version 0.26~rc (commit:4ee75a881ec637e2b0c5b74b16b1e44ac710707c)
 1: (librbd::AioCompletion::add_block_completion(librbd::AioBlockCompletion*)+0x236) [0x7fca799c6fd6]
 2: (librbd::aio_write(librbd::ImageCtx*, unsigned long, unsigned long, char const*, librbd::AioCompletion*)+0x20d) [0x7fca799c308d]
..
..

#8 Updated by Sage Weil over 8 years ago

  • Target version set to v0.27
  • Position set to 322

#9 Updated by Sage Weil over 8 years ago

  • Story points set to 1
  • Position deleted (323)
  • Position set to 6

#10 Updated by Wido den Hollander over 8 years ago

I've got my new cluster running now with all the latest code (Ceph and Qemu).

I've been doing some heavy writing in a VM today, but so far (~100GB written) I haven't seen any Qemu crashes.

#11 Updated by Sage Weil over 8 years ago

  • Status changed from New to Can't reproduce
