Bug #3521

closed

windows 2008 kvm guest crashes with "floating point exception" when using rbd image with cache=writeback

Added by Corin Langosch over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)

Description

Host: ubuntu 12.10 amd64
Guest: windows 2008 r2
Ceph: 0.48.argonaut2 (the one from the ubuntu repos)

The crash is 100% reproducible on my test system using the following commands:

  1. kvm crashes with "floating point exception"
    kvm -cpu kvm64 -smp sockets=1,cores=4 -m 2048 -vnc 192.168.0.250:2 -usbdevice tablet -nodefaults -boot menu=on -vga cirrus -device ich9-ahci,id=ahci -drive id=drive-286,if=none,cache=writeback,aio=native,format=raw,media=disk,file=rbd:hdd/9686a9e6-a495-4ec2-9418-ab2fe87f11cd -device ide-hd,id=drive-device-286,bus=ahci.0,drive=drive-286
  2. export image to normal file
    rbd export rbd:hdd/9686a9e6-a495-4ec2-9418-ab2fe87f11cd /xfs-drive1/windows1.hdd
  3. works fine (file instead of rbd image)
    kvm -cpu kvm64 -smp sockets=1,cores=4 -m 2048 -vnc 192.168.0.250:2 -usbdevice tablet -nodefaults -boot menu=on -vga cirrus -device ich9-ahci,id=ahci -drive id=drive-286,if=none,cache=writeback,aio=native,format=raw,media=disk,file=/xfs-drive1/windows1.hdd -device ide-hd,id=drive-device-286,bus=ahci.0,drive=drive-286
  4. works fine (using rbd with cache=none)
    kvm -cpu kvm64 -smp sockets=1,cores=4 -m 2048 -vnc 192.168.0.250:2 -usbdevice tablet -nodefaults -boot menu=on -vga cirrus -device ich9-ahci,id=ahci -drive id=drive-286,if=none,cache=none,aio=native,format=raw,media=disk,file=rbd:hdd/9686a9e6-a495-4ec2-9418-ab2fe87f11cd -device ide-hd,id=drive-device-286,bus=ahci.0,drive=drive-286

The gdb backtrace shows (full output at http://pastie.org/5422842):

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fffd97fa700 (LWP 6998)]
0x00007ffff74a6e2e in librbd::AioCompletion::complete() () from /usr/lib/librbd.so.1
(gdb) backtrace
#0 0x00007ffff74a6e2e in librbd::AioCompletion::complete() () from /usr/lib/librbd.so.1
#1 0x00007ffff74a6f4a in librbd::AioCompletion::finish_adding_completions() () from /usr/lib/librbd.so.1
#2 0x00007ffff749f561 in librbd::aio_read(librbd::ImageCtx*, unsigned long, unsigned long, char*, librbd::AioCompletion*) () from /usr/lib/librbd.so.1
...

Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to 12
  • Source changed from Development to Community (user)

The problem is that qemu doesn't save the floating point state when calling into the storage library code, and the internal rbd instrumentation uses floats to track operation latency.

I'm introducing a 'perf' config option to disable the internal instrumentation as a workaround. I'm not sure what the cleanest fix is.

Josh, is adding 'perf = false' to ceph.conf enough, or does it need to be fed to librbd via the qemu argument?

Actions #2

Updated by Yehuda Sadeh over 11 years ago

Any reason why this has to be a float? We can keep elapsed time in nanoseconds instead.
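
To make the nanosecond suggestion concrete, here is a minimal sketch of a latency counter that stays in integer arithmetic throughout. The class name `LatencyCounter` and its methods are hypothetical and are not Ceph's perf counter interface; the sketch is also not thread-safe, so a real counter would need atomics or a lock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical latency counter (illustrative names, not the Ceph API).
// All bookkeeping uses 64-bit integers, so no floating point
// instructions are needed on the caller's thread.
class LatencyCounter {
public:
    using Clock = std::chrono::steady_clock;

    // Record one completed operation from its start and end time points.
    void record(Clock::time_point start, Clock::time_point end) {
        assert(end >= start);
        const auto ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
        record_ns(static_cast<std::uint64_t>(ns.count()));
    }

    // Record an operation whose duration is already known in nanoseconds.
    void record_ns(std::uint64_t ns) {
        total_ns_ += ns;
        ++count_;
    }

    std::uint64_t total_ns() const { return total_ns_; }
    std::uint64_t count() const { return count_; }

    // Integer mean latency; returns 0 if nothing has been recorded yet.
    std::uint64_t mean_ns() const {
        return count_ ? total_ns_ / count_ : 0;
    }

private:
    std::uint64_t total_ns_ = 0;
    std::uint64_t count_ = 0;
};
```

A caller would take `LatencyCounter::Clock::now()` before and after the operation and pass both to `record()`. Any conversion to seconds for display can then happen in whatever tool reads the counters, outside the I/O path.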

Actions #3

Updated by Sage Weil over 11 years ago

Yehuda Sadeh wrote:

Any reason why this has to be a float? We can keep elapsed time in nanoseconds instead.

Good point.. every user is using the float interface for durations. That's a bigger cleanup, though... lots of callers.

Actions #4

Updated by Josh Durgin over 11 years ago

It's more than just perfcounters that use floating point. Crush does too. It might be less likely to crash from crush than perfcounters though.

Actions #5

Updated by Yehuda Sadeh over 11 years ago

It's a matter of correctness, not of probability. I'm not sure, though, whether crush is called in the library caller's thread context or on a different thread (after requests have been queued).

Actions #6

Updated by Sage Weil over 11 years ago

Josh Durgin wrote:

It's more than just perfcounters that use floating point. Crush does too. It might be less likely to crash from crush than perfcounters though.

CRUSH uses fixed point... there's no floating support in the kernel.
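
For readers unfamiliar with the term, fixed point means storing a fractional value as a scaled integer so that only integer instructions are needed. The following is a generic 16.16 sketch with hypothetical helper names; it is not taken from the CRUSH source and only illustrates the idea.

```cpp
#include <cassert>
#include <cstdint>

// Generic 16.16 fixed point: the upper 16 bits hold the integer part,
// the lower 16 bits hold the fraction, so 1.0 is stored as 0x10000.
using fixed16 = std::uint32_t;
const fixed16 kOne = 0x10000;

// Build a fixed-point value from the ratio num/den (den must be non-zero).
inline fixed16 from_ratio(std::uint32_t num, std::uint32_t den) {
    assert(den != 0);
    return static_cast<fixed16>((static_cast<std::uint64_t>(num) << 16) / den);
}

// Multiply two fixed-point values; the 64-bit intermediate avoids overflow.
inline fixed16 fixed_mul(fixed16 a, fixed16 b) {
    return static_cast<fixed16>((static_cast<std::uint64_t>(a) * b) >> 16);
}

// Scale a plain integer by a fixed-point weight, truncating the fraction.
inline std::uint32_t scale(std::uint32_t value, fixed16 weight) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(value) * weight) >> 16);
}
```

For example, `from_ratio(3, 4)` yields `0xC000`, and `scale(1000, from_ratio(3, 4))` evaluates to 750 without touching the floating point unit.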

Actions #7

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to 7

I've pushed a branch wip-perf that avoids floating point... can you give it a go?

Actions #8

Updated by Sage Weil over 11 years ago

merged into next

Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Resolved

Actions #10

Updated by Corin Langosch over 11 years ago

Sorry for the delay. I just installed 0.55 and it doesn't crash anymore (I didn't change the ceph.conf at all). Thanks for the quick fix! :)
