Bug #4446

closed

librbd: crash from opensolaris vm

Added by Jeff Moskow about 11 years ago. Updated almost 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have about 60 VMs using RBD. All of them are working fine except for one that runs Solaris 10. By exporting and re-importing the image, I have verified that it is stored correctly in Ceph. I was advised to capture the errors and to create a Ceph debug log, both of which I'm attaching to this bug report.

Please let me know if you need additional information or if there are other tests that you'd like me to run.

Thanks,
Jeff


Files

client-errors.PNG (10.9 KB) Jeff Moskow, 03/14/2013 04:12 PM
rbd.648258.log (10.1 MB) Jeff Moskow, 03/14/2013 04:15 PM
Actions #1

Updated by Jeff Moskow about 11 years ago

I just realized that the log file didn't get attached.

Actions #2

Updated by Ian Colle about 11 years ago

  • Assignee set to Josh Durgin
  • Priority changed from Normal to High
Actions #3

Updated by Josh Durgin about 11 years ago

The error the guest is seeing is not actually a short read - it's just a bad error message from the solaris disk driver, when it's casting an error code (-1) to an unsigned read length.
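
To illustrate the kind of cast described above, here is a minimal Python sketch. It is not the Solaris driver's code; the helper name `as_unsigned` and the 32-bit and 64-bit widths are assumptions chosen for illustration. It shows how an error code of -1, reinterpreted as an unsigned length, turns into a very large number that can look like a nonsensical read size in an error message.

```python
def as_unsigned(value: int, bits: int = 32) -> int:
    """Reinterpret a signed integer as an unsigned integer of the given width.

    Hypothetical helper: mimics what a C cast to an unsigned type does
    by keeping only the low `bits` bits of the two's-complement value.
    """
    return value & ((1 << bits) - 1)


error_code = -1  # a generic failure return value

# The width of the driver's length field is an assumption; both are shown.
print(as_unsigned(error_code, 32))  # 4294967295
print(as_unsigned(error_code, 64))  # 18446744073709551615
```

A message that prints such a value as a byte count reads like a short or oversized read, even though the underlying event was simply an error return.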

I couldn't reproduce the problem with a fresh solaris vm, and I'm not sure of the source of the error that the guest sees. The log shows no error from the osds, and all the reads are within the device and seem to be returned successfully.

The guest reports sector 31701033 as the one with the error. The reads rbd sees go sequentially for a while up through sector 3170132, but rbd sees no read for the sector the guest reports. This suggests that either qemu is returning an error for that sector for some reason, or there's a sector offset I don't see here.

Can you verify that this solaris vm works under qemu without rbd involved, running off a regular file instead?

Actions #4

Updated by Jeff Moskow about 11 years ago

Yes, the same disk image boots and runs just fine from local storage (that's how we're running it now). FYI - here is what "uname -a" reports on that running image:

SunOS truffle 5.10 Generic i86pc i386 i86pc
Actions #5

Updated by Josh Durgin almost 11 years ago

Sorry for the delay. I've learned that Solaris sector counts could start at 1 instead of 0, so rbd did at least see the read (but seems to have completed it successfully).
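
As a rough sketch of why the numbering base matters when matching a guest-reported sector against the reads in the rbd log: the function below maps a sector number to a byte range under either convention. The 512-byte sector size and the name `byte_range` are assumptions for illustration, and whether this particular Solaris driver counts from 1 is only the hypothesis stated above.

```python
from typing import Tuple

SECTOR_SIZE = 512  # assumed sector size in bytes


def byte_range(sector: int, one_based: bool = False,
               sector_size: int = SECTOR_SIZE) -> Tuple[int, int]:
    """Return the [start, end) byte range covered by a sector number.

    If `one_based` is True, sector 1 is treated as the first sector on
    the device; otherwise sector 0 is the first.
    """
    index = sector - 1 if one_based else sector
    if index < 0:
        raise ValueError("sector number is below the first valid sector")
    start = index * sector_size
    return start, start + sector_size


# The same sector number lands one sector apart under the two conventions.
print(byte_range(10))                  # (5120, 5632)
print(byte_range(10, one_based=True))  # (4608, 5120)
```

Under these assumptions, a sector the guest reports as N in a 1-based scheme corresponds to the read that a 0-based log would show at sector N - 1.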

To figure out what's going on will probably require modifying qemu to add more tracing. If you can share the solaris vm image that would be ideal. If not, could you tell me what version of qemu you're using so I can give you a patch to add debug output?

Actions #6

Updated by Jeff Moskow almost 11 years ago

Thanks for continuing to pursue this.

I can send you the image (about 20GB), but it may have issues booting (dependencies on NIS, NFS, etc). If you'd rather have me test here, pveversion -v reports:

qemu-server: 2.3-20
pve-qemu-kvm: 1.4-10

If you do want me to send you the 20GB file, what is the best way to get it to you?

Actions #7

Updated by Josh Durgin almost 11 years ago

Even without NIS or NFS, I'm guessing it'll get far enough to hit the error. I'll email you a place to upload the image.

Actions #8

Updated by Dan Mick almost 11 years ago

As an ex-Sun employee, I can point out that this is an ancient version of S10; there've been many, many updates since then, some of which change the boot process significantly (I could barely remember how this one worked). It's entirely possible that performing an upgrade would make this work better; it would certainly make it work differently.

That's not to say that there might not be some generic problems to uncover, but just to say that the situation could be affected drastically outside the realm of Ceph/RBD.

Actions #9

Updated by Josh Durgin almost 11 years ago

I tried booting in several configurations, and couldn't get it to fail. I used ceph 0.56.4, and qemu 1.0 for ubuntu 12.04 as well as recent qemu master from git, with rbd caching enabled and disabled. My command line was:

qemu-system-i386 -enable-kvm -drive format=raw,file=rbd:volumes/solaris:id=volumes,cache=none,if=ide,id=root -m 1024 -vga cirrus -vnc 0.0.0.0:0

Since the original rbd log you generated showed no I/O errors at the rbd level, this may be an issue with the specific version of qemu that you have.
Could you try with the qemu-kvm package from ubuntu (or from upstream) using the same command line as above? It's also possible there are other devices that are changing the guest's behavior the way you're running it. What does your qemu/kvm command line end up being the way you normally run this vm?

Actions #10

Updated by Josh Durgin almost 11 years ago

  • Status changed from New to Need More Info
Actions #11

Updated by Jeff Moskow almost 11 years ago

I've upgraded to Cuttlefish and the newest Proxmox (KVM 1.4.1) and still have the same problem. The kvm command is:

/usr/bin/kvm -id 157 -chardev socket,id=qmp,path=/var/run/qemu-server/157.qmp,server,nowait -mon chardev=qmp,mode=control -vnc unix:/var/run/qemu-server/157.vnc,x509,password -pidfile /var/run/qemu-server/157.pid -daemonize -name truffle -smp sockets=1,cores=1 -nodefaults -boot menu=on -vga cirrus -k en-us -m 512 -cpuunits 1000 -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=tablet,bus=uhci.0,port=1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -drive file=rbd:rbd/vm-157-disk-1:mon_host=172.16.170.1\:6789\;172.16.170.2\:6789\;172.16.170.3\:6789:id=admin:auth_supported=cephx:keyring=/etc/pve/priv/ceph/cephcluster.keyring,if=none,id=drive-ide0,aio=native  -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=100 -drive if=none,id=drive-ide2,media=cdrom,aio=native -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200 -netdev type=tap,id=net0,ifname=tap157i0,script=/var/lib/qemu-server/pve-bridge -device rtl8139,romfile=,mac=92:C8:C2:14:5F:5F,netdev=net0,bus=pci.0,addr=0x12,id=net0 -machine accel=tcg

Actions #12

Updated by Jeff Moskow almost 11 years ago

I just upgraded to KVM 1.4.2 -- same problem.

Actions #13

Updated by Sage Weil over 10 years ago

  • Priority changed from High to Normal
Actions #14

Updated by Sage Weil over 10 years ago

  • Subject changed from Problem with RBD Virtual Disk/Proxmox to librbd: crash from opensolaris vm
  • Assignee deleted (Josh Durgin)
Actions #15

Updated by Jason Dillaman almost 8 years ago

  • Status changed from Need More Info to Closed

2 years since last update -- closing.
