Project

General

Profile

Bug #877

Qemu refuses to start with multiple RBD disks

Added by Wido den Hollander over 10 years ago. Updated about 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
librados
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When trying to start a VM with multiple RBD disks attached, this fails, I keep getting a "hunting for new mon"

root@client:~# /usr/bin/qemu-system-x86_64 -S -M pc-0.12 -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name sdgoij -uuid 2e718c9f-7cd5-405e-91ed-d613a58dafdb -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/sdgoij.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot c -drive file=rbd:rbd/sdgoij,if=none,id=drive-virtio-disk0,boot=on,format=rbd,cache=writeback -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=rbd:rbd/sdgoij-data,if=none,id=drive-virtio-disk1,format=rbd -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=rbd:rbd/sdgoij-swap,if=none,id=drive-virtio-disk2,format=rbd,cache=writeback -device virtio-blk-pci,bus=pci.0,addr=0x6,drive=drive-virtio-disk2,id=virtio-disk2 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
char device redirected to /dev/pts/1
2011-03-11 16:02:47.805726 7ff2aa32c780 librados: starting msgr at :/0
2011-03-11 16:02:47.805763 7ff2aa32c780 librados: starting objecter
2011-03-11 16:02:47.805913 7ff2aa32c780 librados: setting wanted keys
2011-03-11 16:02:47.805934 7ff2aa32c780 librados: calling monclient init
2011-03-11 16:02:47.810456 7ff2aa32c780 librados: waiting for osdmap
2011-03-11 16:02:47.811167 7ff2aa32c780 librados: init done
2011-03-11 16:02:47.812791 7ff2aa32c780 librados: Objecter returned from read r=0
2011-03-11 16:02:47.812824 7ff2aa32c780 librados: Returned length 112 less than original length 4096
2011-03-11 16:02:47.834562 7ff2aa32c780 librados: starting msgr at :/0
2011-03-11 16:02:47.834589 7ff2aa32c780 librados: starting objecter
2011-03-11 16:02:47.834676 7ff2aa32c780 librados: setting wanted keys
2011-03-11 16:02:47.834696 7ff2aa32c780 librados: calling monclient init
2011-03-11 16:02:47.835632 7ff2a05bc700 monclient: hunting for new mon
2011-03-11 16:02:50.835770 7ff2a05bc700 monclient: hunting for new mon
2011-03-11 16:02:53.835870 7ff2a05bc700 monclient: hunting for new mon
2011-03-11 16:02:56.835952 7ff2a05bc700 monclient: hunting for new mon
2011-03-11 16:02:59.836077 7ff2a05bc700 monclient: hunting for new mon
^C
root@client:~#
This is a VM with 3 disks:
  • rbd:rbd/sdgoij
  • rbd:rbd/sdgoij-data
  • rbd:rbd/sdgoij-swap

When starting the VM with just 'rbd:rbd/sdgoij' it starts fine:

root@client:~# /usr/bin/qemu-system-x86_64 -S -M pc-0.12 -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name sdgoij -uuid 2e718c9f-7cd5-405e-91ed-d613a58dafdb -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/sdgoij.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot c -drive file=rbd:rbd/sdgoij,if=none,id=drive-virtio-disk0,boot=on,format=rbd,cache=writeback -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
char device redirected to /dev/pts/1
2011-03-11 16:03:15.685592 7fd3840e1780 librados: starting msgr at :/0
2011-03-11 16:03:15.685627 7fd3840e1780 librados: starting objecter
2011-03-11 16:03:15.685813 7fd3840e1780 librados: setting wanted keys
2011-03-11 16:03:15.685834 7fd3840e1780 librados: calling monclient init
2011-03-11 16:03:15.688455 7fd3840e1780 librados: waiting for osdmap
2011-03-11 16:03:15.690975 7fd3840e1780 librados: init done
2011-03-11 16:03:15.692661 7fd3840e1780 librados: Objecter returned from read r=0
2011-03-11 16:03:15.692693 7fd3840e1780 librados: Returned length 112 less than original length 4096
^C
root@client:~#

I've ran it through GDB and found:

(gdb) bt
#0  0x00007ffff4195a75 in raise () from /lib/libc.so.6
#1  0x00007ffff41995c0 in abort () from /lib/libc.so.6
#2  0x00007ffff36588e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007ffff3656d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007ffff3656d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007ffff3656e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007ffff7311737 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at common/assert.cc:86
#7  0x00007ffff727025b in Thread::join (this=0x7fffe800cbc0, prval=<value optimized out>) at ./common/Thread.h:99
#8  0x00007ffff7282747 in SimpleMessenger::wait (this=0x7fffe800c790) at msg/SimpleMessenger.cc:2719
#9  0x00007ffff7253fd0 in librados::RadosClient::shutdown (this=0x7fffe8009640) at librados.cc:611
#10 0x00007ffff72540b9 in rados_shutdown (cluster=0x4f64) at librados.cc:2639
...
...

It seems not to be a Qemu nor librbd issue, but librados.

My Ceph version: 67b6187590a9e7876cb9e5aad67cac7032e46752

sdgoij.xml View - libvirt configuration (1.56 KB) Wido den Hollander, 03/11/2011 07:12 AM

Associated revisions

Revision 7670b264 (diff)
Added by Sage Weil about 10 years ago

msgr: let user explicitly set nonce

There will be problems if two messengers use the same entity_addr_t because
they are on the same ip and choose the same nonce (e.g., because they are
in the same process). Let the caller sort this out in whatever way it
finds most appropriate.

For libceph, librados, and csyn, all N million to the pid.

Fixes: #877
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil over 10 years ago

  • Project changed from qemu-rbd to Ceph
  • Priority changed from Normal to High

#2 Updated by Sage Weil over 10 years ago

  • Category set to librados
  • Target version set to v0.25.2

#3 Updated by Sage Weil over 10 years ago

  • translation missing: en.field_position set to 1

#4 Updated by Sage Weil over 10 years ago

  • translation missing: en.field_story_points set to 2
  • translation missing: en.field_position deleted (1)
  • translation missing: en.field_position set to 1

#5 Updated by Josh Durgin about 10 years ago

Here's a snippet of this occurring with objecter and messenger debugging:

2011-03-16 15:45:46.771273 7fd93a65c710 monclient: hunting for new mon
2011-03-16 15:45:46.771366 7fd93a65c710 -- 10.0.1.247:0/19402 mark_down 10.0.1.247:6789/0 -- pipe dne
2011-03-16 15:45:46.771437 7fd93a65c710 -- 10.0.1.247:0/19402 --> mon1 10.0.1.247:6790/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x1bae010
2011-03-16 15:45:46.772550 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 1 ==== auth_reply(proto 1 0 Success) v1 ==== 24+0+0 (1833333342 0 0) 0x1bae010 con 0x1bb8dd0
2011-03-16 15:45:46.772621 7fd933fff710 -- 10.0.1.247:0/19402 --> mon0 10.0.1.247:6789/0 -- mon_subscribe({monmap=0+}) v1 -- ?+0 0x1bba3d0
2011-03-16 15:45:46.772729 7fd943d39780 client4152.objecter maybe_request_map subscribing (onetime) to next osd map
2011-03-16 15:45:46.772772 7fd943d39780 -- 10.0.1.247:0/19402 --> mon0 10.0.1.247:6789/0 -- mon_subscribe({monmap=0+,osdmap=0}) v1 -- ?+0 0x1bba3d0
2011-03-16 15:45:46.772829 7fd943d39780 -- 10.0.1.247:0/19402 --> mon0 10.0.1.247:6789/0 -- mon_subscribe({monmap=0+,osdmap=0}) v1 -- ?+0 0x1bb9000
2011-03-16 15:45:46.773010 7fd93a65c710 -- 10.0.1.247:0/19402 <== mon1 10.0.1.247:6790/0 1 ==== auth_reply(proto 1 0 Success) v1 ==== 24+0+0 (3500932204 0 0) 0x1bb91e0 con 0x1bba7e0
2011-03-16 15:45:46.773059 7fd93a65c710 -- 10.0.1.247:0/19402 --> mon1 10.0.1.247:6790/0 -- mon_subscribe({monmap=2+}) v1 -- ?+0 0x1bba9c0
2011-03-16 15:45:46.773727 7fd93a65c710 -- 10.0.1.247:0/19402 <== mon1 10.0.1.247:6790/0 2 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2018052612 0 0) 0x1bba3d0 con 0x1bba7e0
2011-03-16 15:45:46.773874 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 2 ==== mon_map v1 ==== 469+0+0 (3811484495 0 0) 0x1bb93c0 con 0x1bb8dd0
2011-03-16 15:45:46.774122 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 3 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2018052612 0 0) 0x1bb9000 con 0x1bb8dd0
2011-03-16 15:45:46.774490 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 4 ==== mon_map v1 ==== 469+0+0 (3811484495 0 0) 0x1bb93c0 con 0x1bb8dd0
2011-03-16 15:45:46.774945 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 5 ==== osd_map(2,2) v1 ==== 1220+0+0 (2257482185 0 0) 0x1bb96b0 con 0x1bb8dd0
2011-03-16 15:45:46.774971 7fd933fff710 client4152.objecter handle_osd_map got epochs [2,2] > 0
2011-03-16 15:45:46.774988 7fd933fff710 client4152.objecter handle_osd_map decoding full epoch 2
2011-03-16 15:45:46.775301 7fd943d39780 client4152.objecter recalc_op_target tid 1 pgid 3.9774 acting [0]
2011-03-16 15:45:46.775418 7fd943d39780 client4152.objecter op_submit oid extra.rbd @3 [read 0~4096] tid 1 osd0
2011-03-16 15:45:46.775456 7fd943d39780 -- 10.0.1.247:0/19402 --> 10.0.1.247:6801/26709 -- osd_op(client4152.0:1 extra.rbd [read 0~4096] 3.9774) v1 -- ?+0 0x1bbb0f0 con 0x1bb9660
2011-03-16 15:45:46.775480 7fd943d39780 client4152.objecter 1 unacked, 0 uncommitted
2011-03-16 15:45:46.775731 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2018052612 0 0) 0x7fd934000930 con 0x1bb8dd0
2011-03-16 15:45:46.776182 7fd93a65c710 client4149.objecter ms_handle_reset on osd0
2011-03-16 15:45:46.776255 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 7 ==== mon_map v1 ==== 469+0+0 (3811484495 0 0) 0x7fd934000d10 con 0x1bb8dd0
2011-03-16 15:45:46.776308 7fd93a65c710 client4149.objecter reopen_session osd0 session, addr now osd0 10.0.1.247:6801/26709
2011-03-16 15:45:46.776349 7fd93a65c710 -- 10.0.1.247:0/19402 mark_down 0x1baef90 -- pipe dne
2011-03-16 15:45:46.776571 7fd93a65c710 client4149.objecter kick_requests for osd0
2011-03-16 15:45:46.776609 7fd93a65c710 client4149.objecter maybe_request_map subscribing (onetime) to next osd map
2011-03-16 15:45:46.776660 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 8 ==== osd_map(2,2) v1 ==== 1220+0+0 (2257482185 0 0) 0x7fd934000a50 con 0x1bb8dd0
2011-03-16 15:45:46.776702 7fd93a65c710 -- 10.0.1.247:0/19402 --> mon1 10.0.1.247:6790/0 -- mon_subscribe({monmap=2+,osdmap=3}) v1 -- ?+0 0x7fd92c000da0
2011-03-16 15:45:46.776740 7fd933fff710 client4152.objecter handle_osd_map ignoring epochs [2,2] <= 2
2011-03-16 15:45:46.776785 7fd93a65c710 client4149.objecter ms_handle_reset on osd0
2011-03-16 15:45:46.776833 7fd93a65c710 client4149.objecter reopen_session osd0 session, addr now osd0 10.0.1.247:6801/26709
2011-03-16 15:45:46.776857 7fd93a65c710 -- 10.0.1.247:0/19402 mark_down 0x7fd92c000b20 -- 0x7fd92c0008b0
2011-03-16 15:45:46.776887 7fd933fff710 -- 10.0.1.247:0/19402 <== mon0 10.0.1.247:6789/0 9 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2018052612 0 0) 0x1bbb0f0 con 0x1bb8dd0
2011-03-16 15:45:46.776973 7fd93a65c710 client4149.objecter kick_requests for osd0
2011-03-16 15:45:46.776996 7fd93a65c710 client4149.objecter maybe_request_map subscribing (onetime) to next osd map
2011-03-16 15:45:46.778402 7fd93a65c710 -- 10.0.1.247:0/19402 <== mon1 10.0.1.247:6790/0 3 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2018052612 0 0) 0x1baee40 con 0x1bba7e0
2011-03-16 15:45:46.778525 7fd933fff710 client4152.objecter ms_handle_reset on osd0
2011-03-16 15:45:46.778574 7fd933fff710 client4152.objecter reopen_session osd0 session, addr now osd0 10.0.1.247:6801/26709
2011-03-16 15:45:46.778590 7fd933fff710 -- 10.0.1.247:0/19402 mark_down 0x1bb9660 -- pipe dne
2011-03-16 15:45:46.778644 7fd933fff710 client4152.objecter kick_requests for osd0

#6 Updated by Josh Durgin about 10 years ago

The default nonce (pid) used by SimpleMessenger was the problem. We'll have to figure out what we want to set it to.

#7 Updated by Sage Weil about 10 years ago

  • Status changed from New to 7
  • Assignee set to Sage Weil

Pushed fixes for stable and master branch.

#8 Updated by Josh Durgin about 10 years ago

  • Status changed from 7 to Resolved

The fix worked.

Also available in: Atom PDF